tdebatty / spark-knn-graphs

Spark algorithms for building k-nn graphs
MIT License

Handling large amounts of data #6

Closed zhassanbey closed 6 years ago

zhassanbey commented 7 years ago

The main problem in k-NN calculation is the amount of data that needs to be broadcast to each node. For instance, if you have 20 M vectors (~1.3 TB), it is really hard to broadcast them all into memory. How would you handle this situation? Thank you in advance.
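
For scale, 1.3 TB over 20 M vectors is roughly 65 KB per vector, and a plain broadcast ships all of it to every executor. Below is a minimal Java sketch of that naive broadcast pattern, only to pinpoint where the memory blows up; `loadVectors()` is a hypothetical placeholder, not part of this project.

```java
import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class NaiveBroadcastKnn {

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "naive-knn");

        // Hypothetical loader; stands in for reading the real dataset.
        List<double[]> allVectors = loadVectors();

        // The bottleneck: the entire dataset is serialized on the driver
        // and shipped to every executor. At ~1.3 TB (20 M vectors, roughly
        // 65 KB each on average), this cannot fit in a single JVM heap.
        Broadcast<List<double[]>> bc = sc.broadcast(allVectors);

        // ... each partition would then scan bc.value() to find the
        // k nearest neighbors of its own points ...
    }

    private static List<double[]> loadVectors() {
        throw new UnsupportedOperationException("placeholder for real input");
    }
}
```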

tdebatty commented 7 years ago

Do you mean for k-nn graphs in general, or specifically for this project?

zhassanbey commented 7 years ago

Hi, thank you for the reply. I mean specifically for this project, but if you have general ideas to share, that would be great too.

tdebatty commented 7 years ago

Hello,

I don't know whether you have found a solution to your problem yet. Just thought of this now: Spark was actually designed for processing datasets that fit in the distributed memory of the cluster. Hence, for your kind of problem, I would recommend the following:

Hope this can help...

Best regards,

zhassanbey commented 7 years ago

Hi tdebatty, thank you for the reply. I am currently handling it by splitting the input into chunks. I first "chunkify" the files, then find the k-nn for each chunk by broadcasting only that chunk. Once the k-nn is computed for the current chunk of files in the iteration, I persist the result with the MEMORY_AND_DISK storage level. When all the chunks have been processed, I merge the partial results. The idea works, more or less, but I am looking for a more elegant solution.
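
For reference, a minimal Java sketch of the chunked strategy described above. It illustrates the idea, not code from this repository: the id-modulo chunk assignment, the `loadVectors()` helper, and the values of `k` and `numChunks` are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.storage.StorageLevel;

import scala.Tuple2;

public class ChunkedKnn {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Keep the k neighbors with the smallest distance from two partial lists.
    static List<Tuple2<Long, Double>> mergeTopK(
            List<Tuple2<Long, Double>> a, List<Tuple2<Long, Double>> b, int k) {
        List<Tuple2<Long, Double>> all = new ArrayList<>(a);
        all.addAll(b);
        all.sort((p, q) -> Double.compare(p._2(), q._2()));
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        final int k = 10;          // illustrative values
        final int numChunks = 100;
        JavaSparkContext sc = new JavaSparkContext("local[*]", "chunked-knn");

        // (id, vector) pairs; loadVectors() is a placeholder for real input.
        JavaPairRDD<Long, double[]> points = sc.parallelize(loadVectors())
                .zipWithIndex()
                .mapToPair(t -> new Tuple2<>(t._2(), t._1()));

        JavaPairRDD<Long, List<Tuple2<Long, Double>>> knn = null;

        for (int c = 0; c < numChunks; c++) {
            final long chunkId = c;

            // Only one chunk is collected and broadcast at a time, so no
            // node ever has to hold the full dataset in memory.
            List<Tuple2<Long, double[]>> chunk = points
                    .filter(t -> t._1() % numChunks == chunkId)
                    .collect();
            Broadcast<List<Tuple2<Long, double[]>>> bc = sc.broadcast(chunk);

            // Partial k-NN: every point against the current chunk only.
            JavaPairRDD<Long, List<Tuple2<Long, Double>>> partial = points
                    .mapToPair(t -> {
                        List<Tuple2<Long, Double>> nn = new ArrayList<>();
                        for (Tuple2<Long, double[]> other : bc.value()) {
                            if (!other._1().equals(t._1())) {
                                nn.add(new Tuple2<>(other._1(),
                                        distance(t._2(), other._2())));
                            }
                        }
                        return new Tuple2<>(t._1(),
                                mergeTopK(nn, new ArrayList<>(), k));
                    })
                    .persist(StorageLevel.MEMORY_AND_DISK());

            partial.count(); // force materialization of this chunk's result
            bc.unpersist();  // the chunk's broadcast is no longer needed

            // Merge with the previous chunks' results, keeping the k best.
            knn = (knn == null) ? partial
                    : knn.join(partial).mapToPair(t -> new Tuple2<>(t._1(),
                            mergeTopK(t._2()._1(), t._2()._2(), k)));
        }
        // `knn` now holds, for every point id, its exact k nearest neighbors.
    }

    private static List<double[]> loadVectors() {
        throw new UnsupportedOperationException("placeholder for real input");
    }
}
```

One caveat with this pattern: without forcing an action inside the loop, Spark's lazy evaluation would keep every chunk's broadcast alive at once, and the join lineage grows with the number of chunks. Forcing materialization per chunk (as above) and checkpointing the merged RDD every few iterations keeps both bounded.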