lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License
891 stars 105 forks source link

Run on Dask #63

Open tomwhite opened 5 years ago

tomwhite commented 5 years ago

For very large datasets it might be beneficial to run on Dask (although there is a lot of mileage in machines with lots of cores and RAM).

Joblib supports Dask as a backend, however the parallel implementation of pynndescent keeps all heap updates in memory, which may be limiting. A more distributed approach is to use Dask bags to shuffle updates across machines - this is slower but should support larger datasets. I made an initial (incomplete) attempt at this here: https://github.com/tomwhite/pynndescent/tree/dask

lmcinnes commented 5 years ago

I'll try to take a look when I get some time. I have also had a chat with Matt Rocklin about a distributed pynndescent on dask, and he had some good ideas for minimizing shuffles (at the cost of more iterations). Depending on performance costs and how distributed the arrays are that may also be worth pursuing.

jamestwebber commented 3 years ago

Hello again both of you! ๐Ÿ‘‹

Is this still an issue? I've recently been implementing some workflows in Dasks and distributed kNN would be really great because it's one of the slower steps.

lmcinnes commented 3 years ago

The parallelism has since been reworked and runs via numba now. With that said a dask distributed version of pynndescent would be an awesome thing to have -- allowing users to potentially work with datasets larger than memory. I would be happy to try and work with someone with dask expertise who wanted to implement such a thing.

jamestwebber commented 3 years ago

I started checking out @tomwhite's branch but I see that it's pretty outdated at this pointโ€“it uses some functions from the pynndescent.threaded module which is no longer around due to the numba refactor that you mention. I see there's a threaded_rp_trees module with a note about dask in the comment.

I'm far from a dask expert but I have been talking to some of them recently, I could try to get a PR going with some help from @lmcinnes if you can provide guidance?

lmcinnes commented 3 years ago

Hi @jamestwebber, and thanks for being willing to look into this. The threaded rp-trees remain around because it was a template of a plan for how to handle rp-trees if the input data was (potentially distributed) dask array. I think the basic idea there is amenable to a dask style approach. The rest would need some discussion to work out the best way to handle distributed data so that we can try to retain some locality in computation and not just have giant data shuffles going on the whole time. I would happy to try and discuss some ideas for that in more detail soon.

jamestwebber commented 3 years ago

Sounds great. If I could figure out how to contact you, I would try to set up a time (unless you prefer to discuss via Github). But you're very elusive. ๐Ÿ˜‚

lmcinnes commented 3 years ago

you can reach me at leland DOT mcinnes AT gmail DOT com

murilocn commented 3 years ago

Hi. I'm really interested in collaborating on a Dask version of PyNNDescent. Did you develop the initial project?

jamestwebber commented 3 years ago

I wish! Unfortunately I've done basically nothing on this project because of too many other priorities.

But it's still a bottleneck for my workflow so I'd like to solve it. Maybe having a collaborator would provide the spark! I'll send you an email. ๐Ÿ™‚