lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License

PyNNDescent on GPU #136

Open sgbaird opened 3 years ago

sgbaird commented 3 years ago

@lmcinnes

Saw your Twitter response back in January about GPU implementations.

I've been working on a Numba cuda.jit implementation of the Wasserstein distance. I ran into lots of issues replacing functions like np.zeros and np.concatenate (anything that dynamically allocates memory), and ended up writing manual replacements (C-style for-loop implementations) for things like np.argsort. It's not so bad for Euclidean; the main issue is Numba's (very) limited NumPy support on the CUDA target. I only recently started learning Numba, so an expert might have a different opinion or be able to do this more easily. I was able to get Euclidean distance running on the GPU with Numba, but I'm still running into various issues with something more involved like Wasserstein.
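For reference, here is a minimal sketch of the kind of cuda.jit Euclidean kernel described above. The kernel name, block/thread sizes, and data shapes are illustrative assumptions, not code from this thread; the key constraint it demonstrates is that the output array must be allocated on the host, since the kernel cannot allocate memory dynamically.

```python
import math
import numpy as np
from numba import cuda

@cuda.jit
def euclidean_kernel(X, Y, out):
    # One thread per (i, j) entry of the distance matrix.
    i, j = cuda.grid(2)
    if i < X.shape[0] and j < Y.shape[0]:
        acc = 0.0
        for k in range(X.shape[1]):
            d = X[i, k] - Y[j, k]
            acc += d * d
        out[i, j] = math.sqrt(acc)

# Host side: allocate the output up front and launch a 2-D grid.
X = np.random.rand(512, 16).astype(np.float32)
Y = np.random.rand(512, 16).astype(np.float32)
out = cuda.device_array((X.shape[0], Y.shape[0]), dtype=np.float32)
threads = (16, 16)
blocks = ((X.shape[0] + 15) // 16, (Y.shape[0] + 15) // 16)
euclidean_kernel[blocks, threads](cuda.to_device(X), cuda.to_device(Y), out)
dists = out.copy_to_host()
```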

For example, I haven't been able to get local array allocation working with cuda.local.array, and I've generally run into problems with slicing. An alternative is to pass everything in as arguments that are initialized at the top level, which is more work than just writing overloaded NumPy functions. That means if I want to operate on two vectors and their concatenated version, I need to pass all three in as arguments at the host (CPU) level. If I use np.diff, which returns something of a different length, I also need to pass that result buffer in as an argument. The code starts to look a bit ugly, especially with nested functions: for Wasserstein I'd probably need about five additional input arguments. I'd rather get it working with cuda.local.array first and check the performance, but I might need to jump straight to this option.
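To make the workaround concrete, here is a hedged sketch of the buffer-passing pattern described above: scratch arrays that np.concatenate and np.diff would normally produce are preallocated on the host and handed to the kernel as extra arguments. All names, shapes, and the placeholder reduction are illustrative assumptions.

```python
import numpy as np
from numba import cuda

@cuda.jit
def pairwise_kernel(U, V, concat_buf, diff_buf, out):
    i = cuda.grid(1)
    if i < U.shape[0]:
        n, m = U.shape[1], V.shape[1]
        # Manual np.concatenate: copy U[i] then V[i] into the scratch row.
        for k in range(n):
            concat_buf[i, k] = U[i, k]
        for k in range(m):
            concat_buf[i, n + k] = V[i, k]
        # Manual np.diff over the concatenated row.
        for k in range(n + m - 1):
            diff_buf[i, k] = concat_buf[i, k + 1] - concat_buf[i, k]
        # Some reduction over the scratch buffers (placeholder computation).
        s = 0.0
        for k in range(n + m - 1):
            s += abs(diff_buf[i, k])
        out[i] = s

# Host side: every scratch buffer is allocated here and passed in.
U = np.random.rand(1024, 8).astype(np.float32)
V = np.random.rand(1024, 8).astype(np.float32)
concat_buf = cuda.device_array((1024, 16), dtype=np.float32)
diff_buf = cuda.device_array((1024, 15), dtype=np.float32)
out = cuda.device_array(1024, dtype=np.float32)
pairwise_kernel[(1024 + 127) // 128, 128](
    cuda.to_device(U), cuda.to_device(V), concat_buf, diff_buf, out
)
result = out.copy_to_host()
```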

Happy to share what I've tried and where I'm at if this is of interest.

lmcinnes commented 3 years ago

This sounds like great work -- I never imagined this would be easy at all, and it sounds like you have made a great deal of progress. I don't think it is unreasonable to have GPU support for only some subset of metrics (and, potentially, not to support sparse input data, since that is definitely not going to play well with dynamically allocated arrays, concatenation, and so on). I would love to see what you have that works with Euclidean distance. I would hope that cosine / angular distance is manageable as well, and that would cover a vast number of use cases. Something like Wasserstein is just very hard.

sgbaird commented 3 years ago

I've got a working implementation of this, and it's quite fast compared with the other options I've seen: a 5k x 5k Wasserstein distance matrix in ~18 s on the GPU (including a near-negligible compile time). I also have an njit'd version that's fairly fast (~1 min?). That's something like a 200x speedup over passing scipy.stats.wasserstein_distance to cdist as a custom metric. I'd like to get it incorporated into PyNNDescent, possibly as a pip or conda dependency from which the functions are imported. Cosine distance should be a cinch to add. Euclidean on CUDA was ~4x faster than cdist, if I remember correctly.
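For context, this is a sketch of the slow baseline being compared against: plugging scipy.stats.wasserstein_distance into cdist as a Python-level callable, which invokes the metric once per pair. The array sizes are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import wasserstein_distance

# Each row is treated as a 1-D sample of values; cdist calls the
# pure-Python metric for every pair, which is the slow path the
# ~200x speedup figure is measured against.
X = np.random.rand(200, 100)
D = cdist(X, X, metric=wasserstein_distance)
```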

sgbaird commented 3 years ago

Where in the code are distances calculated in "batches" (e.g. with for loops)? I think that is primarily where the speedup would occur.

lmcinnes commented 3 years ago

That is the big catch. There are very few places where distances are computed in large batches; almost everything is in relatively small batches. In the main loop of the algorithm (as opposed to initialization), the work happens in generate_graph_updates (https://github.com/lmcinnes/pynndescent/blob/ee915d2599c1fd36e13ae4ac15ba91ecee503c86/pynndescent/pynndescent_.py#L160). The logic is a little involved since we compare new candidates with everything (new and old) but only compare old candidates with new ones. Mostly, however, you can see it is nested for loops, with a Numba parallel outer loop. Ideally it will be possible to GPU-ise that, but I'm not sure how easy that will be.
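A simplified paraphrase of that loop structure (not the actual pynndescent code, and with a hard-coded Euclidean metric standing in for the pluggable dist function) might look like the following; the new/old asymmetry exists because old/old pairs were already evaluated in previous iterations.

```python
import numpy as np
from numba import njit, prange

@njit
def euclid(x, y):
    acc = 0.0
    for k in range(x.shape[0]):
        d = x[k] - y[k]
        acc += d * d
    return np.sqrt(acc)

@njit(parallel=True)
def graph_updates_sketch(new_block, old_block, dist_thresholds, data):
    n_rows, max_candidates = new_block.shape
    # One update list per row, so parallel threads never share a list.
    updates = [[(-1, -1, np.inf)] for _ in range(n_rows)]
    for i in prange(n_rows):
        for j in range(max_candidates):
            p = int(new_block[i, j])
            if p < 0:  # sentinel: empty candidate slot
                continue
            # new vs new: upper triangle only, so each pair is seen once
            for k in range(j + 1, max_candidates):
                q = int(new_block[i, k])
                if q < 0:
                    continue
                d = euclid(data[p], data[q])
                if d <= dist_thresholds[p] or d <= dist_thresholds[q]:
                    updates[i].append((p, q, d))
            # new vs old: old/old pairs were compared in earlier rounds
            for k in range(max_candidates):
                q = int(old_block[i, k])
                if q < 0:
                    continue
                d = euclid(data[p], data[q])
                if d <= dist_thresholds[p] or d <= dist_thresholds[q]:
                    updates[i].append((p, q, d))
    return updates
```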

sgbaird commented 3 years ago

The only thing in generate_graph_updates that looks like it would cause trouble in a GPU version (other than, of course, needing GPU-compatible dist functions) is that updates is grown with an append method. Is the shape of updates unknown at runtime?

lmcinnes commented 3 years ago

Yes, but it is probably possible to make a decent estimate of an upper bound, so as long as memory is relatively cheap you can over-allocate and leave a sentinel value at the end. That does leave issues plumbing it back into the rest of the code, but I suspect a little wrapping of the actual GPU compute might work. Does that seem reasonable to you?
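A hedged sketch of that over-allocate-and-sentinel idea, with all names, shapes, and the sentinel convention as assumptions: each row gets a fixed upper-bound number of update slots, the kernel writes a -1 index into unused slots, and a thin host-side wrapper trims the padding before handing results back to the rest of the code.

```python
import numpy as np

MAX_UPDATES = 64  # estimated upper bound on updates per candidate row

def trim_updates(update_buf):
    """Convert sentinel-terminated rows back into a flat list of updates."""
    results = []
    for row in update_buf:
        for p, q, d in row:
            if p < 0:  # sentinel: no more real updates in this row
                break
            results.append((int(p), int(q), d))
    return results

# The GPU kernel would fill update_buf[i, slot] = (p, q, d) for real
# updates and leave the remaining slots at their sentinel values.
update_buf = np.full((128, MAX_UPDATES, 3), np.inf)
update_buf[..., 0] = -1.0  # index field starts out as the sentinel
```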

sgbaird commented 2 years ago

I know I'm getting back to this pretty late - but yes, that does seem reasonable. It has been on my mind these last few months as I've been working on publishing mat_discover. Here's the distance matrix implementation I've been using: https://github.com/sparks-baird/dist-matrix. I think that even without the GPU facilities, Wasserstein could be sped up just by using the njit Wasserstein distance.

To know whether it's worth running on the GPU, I'd probably need to estimate how many distance calculations are done in a single call to generate_graph_updates for some reasonably large problem(s). Do you have any examples that come to mind (e.g. 1 million points with Euclidean distance, 20,000 points with Wasserstein, etc.)?

Calculating the full distance matrix is fairly straightforward, so I wonder whether there's a (hacky) workaround of precomputing the full distance matrix and looking up its elements rather than recomputing them. Obviously, this would break down for problems where the full pairwise distance matrix can't fit into memory.

The dist_matrix function is also designed to handle the calculation of sparse distance matrices: give it a list of index pairs and you get back a list of distances corresponding to each pair.
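As a generic illustration of that pairs-in, distances-out pattern (this is not the actual dist_matrix API; the function name and the Euclidean stand-in metric are assumptions):

```python
import numpy as np
from numba import njit

@njit
def sparse_distances(data, pairs):
    # One distance per (i, j) index pair, rather than a full matrix.
    out = np.empty(pairs.shape[0])
    for n in range(pairs.shape[0]):
        i, j = pairs[n, 0], pairs[n, 1]
        acc = 0.0
        for k in range(data.shape[1]):
            d = data[i, k] - data[j, k]
            acc += d * d
        out[n] = np.sqrt(acc)
    return out

data = np.random.rand(1000, 8)
pairs = np.array([[0, 1], [0, 2], [5, 9]])
print(sparse_distances(data, pairs))
```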

lmcinnes commented 2 years ago

The number in a single call is often surprisingly small: anywhere from 400 to 3600 total distance computations (i.e. between 20 and 60 points, all-to-all). That may not map well onto GPU architectures. It seems like vector quantization or inverted-index style algorithms might be more amenable to these sorts of GPU-oriented approaches.