ogrisel opened 1 year ago
Here is a relevant gist of what could be a PyTorch drop-in replacement for the kneighbors method:
https://gist.github.com/fcharras/82772cf7651e087b3b91b99105a860dd
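For reference, the core of such a drop-in replacement is small. A minimal sketch, assuming Euclidean distances and using an illustrative function name (this is not the gist's exact code):

```python
import torch

def kneighbors(X, Q, n_neighbors=5):
    """For each row of Q, return the Euclidean distances and indices
    of its n_neighbors nearest rows in X, like sklearn's kneighbors."""
    # Materialize the full (n_queries, n_samples) pairwise distance matrix.
    distances = torch.cdist(Q, X)
    # largest=False selects the smallest distances, i.e. the nearest points.
    return torch.topk(distances, k=n_neighbors, dim=1, largest=False)
```

Like scikit-learn's NearestNeighbors.kneighbors, this returns distances first and indices second.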
Quoting myself from the k-means thread:
To my knowledge, the best brute-force implementations require materializing the pairwise distance matrix in memory and cannot go faster than that I/O bottleneck, so the achievable speedup is more limited, and the PyTorch implementation should be decently close to the best one can get.
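To make the memory constraint concrete: a chunked variant still materializes every entry of the distance matrix (so the memory-traffic bottleneck remains), but only a slice at a time, which keeps peak GPU memory bounded. A sketch, with chunk_size as an illustrative tuning knob:

```python
import torch

def kneighbors_chunked(X, Q, n_neighbors=5, chunk_size=4096):
    """Same result as the naive version above, but only a
    (chunk_size, n_samples) slice of the distance matrix
    is resident in memory at any point."""
    dists, idxs = [], []
    for start in range(0, Q.shape[0], chunk_size):
        d = torch.cdist(Q[start : start + chunk_size], X)
        dk, ik = torch.topk(d, k=n_neighbors, dim=1, largest=False)
        dists.append(dk)
        idxs.append(ik)
    return torch.cat(dists), torch.cat(idxs)
```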
It would be interesting to compare with cuML; if cuML is much faster than this PyTorch GPU implementation of brute-force kNN, it would be worth investigating whether we can reach similar performance with a Triton-based implementation.
Forked your original gist and added a basic cuML comparison (sketched below): https://gist.github.com/betatim/68219c95f539df51afad96cd9cd14a1c
On a machine with 8 Tesla V100s (32GB each), 80 Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz cores (from looking at /proc/cpuinfo), and 1TB of RAM, I get about 8s for the torch implementation and about 1s for the cuML option. That is with 5M samples in the data, so more than the original gist used.
On a second run I got 6s and 0.33s respectively. Seems to fluctuate a bit.
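For readers without the gist open, the cuML side of such a comparison looks roughly like this. A sketch only: the array shapes are arbitrary, and algorithm="brute" is my assumption about the relevant configuration, not a confirmed detail of the gist:

```python
import numpy as np
from cuml.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 50), dtype=np.float32)  # arbitrary shape

# cuML moves the host array to the GPU and runs the search there.
nn = NearestNeighbors(n_neighbors=5, algorithm="brute")
nn.fit(X)
distances, indices = nn.kneighbors(X)  # query the index against itself
```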
Have you tried setting CUDA_VISIBLE_DEVICES=0 to make sure that neither implementation leverages the fact that the benchmark machine has multiple GPU devices?
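One way to pin a single device from Python, assuming the variable is set before either library initializes CUDA (exporting it in the shell before launching the benchmark works just as well):

```python
import os

# Must be set before torch or cuml create a CUDA context,
# otherwise the restriction has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # imported after setting the variable, on purpose

print(torch.cuda.device_count())  # reports 1 even on a multi-GPU machine
```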
This issue is a sibling of a similar issue for k-means: #26585 with similar purpose but likely different constraints.
In particular, an efficient implementation of k-NN on the GPU would require torch.cdist and torch.topk, being discussed at: