
Array API support for k-nearest neighbors models with the brute force method #26586

Open ogrisel opened 1 year ago

ogrisel commented 1 year ago

This issue is a sibling of the corresponding issue for k-means (#26585), with a similar purpose but likely different constraints.

In particular, an efficient implementation of k-NN on the GPU would require:

fcharras commented 1 year ago

Here is a relevant gist of what a PyTorch drop-in replacement for the kneighbors method could look like:

https://gist.github.com/fcharras/82772cf7651e087b3b91b99105a860dd

Quoting myself from the k-means thread:

To my knowledge, the best brute-force implementations require materializing the pairwise distance matrix in memory and cannot get past the IO bottleneck, so the achievable speedup is more limited, and the PyTorch implementation should be decently close to the best you can get.
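For readers who do not want to open the gist, here is a minimal sketch of what such a brute-force kneighbors query might look like in PyTorch. This is not the gist's actual code; the function name, signature, and chunking strategy are assumptions, but chunking over queries is the usual way to keep the pairwise distance matrix from fully materializing at once:

```python
import torch

def kneighbors(X_train, X_query, n_neighbors, chunk_size=1024):
    """Return (distances, indices) of the n_neighbors nearest training
    points for each query point, processing queries in chunks so that
    only a (chunk_size, n_train) slice of the pairwise distance matrix
    is ever resident in memory."""
    dists, idxs = [], []
    for start in range(0, X_query.shape[0], chunk_size):
        chunk = X_query[start:start + chunk_size]
        # Pairwise Euclidean distances for this chunk: (chunk_size, n_train)
        d = torch.cdist(chunk, X_train)
        # k smallest distances per row and their column indices
        d_k, i_k = torch.topk(d, n_neighbors, dim=1, largest=False)
        dists.append(d_k)
        idxs.append(i_k)
    return torch.cat(dists), torch.cat(idxs)
```

Running the same code on CPU or GPU is then just a matter of where X_train and X_query live.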

ogrisel commented 1 year ago

It would be interesting to compare with cuML. If cuML is much faster than this PyTorch GPU implementation of brute-force kNN, then it might be worth checking whether we can get similar performance with a Triton-based implementation.
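For reference, the cuML side of such a comparison could look like the sketch below (the data shape and n_neighbors value are assumptions, not taken from the thread):

```python
import cupy as cp
from cuml.neighbors import NearestNeighbors

# Synthetic data already resident on the GPU; shape is illustrative.
X_gpu = cp.random.random((5_000_000, 14)).astype(cp.float32)

nn = NearestNeighbors(n_neighbors=10, algorithm="brute")
nn.fit(X_gpu)
distances, indices = nn.kneighbors(X_gpu)
```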

betatim commented 1 year ago

Forked your original gist and added some basic cuml comparison: https://gist.github.com/betatim/68219c95f539df51afad96cd9cd14a1c

On a machine with 8 Tesla V100s (32GB each), 80 Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz cores (from /proc/cpuinfo), and 1TB RAM, I get about 8s for the torch implementation and about 1s for the cuml option, with 5M samples in the data, so more than the original gist used.

On a second run I got 6s and 0.33s respectively. Seems to fluctuate a bit.
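One likely source of fluctuation is that CUDA kernels launch asynchronously, so wall-clock timings are only meaningful after an explicit synchronization. A minimal timing sketch (the benchmarked function and its arguments are placeholders, not the gist's code):

```python
import time
import torch

def time_gpu(fn, *args, n_repeats=5):
    fn(*args)                      # warm-up run (compilation, memory pools)
    torch.cuda.synchronize()
    tic = time.perf_counter()
    for _ in range(n_repeats):
        fn(*args)
    torch.cuda.synchronize()       # wait for all queued kernels to finish
    return (time.perf_counter() - tic) / n_repeats
```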

ogrisel commented 1 year ago

Have you tried setting CUDA_VISIBLE_DEVICES=0 to make sure that neither implementation leverages the fact that the benchmark machine has multiple GPU devices?
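For example, this can be done from within the benchmark script itself, as long as it happens before torch or cuml initialize the CUDA context:

```python
import os

# Restrict both libraries to a single GPU; must run before any
# CUDA-using import creates a context.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```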