rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[QST] Performance of NearestNeighbours kneighbours lookup #4021

thomasaarholt commented 3 years ago

What is your question? Is the NearestNeighbors.kneighbors problem well-suited for the GPU? I find it quite slow for big data on the GPU. I've run the following on a single RTX 2080 Ti and a 32-core AMD processor. The .fit() method is very fast, but the .kneighbors() method is not. See below for some context and benchmarks. Note that in #4020 I show that the other GPU algorithms crash the kernel.

We can use:

from cuml.neighbors import NearestNeighbors
model = NearestNeighbors()
model.fit(points)
model.kneighbors(new_points)

to find the neighbors of new_points among points, where both are arrays of coordinates with shape (N, 2).

While creating and fitting the model is very quick with cuML, the kneighbors lookup is relatively slow compared with sklearn's CPU implementation:

import numpy as np, cupy as cp
from sklearn.neighbors import NearestNeighbors as NearestNeighborsCPU
from cuml.neighbors import NearestNeighbors as NearestNeighborsGPU
from time import time

n_neighbors = 10
n_samples = 1_000_000
n_unknown = 100_000

X = 1000*np.random.random((n_samples, 2))
unknown = 1000*np.random.random((n_unknown, 2))

print('Sklearn results')
for algo in ['auto', 'ball_tree', 'kd_tree']: # 'brute' takes a *really* long time, so we skip it
    t1 = time()
    knn = NearestNeighborsCPU(n_neighbors=n_neighbors, algorithm=algo)
    knn.fit(X)
    t2 = time()
    estimates = knn.kneighbors(unknown)
    print(f"{algo}: {t2-t1:.2f} and {time()-t2:.2f}")

print('\ncuML results')
X = cp.asarray(X)
unknown = cp.asarray(unknown)

for algo in ['brute']: # see cuML #4020 for why I don't include other algorithms
    t1 = time()
    knn = NearestNeighborsGPU(n_neighbors=n_neighbors, algorithm=algo)
    knn.fit(X)
    t2 = time()
    estimates = knn.kneighbors(unknown)
    print(f"{algo}: {t2-t1:.2f} and {time()-t2:.2f}")

Running this simple benchmark with "medium" and "large" numbers of samples, I see the following results:

# times in seconds, for [.fit] and [.kneighbors]
# n_samples = 1_000_000
# n_unknown = 100_000
Sklearn results
auto: 0.95 and 0.62
ball_tree: 0.91 and 1.40
kd_tree: 0.97 and 0.70

cuML results
brute: 0.01 and 2.19

# n_samples = 10_000_000
# n_unknown = 1_000_000
Sklearn results
auto: 39.92 and 31.47
ball_tree: 36.40 and 43.35
kd_tree: 38.36 and 14.45

cuML results
brute: 0.03 and 200.59

teju85 commented 3 years ago

@mdoijade can you take a look at this and see why kneighbors is so slow?

dantegd commented 3 years ago

@teju85 wouldn't the difference be because this is comparing different algorithms? i.e. brute on GPU (which takes too long on CPU) vs kd_tree/ball_tree?
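
For an apples-to-apples check, one could time sklearn's own brute-force search on a reduced problem. A minimal sketch (the smaller sizes here are arbitrary, since CPU brute force is impractical at the full 1M/100k scale):

import numpy as np
from time import time
from sklearn.neighbors import NearestNeighbors

# Reduced problem size so that CPU brute force finishes in reasonable time
n_samples, n_unknown, n_neighbors = 100_000, 10_000, 10
X = 1000 * np.random.random((n_samples, 2))
unknown = 1000 * np.random.random((n_unknown, 2))

knn = NearestNeighbors(n_neighbors=n_neighbors, algorithm='brute')
t1 = time()
knn.fit(X)
t2 = time()
knn.kneighbors(unknown)
print(f"brute (CPU): {t2-t1:.2f} and {time()-t2:.2f}")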

teju85 commented 3 years ago

:smh: good catch @dantegd! I overlooked the algorithm list in the benchmark snippet. Yes, this is indeed because brute force is being compared against indexing methods. We don't yet have indexing-based methods implemented.
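
To put rough numbers on the gap, a back-of-envelope count for the large benchmark (idealized; it ignores constants, batching, and kd-tree backtracking):

import math

n_samples, n_unknown = 10_000_000, 1_000_000

# Brute force evaluates the distance for every (sample, query) pair
brute = n_samples * n_unknown
# An idealized kd-tree query visits roughly log2(n_samples) nodes
tree = n_unknown * math.log2(n_samples)

print(f"brute force: {brute:.1e} distance evaluations")  # ~1.0e+13
print(f"kd-tree (idealized): {tree:.1e} node visits")    # ~2.3e+07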

thomasaarholt commented 3 years ago

Right, I should have made that clearer. Just in case this wasn't noticed either, I created #4020, pointing out that the other GPU algorithms crash Python on my system.

cjnolet commented 3 years ago

@thomasaarholt, as @dantegd pointed out, the differences here are strictly due to the underlying algorithms. While it may increase training time, we do support the ivfpq algorithm, which should significantly reduce the query time at least.
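
For reference, a minimal sketch of requesting the IVF-PQ index through the same API (the algo_params values below are illustrative guesses, not tuned recommendations, and note that #4020 reports the non-brute algorithms crashing on some setups):

import cupy as cp
from cuml.neighbors import NearestNeighbors

X = 1000 * cp.random.random((1_000_000, 2))
unknown = 1000 * cp.random.random((100_000, 2))

knn = NearestNeighbors(
    n_neighbors=10,
    algorithm='ivfpq',
    algo_params={
        'nlist': 1024,  # number of inverted lists (coarse clusters)
        'nprobe': 32,   # lists probed per query: higher is more accurate but slower
        'M': 2,         # PQ sub-quantizers; must divide the dimensionality (2 here)
        'n_bits': 8,    # bits per sub-quantizer code
        'usePrecomputedTables': False,
    },
)
knn.fit(X)                                    # builds the index (slower than the brute fit)
distances, indices = knn.kneighbors(unknown)  # approximate lookup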

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.