src-d / kmcuda

Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA
Other
791 stars 145 forks source link

Is it possible to find the nearest neighbor of a new data point #22

Closed gammaturn closed 7 years ago

gammaturn commented 7 years ago

Great library, thanks a lot!

I would like to use it for solving the following problem: I have two sets of points e.g. in 3D-space. With the first set I perform a kmeans-clustering and then I would like to identify for each point in the second set its nearest neighbor in the first. As far as I understand, the array "samples" in knn_cuda is the same as in kmeans_cuda and the function returns the nearest neighbors within "samples". I could not see a way to provide a second array.

I hope that I did not miss something obvious. Any help would be highly appreciated.

vmarkovtsev commented 7 years ago

Sorry for the lag.

Yep, samples must be k-means-ed, but it is easily addressable using a 3-stage pipeline:

  1. Cluster your first set of points using kmeans_cuda.
  2. Supply the resulting centroids as cluster centers to kmeans_cuda with samples from the second set and run for 0 iterations - that is, only calculate the assignments.
  3. Take the second set and run knn_cuda using the assignments from step 2.

Actually, you can do step 2 without even applying kmeans_cuda.

gammaturn commented 7 years ago

Thanks a lot for the reply!

I am afraid that I was not able to completely follow your suggestions. I must admit that my C/C++ knowledge (let alone cuda) is quite limited and I intend to use the library from Python. How would I e.g. set the number of iterations to 0 in this case? I was unable to find anything in the API-documentation.

Specifically, I would like to to solve problems similar to the following, but replace scikit-learn (or flann) with your functions.

import numpy as np
from sklearn.neighbors import NearestNeighbors

np.random.seed(0)
arr_1 = np.random.randn(500, 2).astype(np.float32)
arr_2 = np.random.randn(200, 2).astype(np.float32)

neigh = NearestNeighbors().fit(arr_1)
distances, indices = neigh.kneighbors(arr_2)

print(distances.shape)
print(distances[:5,:])     # distances to points in arr_1

print(indices.shape)
print(indices[:5,:])     # indices in arr_1

That would result in:

(200, 5)
[[ 0.04925391  0.06528084  0.16510354  0.16571016  0.20100214]
 [ 0.03884278  0.07760898  0.07846053  0.09056447  0.09334525]
 [ 0.07785564  0.12316782  0.12821217  0.13993794  0.15464237]
 [ 0.08776998  0.22432552  0.23194346  0.26785866  0.27977358]
 [ 0.01887633  0.03208937  0.0578081   0.08504257  0.09562645]]

(200, 5)
[[306  59 484 258 341]
 [157 173 231  17  89]
 [113 275 308 451 208]
 [285 102  92 188 375]
 [244 472 221 182 433]]

Any hints, how I could reproduce that computation using libKMCUDA?

vmarkovtsev commented 7 years ago
from libKMCUDA import kmeans_cuda, knn_cuda
import numpy as np

np.random.seed(0)
arr_1 = np.random.randn(500, 2).astype(np.float32)
arr_2 = np.random.randn(200, 2).astype(np.float32)

centroids, _ = kmeans_cuda(arr_1, arr_1.shape[0] // 10, yinyang_t=0)
_, asses = kmeans_cuda(arr_2, arr_1.shape[0] // 10, init=centroids, tolerance=1.0, yinyang_t=0)

neighbors = knn_cuda(5, arr_2, centroids, asses)

500x2 is too small for libKMCUDA. 5000000x2 is. Besides, knn_cuda performs much slower on K which exceeds the shared memory size (let's say, 10).

gammaturn commented 7 years ago

Thank you very much for the code! But knn_cuda will still give me the nearest neighbors within arr_2. I am afraid that I misrepresented my problem from the very beginning, but I need to identify the closest point(s) in the other array. I quickly plotted the points in arr_1 and arr_2and the question is: For each orange point, which is the closest blue point?

image

This is, what scikit-learn provides with:

neigh = NearestNeighbors().fit(arr_1)
distances, indices = neigh.kneighbors(arr_2)

I assume that knn_cuda would then need both arrays to answer this question.

Sorry for the confusion!

vmarkovtsev commented 7 years ago

All right! Sorry, I misunderstood you. Yeah, this is an API fault. Currently, it is not possible.

I can recommend you https://github.com/facebookresearch/faiss A bit trickier to setup but should work exactly how you need.

gammaturn commented 7 years ago

Thanks for the hint!