rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0
4.23k stars 532 forks source link

[FEA] NearestNeighbors support for ball radius queries #262

Open cjnolet opened 5 years ago

cjnolet commented 5 years ago

DBSCAN is already doing this, as it uses the eps value to determine the the edges of the adjacency graph. It should not be too hard to perform the same batching that DBSCAN is performing, whilst thresholding the distance matrix and returning the resulting vectors.

teju85 commented 5 years ago

This sounds the same as kNN!? Not sure I got you here @cjnolet. Can you elaborate, please?

cjnolet commented 5 years ago

Rather than it returning the closest neighbors, it is only returning the neighbors that are within a specific distance. For instance, we are doing this on DBSCAN already w/ the eps parameters (basically filtering the pairwise distance matrix based on some epsilon).

https://en.wikipedia.org/wiki/Neighbourhood_(mathematics)

You can accomplish this with the k-nn distance matrix if you set k = n, for instance, but it's something that, to my knowledge, does not exist yet (in reusable form) in cuml.

cjnolet commented 5 years ago

Here's the sklearn API for NearestNeighbors (which is the sklearn equivalent for cuml's current kNN implementation): https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors

Notice the methods kneighbors() and radius_neighbors().

teju85 commented 5 years ago

ah.. got it. Thanks Corey for the explanation.

cjnolet commented 5 years ago

We now have a prim for building adjacency matrices out of epsilon neighborhoods:

size_t epsilon_neighborhood(T *a, T *b, bool *adj, int m, int n, int k, T eps,
            void *workspace, size_t worksize, cudaStream_t stream)