rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] DBSCAN to support same distance metrics as sklearn #212

Open cjnolet opened 5 years ago

cjnolet commented 5 years ago

I am currently working on a comprehensive survey that includes all of the scipy/sklearn distances and loads more. The purpose of this survey is to be able to implement all of the distance functions as efficiently as possible on GPUs.

Once some of these algorithms are defined, we need to expose them through the DBSCAN C++ API and then allow them to be set as options through the Cython API.

At present, the distance primitives already implement other distance metrics that still need to be exposed through the C++ API and, from there, through the Cython layer.

bittremieux commented 5 years ago

I presume Euclidean distance is used now, the same as in scikit-learn? It's a bit confusing that the cuML docs explicitly mention DBSCAN's sensitivity to the distance metric but neither specify which metric is actually used nor allow one to be chosen.

It would be nice if different distance metrics could be used for DBSCAN.

mvss80 commented 4 years ago

+1 for other distance metrics, especially cosine, for DBSCAN.

cjnolet commented 4 years ago

@mvss80, since Euclidean distance is used currently, cosine distance can be supported today by normalizing your vectors to unit norm. In the meantime, we can certainly work to add cosine & L1 distance.
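
To illustrate the normalization workaround: for unit-norm vectors, the squared Euclidean distance equals twice the cosine distance, so a cosine-distance threshold maps onto a Euclidean `eps` via a square root. The snippet below is a minimal sketch in plain Python, not cuML code; the helper names (`normalize`, `cosine_eps_to_euclidean_eps`, and so on) are hypothetical and exist only for this illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_distance(a, b):
    """Cosine distance, 1 - cos(a, b), for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Plain L2 distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_eps_to_euclidean_eps(eps_cosine):
    """Hypothetical helper: convert a cosine-distance eps to a Euclidean eps.

    For unit vectors, ||a - b||^2 = 2 * (1 - cos(a, b)),
    so d_euclidean = sqrt(2 * d_cosine).
    """
    return math.sqrt(2.0 * eps_cosine)


a = [3.0, 4.0]
b = [4.0, 3.0]

d_cos = cosine_distance(a, b)
d_euc = euclidean_distance(normalize(a), normalize(b))

# The two quantities agree up to floating-point error.
assert math.isclose(d_euc, cosine_eps_to_euclidean_eps(d_cos), rel_tol=1e-9)
```

Because the mapping is monotonic, running Euclidean DBSCAN on the normalized vectors with the converted `eps` should select the same neighborhoods as a cosine-distance DBSCAN with the original threshold.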

@bittremieux, please forgive my delayed response. I will update the DBSCAN Python API to match sklearn's metric argument. For now, I will add euclidean as the default.

cjnolet commented 3 years ago

Our pairwise distances API now supports several distances in addition to Euclidean, L1, and Cosine, so I think it should now be possible to match scikit-learn more closely, and to update HDBSCAN as well.