Open cjnolet opened 5 years ago
I presume Euclidean distance is used now, the same as in scikit-learn? It's a bit confusing that the cuML docs explicitly mention DBSCAN's sensitivity to the distance metric but neither specify which distance metric is actually used nor allow one to be specified.
It would be nice if different distance metrics could be used for DBSCAN.
+1 for other distance metrics, especially cosine, for DBSCAN.
@mvss80, Since Euclidean distance is used currently, cosine distance can be supported today by normalizing your vectors to unit norm. In the meantime, we can certainly work to add cosine & L1 distance.
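To illustrate the normalization workaround, here is a minimal sketch in plain Python (no cuML calls; the helper names are made up for this example). It relies on the identity that for unit-norm vectors `||a - b||^2 = 2 * (1 - cos(a, b))`, so a cosine-distance threshold `eps_cosine` should correspond to a Euclidean `eps` of `sqrt(2 * eps_cosine)` on the normalized data.

```python
import math

def normalize(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def euclidean(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """Cosine distance, defined as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def cosine_eps_to_euclidean_eps(eps_cosine):
    """Hypothetical helper: convert a cosine-distance eps into the
    equivalent Euclidean eps for unit-norm vectors."""
    return math.sqrt(2.0 * eps_cosine)

a = [3.0, 4.0]
b = [1.0, 0.0]

d_cos = cosine_distance(a, b)                    # distance on the raw vectors
d_euc = euclidean(normalize(a), normalize(b))    # distance after normalization

# The two distances agree once the cosine distance is converted.
print(d_euc, cosine_eps_to_euclidean_eps(d_cos))
```

In practice you would normalize each row of the input once, then pass the converted `eps` to DBSCAN with its default Euclidean metric.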
@bittremieux, please forgive my delayed response. I will update the DBSCAN Python API to match sklearn's `metric` argument. For now, I will add `euclidean` as the default.
Our pairwise distances API now supports several distances in addition to Euclidean, L1, and Cosine, so I think it should be possible to match scikit-learn more closely now and to update HDBSCAN as well.
I am currently working on a comprehensive survey that includes all of the scipy/sklearn distances and many more. The purpose of this survey is to be able to implement all of the distance functions as efficiently as possible on GPUs.
Once some of these algorithms are defined, we need to expose them through the DBSCAN C++ API and then allow them to be set as options through the Cython API.
At present, there are other distance metrics already implemented in the distance primitives that still need to be exposed through the C++ API and, from there, through the Cython layer.