rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Cosine similarity for HDBSCAN #6123

Open nivibilla opened 4 weeks ago

nivibilla commented 4 weeks ago

Is your feature request related to a problem? Please describe.
I would like to use cosine similarity as a metric for HDBSCAN on high-dimensional data without dimensionality reduction, because the reduction step is quite slow on large datasets.

Describe the solution you'd like
Allow clusterer = cuml.cluster.hdbscan.HDBSCAN(min_cluster_size=50, metric='cosine', prediction_data=True)

Describe alternatives you've considered
Regular DBSCAN supports the cosine metric, but it has issues with clusters of varying sizes.

nivibilla commented 4 weeks ago

Trying this currently raises an error at this line:

https://github.com/rapidsai/cuml/blob/fb4c8af8ed42159e61b226c7b23a9086306df56e/python/cuml/cuml/cluster/hdbscan/hdbscan.pyx#L838

divyegala commented 2 weeks ago

@nivibilla for now, you could L2-normalize your dataset and pass metric='euclidean' to get results equivalent to cosine distance.
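
A minimal sketch of that workaround, assuming the input is a dense NumPy array of row vectors. For unit-length vectors a and b, the squared Euclidean distance equals 2 * (1 - cos(a, b)), so Euclidean distance preserves the neighbor ordering of cosine distance even though the distance values differ. The helper names l2_normalize and cluster_with_cosine_workaround are hypothetical, not part of cuML, and the sketch does not verify how closely the resulting labels match a native cosine run.

```python
import numpy as np


def l2_normalize(X, eps=1e-12):
    """Scale each row of X to unit Euclidean length (hypothetical helper)."""
    X = np.asarray(X, dtype=np.float32)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return X / np.maximum(norms, eps)


def cluster_with_cosine_workaround(X, min_cluster_size=50):
    """Hypothetical helper: run cuML HDBSCAN on unit-length rows."""
    # Imported here so the rest of this sketch runs without a GPU.
    from cuml.cluster.hdbscan import HDBSCAN

    clusterer = HDBSCAN(
        min_cluster_size=min_cluster_size,
        metric="euclidean",  # stands in for cosine on normalized rows
        prediction_data=True,
    )
    return clusterer.fit_predict(l2_normalize(X))


# Numerical check of the identity on random data (no GPU needed).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 128))
Xn = l2_normalize(X)

a, b = Xn[0], Xn[1]
squared_euclidean = float(np.sum((a - b) ** 2))
cosine_distance = 1.0 - float(
    np.dot(X[0], X[1]) / (np.linalg.norm(X[0]) * np.linalg.norm(X[1]))
)
# squared_euclidean and 2 * cosine_distance agree up to float32 rounding.
```

On a machine with cuML installed, labels = cluster_with_cosine_workaround(embeddings) would return one cluster label per row. Any new points passed to prediction utilities later would need the same normalization first.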