rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Cosine similarity for HDBSCAN #6123

Open nivibilla opened 3 hours ago

nivibilla commented 3 hours ago

Is your feature request related to a problem? Please describe. I would like to use cosine similarity as the metric for HDBSCAN on high-dimensional data without first applying dimensionality reduction, since that step is quite slow on large datasets.

Describe the solution you'd like Allow metric='cosine' in HDBSCAN, for example: clusterer = cuml.cluster.hdbscan.HDBSCAN(min_cluster_size=50, metric='cosine', prediction_data=True)

Describe alternatives you've considered Plain DBSCAN supports the cosine metric, but it has trouble with clusters of varying sizes.
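Another possible alternative, offered as a sketch and not as a confirmed cuML feature: L2-normalize each row and cluster with the Euclidean metric. For unit vectors, squared Euclidean distance equals 2 * (1 - cosine similarity), so the neighbor ordering matches cosine distance. The NumPy snippet below only checks that identity on random data; the helper name l2_normalize and the sample array are illustrative assumptions, and the resulting clusters may not exactly match what a native cosine metric would produce, because the distances are rescaled.

```python
import numpy as np


def l2_normalize(X, eps=1e-12):
    """Scale each row of X to unit L2 norm (hypothetical helper, not a cuML API)."""
    X = np.asarray(X, dtype=np.float64)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return X / np.maximum(norms, eps)


# Illustrative random data standing in for high-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Xn = l2_normalize(X)

# Cosine distance computed directly on the raw vectors.
row_norms = np.linalg.norm(X, axis=1)
cos_sim = (X @ X.T) / (row_norms[:, None] * row_norms[None, :])
cos_dist = 1.0 - cos_sim

# Squared Euclidean distance between the normalized vectors.
diff = Xn[:, None, :] - Xn[None, :, :]
sq_euclid = np.sum(diff ** 2, axis=-1)

# The two quantities agree up to the constant factor 2.
identity_holds = np.allclose(sq_euclid, 2.0 * cos_dist)
```

Under this assumption, the normalized array Xn would be passed to HDBSCAN with its Euclidean metric in place of the raw data.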

nivibilla commented 3 hours ago

Trying this currently raises an error at this line:

https://github.com/rapidsai/cuml/blob/fb4c8af8ed42159e61b226c7b23a9086306df56e/python/cuml/cuml/cluster/hdbscan/hdbscan.pyx#L838