scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.81k stars 507 forks

Speed issues on large datasets #587

Open TopoKunst opened 1 year ago

TopoKunst commented 1 year ago

Hello,

Recently I tried to cluster a large dataset (700k samples, 300 dimensions) and found HDBSCAN relatively slow: the algorithm had not finished after 30 minutes.

Given that HDBSCAN cannot fully parallelize its computation, I decided to preprocess the dataset so that the algorithm can handle it. There are two directions: performing dimensionality reduction vs. sampling the data.

I am wondering which method is better for HDBSCAN, in terms of both efficiency and clustering quality. Also, are there any other solutions I could try?

Thank you very much.