Speed issues on large datasets

Hello,

Recently I tried to cluster a large dataset (700k samples, 300 dimensions) and found HDBSCAN relatively slow: the run had not finished after 30 minutes.

Given that HDBSCAN cannot fully parallelize its computation, I decided to preprocess the dataset so that the algorithm can handle it. There are two directions: dimensionality reduction vs. sampling the data.

I am wondering which method is better for HDBSCAN, in terms of both efficiency and clustering performance. Also, are there any other solutions I could try?

Thank you very much.

scikit-learn-contrib / hdbscan

Speed issues on large datasets #587