Hello,
Recently I tried to perform clustering on a large dataset (700k samples, 300 dimensions), and I found HDBSCAN relatively slow: the algorithm had not finished after 30 minutes.
Given that HDBSCAN cannot fully parallelize its computation, I decided to preprocess the dataset to make it feasible for the algorithm. There are two directions: performing dimensionality reduction vs. sampling the data, as sketched below.
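For concreteness, here is a minimal sketch of the two pipelines I have in mind, assuming scikit-learn's PCA and the hdbscan package; the component count, subsample size, and min_cluster_size values are just placeholders, not tuned choices:

```python
import numpy as np
from sklearn.decomposition import PCA
import hdbscan

# Placeholder standing in for the real (700k, 300) data matrix
X = np.random.rand(700_000, 300).astype(np.float32)

# Option A: reduce dimensionality first, then cluster all samples
X_reduced = PCA(n_components=30).fit_transform(X)          # 30 components is a placeholder
labels_a = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(X_reduced)

# Option B: cluster a random subsample, then assign the remaining
# points approximately via hdbscan's prediction utilities
idx = np.random.choice(len(X), size=100_000, replace=False)  # 100k subsample is a placeholder
clusterer = hdbscan.HDBSCAN(min_cluster_size=50, prediction_data=True).fit(X[idx])
rest = np.setdiff1d(np.arange(len(X)), idx)
labels_b_rest, _ = hdbscan.approximate_predict(clusterer, X[rest])
```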
I am wondering which approach is better for HDBSCAN, in terms of both efficiency and clustering quality. Also, are there any other solutions I could try?
Thank you very much.