Open PhilipMay opened 3 years ago
+1
I saw the performance notebook, but I am not seeing this in practice — my run is also using just a single core (6% CPU usage on a 16-core machine), and has been running on a dataset of 2M rows and 128 dimensions for over 24 hours.
Is there any way to check that HDBSCAN is running with its optimal settings?
A major key for performance is getting the feature space under 50 dimensions. Above that it has to fall back on algorithms which are much slower. So if you can use PCA or similar to get down to 40 or so dimensions, or UMAP down to say 10 dimensions, then things will likely run much faster.
In terms of multiprocessing -- the algorithm is not parallelisable (at least not trivially), so multicore/multiprocessing is not really tractable. An entirely different algorithmic approach would be required to enable that. This is something I have given some thought to, but it would be part of a newer clustering library and approach, should that ever get implemented.
Hi, when I run HDBSCAN I see that only one CPU core is used, at 100%. Would it be possible to implement HDBSCAN in a way that uses multiple cores? Thanks, Philip