rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] request for HDBSCAN clustering #1783

Closed mvss80 closed 2 years ago

mvss80 commented 4 years ago

Are there any plans to get HDBSCAN implemented?

https://github.com/scikit-learn-contrib/hdbscan

cjnolet commented 4 years ago

HDBSCAN is on our radar, but it is not on our roadmap currently.

That being said, it certainly helps us with future prioritization to know that individuals in the community are interested in this.

noob-procrastinator commented 3 years ago

Any development regarding HDBSCAN? Using UMAP + HDBSCAN is quite the wonder!
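
For anyone landing here, a minimal sketch of the UMAP + HDBSCAN workflow being described, using the CPU libraries (umap-learn and scikit-learn-contrib hdbscan); the data and parameter values are illustrative only:

```python
# Minimal sketch of the UMAP -> HDBSCAN workflow with the CPU libraries.
# Placeholder data and illustrative parameters, not a recommended configuration.
import numpy as np
import umap
import hdbscan

X = np.random.rand(10_000, 768).astype(np.float32)  # stand-in for real embeddings

# Reduce dimensionality first so HDBSCAN's density estimation behaves better.
embedding = umap.UMAP(n_components=10, n_neighbors=15, min_dist=0.0).fit_transform(X)

clusterer = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=10)
labels = clusterer.fit_predict(embedding)  # -1 marks noise points
```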

cjnolet commented 3 years ago

@noob-procrastinator there has absolutely been progress on HDBSCAN. We are currently integrating single-linkage hierarchical clustering (https://github.com/rapidsai/raft/pull/140) as a core building block for HDBSCAN. The major challenge has been scaling these algorithms by using a kNN graph to reduce the n^2 memory requirement.

I have a draft PR open for HDBSCAN so you can track progress: https://github.com/rapidsai/cuml/pull/3546

cjnolet commented 3 years ago

@noob-procrastinator,

An experimental version of HDBSCAN was merged into version 21.06 and we are continuing to improve it in 21.08 and beyond. Please let us know as you find bugs and whether there are any missing options / features you'd like to use.
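
For reference, a minimal usage sketch of the experimental estimator (illustrative parameters and placeholder data; check the 21.06 docs for the exact supported options):

```python
# Minimal sketch of the experimental cuML HDBSCAN (21.06+); values are illustrative.
import cupy as cp
from cuml.cluster import HDBSCAN

X = cp.random.random((100_000, 50)).astype(cp.float32)  # placeholder data on the GPU

clusterer = HDBSCAN(min_cluster_size=25, min_samples=10)
clusterer.fit(X)

labels = clusterer.labels_  # -1 marks noise, as in the CPU implementation
print(int(labels.max()) + 1, "clusters found")
```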

versis commented 3 years ago

tl;dr @cjnolet does your implementation require more memory than the original hdbscan implementation (https://hdbscan.readthedocs.io)?

My data is a few million data points in 768-dimensional space. I also used UMAP to reduce the dimensionality, but it wasn't enough.

I was benchmarking both agglomerative clustering from the fastcluster lib and hdbscan. Agglomerative clustering was much faster, but required too much RAM. HDBSCAN used very little memory, but took much more time. The RAPIDS implementation seems to be much faster, but is the memory usage the same?
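
A simplified sketch of that kind of comparison (placeholder data and illustrative parameters, not the exact benchmark code):

```python
# Rough timing sketch for the two CPU approaches on the same sample; illustrative only.
import time
import numpy as np
import fastcluster
import hdbscan

X = np.random.rand(50_000, 20).astype(np.float64)  # stand-in for UMAP-reduced vectors

t0 = time.perf_counter()
Z = fastcluster.linkage_vector(X, method="ward")   # agglomerative clustering (linkage matrix)
t1 = time.perf_counter()

labels = hdbscan.HDBSCAN(min_cluster_size=100).fit_predict(X)  # density-based clustering
t2 = time.perf_counter()

print(f"fastcluster: {t1 - t0:.1f}s, hdbscan: {t2 - t1:.1f}s")
```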

cjnolet commented 3 years ago

@versis,

The current HDBSCAN implementation in cuML uses brute-force nearest neighbors instead of materializing the full pairwise distance matrix in memory. We also built a primitive to connect the KNN graph in order to get the exact solution (and guarantee convergence of the MST). We are planning to use approximate nearest neighbors in a future release to speed up the computation further, but brute force is pretty great for the initial version.
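
To make the memory difference concrete, a rough back-of-the-envelope comparison (illustrative arithmetic only, with an assumed neighbor count; not taken from the cuML code):

```python
# Back-of-the-envelope memory comparison: full pairwise distance matrix vs. a kNN graph.
# Illustrative arithmetic only; k is an assumed neighbor count, not a cuML default.
n, k = 3_000_000, 16          # points, neighbors per point
bytes_per_float = 4           # float32

pairwise = n * n * bytes_per_float             # materializing all n^2 distances
knn_graph = n * k * (bytes_per_float + 4)      # k distances + k int32 indices per point

print(f"pairwise: {pairwise / 1e12:.1f} TB, kNN graph: {knn_graph / 1e9:.2f} GB")
```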

The GoogleNews Word2Vec dataset (3M x 300) takes ~22 minutes on my V100 (32GB). We are going to work on making it even faster, but that's still very good considering the scikit-learn-contrib version had to be stopped after a day of running. Also, our initial version supports only the base set of features in the algorithm, but please let us know if there are missing features you'd like to see, such as fuzzy (soft) clustering or inference on out-of-sample data points.

versis commented 3 years ago

@cjnolet Thank you for the quick response. 22 minutes seems promising! We will take a look, although inference is a must-have feature for us ;/
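
For context, the out-of-sample inference being asked for here is what the CPU hdbscan library exposes as `approximate_predict` (shown only to illustrate the request; not a cuML feature at this point in the thread):

```python
# Out-of-sample inference as it exists in the scikit-learn-contrib hdbscan library.
# Placeholder data; illustrative parameters.
import numpy as np
import hdbscan

X_train = np.random.rand(10_000, 10)
X_new = np.random.rand(100, 10)

clusterer = hdbscan.HDBSCAN(min_cluster_size=50, prediction_data=True).fit(X_train)

# Assign new points to the existing clusters without refitting.
labels, strengths = hdbscan.approximate_predict(clusterer, X_new)
```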

sparkdoc commented 3 years ago

Will you be including the soft clustering capability? https://hdbscan.readthedocs.io/en/latest/soft_clustering.html#soft-clustering-for-hdbscan
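
For reference, the soft-clustering API in the CPU hdbscan library that this request refers to (illustrative sketch only; not a cuML feature at this point in the thread):

```python
# Soft clustering as implemented in the scikit-learn-contrib hdbscan library.
# Placeholder data; illustrative parameters.
import numpy as np
import hdbscan

X = np.random.rand(10_000, 10)

clusterer = hdbscan.HDBSCAN(min_cluster_size=50, prediction_data=True).fit(X)

# One membership vector per point: soft probabilities over all clusters.
soft_memberships = hdbscan.all_points_membership_vectors(clusterer)
print(soft_memberships.shape)  # (n_samples, n_clusters)
```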

beckernick commented 2 years ago

@dantegd @cjnolet does it make sense to close this issue and instead file separate issues for different requests related to HDBSCAN (soft clustering, different algorithms, inference, prediction_data, etc.)? Or should we perhaps collect the various individual requests in this issue?

cjnolet commented 2 years ago

@beckernick, yep I definitely think it makes sense to close this issue.

I also want to link relevant issues here for reference:

  1. soft clustering: #4467
  2. prediction / inference: #4448
  3. different distance measures (and additional planned features): #3879
  4. speeding up the algorithm for data with 3-15 dimensions: #4161