Closed mvss80 closed 2 years ago
HDBSCAN is on our radar, but it is not on our roadmap currently.
That being said, it certainly helps us with future prioritization to know that individuals in the community are interested in this.
Any development regarding HDBSCAN? Using UMAP+ HDBSCAN is quite the wonder!
@noob-procrastinator there has absolutely been progress on HDBSCAN. We are currently in the process of integrating single-linkage hierarchical clustering (https://github.com/rapidsai/raft/pull/140) as a core building block for HDBSCAN. The major challenge has been scaling these algorithms on a KNN graph in order to lower the O(n^2) memory requirement.
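To illustrate why single-linkage clustering is a natural building block here: agglomerative single-linkage merges are equivalent to adding the edges of a minimum spanning tree in ascending weight order. Below is a minimal pure-Python sketch of that idea using Prim's algorithm on a full pairwise distance matrix; this is illustration only, not the RAFT/cuML implementation, which is precisely about avoiding this quadratic cost by working on a KNN graph.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mst_edges(points):
    """Prim's algorithm; returns the MST as (weight, i, j) edges, sorted by weight."""
    n = len(points)
    # best known connection from the growing tree to each outside vertex
    best = {j: (euclidean(points[0], points[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda v: best[v][0])  # cheapest vertex to attach
        w, i = best.pop(j)
        edges.append((w, i, j))
        for v in best:  # relax distances through the newly added vertex
            d = euclidean(points[j], points[v])
            if d < best[v][0]:
                best[v] = (d, j)
    return sorted(edges)

# Two tight pairs far apart: the heaviest MST edge is the bridge between
# the two clusters, so cutting it recovers the single-linkage split.
pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0)]
print(mst_edges(pts)[-1])
```

Cutting MST edges from heaviest to lightest replays the single-linkage dendrogram top-down, which is exactly the structure HDBSCAN then condenses.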
I have a draft PR open for HDBSCAN so you can track progress: https://github.com/rapidsai/cuml/pull/3546
@noob-procrastinator,
An experimental version of HDBSCAN was merged in version 21.06, and we are continuing to improve it in 21.08 and beyond. Please let us know if you find bugs, and whether there are any missing options / features you'd like to use.
tl;dr @cjnolet does your implementation require more memory than the original hdbscan implementation (https://hdbscan.readthedocs.io)?
My data is a few million points in 768-dimensional space. I also used UMAP to reduce the dimensionality, but it wasn't enough.
I benchmarked both agglomerative clustering from the fastcluster library and hdbscan. Agglomerative clustering was much faster but required too much RAM; HDBSCAN used very little memory but took much more time. The RAPIDS implementation seems to be much faster, but is the memory usage the same?
@versis,
The current HDBSCAN implementation in cuML uses brute-force nearest neighbors instead of having to materialize the full pairwise distance matrix in memory. We also built a primitive to connect the KNN graph in order to get the exact solution (and guarantee convergence of the MST). We are planning to use approximate nearest neighbors in a future release to speed up the computation further, but the brute-force approach works quite well for the initial version.
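The reason a KNN search suffices is that HDBSCAN builds its MST on the mutual reachability distance, d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b)), where core_k(x) is the distance from x to its k-th nearest neighbor; core distances only require a (brute-force or approximate) KNN query per point. A minimal pure-Python sketch of that transform (illustrative only, not cuML's GPU code):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def core_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor."""
    ds = sorted(dist(points[i], p) for j, p in enumerate(points) if j != i)
    return ds[k - 1]

def mutual_reachability(points, i, j, k):
    """max(core_k(i), core_k(j), d(i, j)) -- the distance HDBSCAN's MST uses."""
    return max(core_distance(points, i, k),
               core_distance(points, j, k),
               dist(points[i], points[j]))

pts = [(0.0,), (1.0,), (2.0,), (10.0,)]
# With k=2, the core distance of the point at 0.0 is 2.0, which exceeds its
# raw distance of 1.0 to the point at 1.0 -- sparse regions get pushed apart.
print(mutual_reachability(pts, 0, 1, k=2))
```

Inflating distances in sparse regions this way is what makes the resulting single-linkage tree robust to noise points.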
The GoogleNews Word2Vec dataset (3M x 300) takes ~22 min on my V100 (32GB). We are going to work to make it even faster, but that's still very good considering the scikit-learn-contrib version had to be stopped after a day of running. Also, our initial version supports only the base set of features in the algorithm, so please let us know if there are missing features you'd like to see, such as fuzzy clustering or inference on out-of-sample data points.
@cjnolet Thank you for the quick response. 22 min seems promising! We will take a look, although inference is a must-have feature for us ;/
Will you be including the soft clustering capability? https://hdbscan.readthedocs.io/en/latest/soft_clustering.html#soft-clustering-for-hdbscan
@dantegd @cjnolet does it make sense to close this issue and instead file separate issues for the different requests related to HDBSCAN (soft clustering, different algorithms, inference, prediction_data, etc.)? Or should we collect the various individual requests in this issue?
@beckernick, yep I definitely think it makes sense to close this issue.
I also want to link relevant issues here for reference:
Are there any plans to get HDBSCAN implemented?
https://github.com/scikit-learn-contrib/hdbscan