scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.77k stars, 497 forks

hdbscan and sparse precomputed distance matrix #636

Open KukumavMozolo opened 3 months ago

KukumavMozolo commented 3 months ago

Hi!

I am working on the following problem: I have millions of sparse, very high-dimensional data points.

Using a sparse precomputed distance matrix seems to be one way to feed this data into hdbscan.

My current idea is to store only those distances that fall below a certain threshold, or to keep a fixed number of distances for every point, and then to ensure that the resulting graph has no disconnected components. How do hdbscan's hyperparameters interact with the required level of sparsity of that matrix? For example, given fixed min_cluster_size, min_samples, and cluster_selection_epsilon, how would they constrain the threshold or the number of distances per point so that the resulting clustering is no different from the one obtained with the full distance matrix?
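To make the second variant concrete (a fixed number of distances per point, followed by a connectivity check), here is a minimal sketch using NumPy and SciPy only. The helper names `knn_sparse_distances` and `is_connected` are hypothetical, the Euclidean metric is an assumption, and the brute-force neighbour search is for illustration on small inputs; it does not call hdbscan and says nothing about which sparsity level preserves the full-matrix clustering.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def knn_sparse_distances(X, k):
    """Return a symmetric sparse matrix with each point's k nearest distances.

    Hypothetical helper: brute force, so only usable for small n. At millions
    of points an approximate nearest-neighbour index would be needed instead.
    """
    n = X.shape[0]
    # Dense pairwise Euclidean distances (assumed metric).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # never pick a point as its own neighbour

    # Column indices of the k smallest distances in each row.
    nbrs = np.argpartition(dist, k, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    vals = dist[rows, cols]

    graph = csr_matrix((vals, (rows, cols)), shape=(n, n))
    # Symmetrise: keep an edge if either endpoint selected it.
    return graph.maximum(graph.T)


def is_connected(graph):
    """True if the undirected graph of stored distances has one component."""
    n_components, _ = connected_components(graph, directed=False)
    return n_components == 1


# Two well-separated blobs of five points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(5, 3)),
    rng.normal(10.0, 0.1, size=(5, 3)),
])

sparse_small_k = knn_sparse_distances(X, k=2)
sparse_full = knn_sparse_distances(X, k=9)

print(is_connected(sparse_small_k))  # neighbours stay inside each blob
print(is_connected(sparse_full))     # every pairwise distance is stored
```

With a small k on well-separated data, every neighbour falls inside the point's own blob, so the graph splits into several components. That is exactly the case the connectivity check is meant to catch before k or the threshold is raised.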

sa2329 commented 3 weeks ago

@KukumavMozolo any success with this? My (limited) experience with this is that hdbscan does not support sparse matrices for the distance matrix.

KukumavMozolo commented 3 weeks ago

Unfortunately, I gave up on it for now and just uniformly down-sampled the data :(