Open KukumavMozolo opened 3 months ago
@KukumavMozolo any success with this? My (limited) experience with this is that hdbscan does not support sparse matrices for the distance matrix.
Unfortunately, I gave up on it for now and just uniformly down-sampled the data :(
Hi!
I am working on the following problem: I have millions of sparse data points that are very high-dimensional.
Using a sparse precomputed distance matrix seems to be one way to feed this data into hdbscan.
My current idea is to store only those distances that are below a certain threshold, or to keep a fixed number of distances for every point, and then ensure that there are no disconnected components in the resulting graph. How do hdbscan's hyperparameters interact with the required level of sparsity of that matrix? E.g., given fixed `min_cluster_size`, `min_samples` and `cluster_selection_epsilon`, how would that constrain the threshold or the number of distances per point so that the resulting clustering is no different from the one obtained with the full distance matrix?
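To make the idea concrete, here is a minimal sketch of the "fixed number of distances per point" variant together with the connectivity check. It uses only NumPy and SciPy; the helper names `knn_distance_matrix` and `is_connected` are made up for this illustration and are not part of hdbscan. The dense `cdist` call is only there to keep the toy example short and would have to be replaced by an approximate or sparse neighbour search for millions of points. It also does not answer the hyperparameter question; presumably `k` would need to be at least `min_samples` so that every point's core distance can be read from its stored neighbours, but that is an assumption, not something confirmed here.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist


def knn_distance_matrix(X, k):
    """Symmetric sparse matrix storing each point's k nearest distances.

    Hypothetical helper for illustration. Requires 1 <= k <= n - 1 and
    assumes no duplicate points (a zero distance would be treated as
    "no edge" by the sparse representation).
    """
    n = X.shape[0]
    D = cdist(X, X)                  # dense, toy-sized data only
    np.fill_diagonal(D, np.inf)      # never pick a point as its own neighbour
    # Indices of the k smallest distances in every row (unordered).
    idx = np.argpartition(D, k, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    cols = idx.ravel()
    S = csr_matrix((D[rows, cols], (rows, cols)), shape=(n, n))
    # Keep an edge if either endpoint selected it, so the matrix is symmetric.
    return S.maximum(S.T)


def is_connected(S):
    """True if the graph defined by the stored distances has one component."""
    n_components, _ = connected_components(S, directed=False)
    return n_components == 1


# Two well-separated pairs of points on a line.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
print(is_connected(knn_distance_matrix(X, 1)))  # each pair only links to itself
print(is_connected(knn_distance_matrix(X, 2)))  # second neighbour bridges the gap
```

If the check fails, one option is to increase `k` (or the threshold) until the graph becomes connected; another is to add a few explicit bridging edges between components. Whether either choice leaves the clustering identical to the full-matrix result is exactly the open question above.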