Closed jwijffels closed 2 years ago
hdbscan needs to compute a minimum spanning tree (MST) on the mutual reachability matrix (which is calculated from the distance matrix). What we would need is a way to go directly from the data to the MST without storing the whole distance/mutual reachability matrix, at least for Euclidean distance. I am not quite sure how to do that... Ideas?
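One possible direction, as a minimal sketch only: Prim's algorithm needs just one row of distances at a time, so the mutual reachability distances can be recomputed on the fly instead of being stored. The function name `mst_mutual_reachability` and the parameter `k` (which neighbor defines the core distance) are made up for illustration and are not part of dbscan; how `k` should map to `minPts` is an assumption left open here. The sketch is plain base R, uses O(n) memory, and still takes O(n^2) time, so it would be slow on large data.

```r
# Sketch: Prim's algorithm on mutual reachability distances, computing
# distances one row at a time so that no n x n matrix is ever stored.
# `mst_mutual_reachability` and `k` are illustrative names, not dbscan API.
mst_mutual_reachability <- function(x, k = 4) {
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(n >= 2, k >= 1, k < n)

  # Euclidean distances from point i to all points (O(n) memory).
  dist_row <- function(i) sqrt(rowSums(sweep(x, 2, x[i, ])^2))

  # Core distance (assumption): distance to the k-th nearest other point.
  core <- vapply(seq_len(n),
                 function(i) sort(dist_row(i)[-i])[k],
                 numeric(1))

  in_tree <- logical(n)
  best    <- rep(Inf, n)          # cheapest known edge from the tree to each point
  parent  <- rep(NA_integer_, n)  # tree endpoint of that cheapest edge
  edges   <- matrix(NA_real_, n - 1, 3,
                    dimnames = list(NULL, c("from", "to", "weight")))

  current <- 1L
  in_tree[current] <- TRUE
  for (step in seq_len(n - 1)) {
    # Mutual reachability from `current` to every point:
    # max(core[current], core[j], d(current, j)).
    mrd <- pmax(dist_row(current), core[current], core)

    # Relax edges leaving the point that was just added to the tree.
    improve <- !in_tree & mrd < best
    best[improve]   <- mrd[improve]
    parent[improve] <- current

    # Add the closest point that is not yet in the tree.
    cand <- which(!in_tree)
    nxt  <- cand[which.min(best[cand])]
    edges[step, ] <- c(parent[nxt], nxt, best[nxt])
    in_tree[nxt] <- TRUE
    current <- nxt
  }
  edges
}
```

The result is an edge list with `n - 1` rows (`from`, `to`, `weight`). Whether this could replace the current matrix-based path inside hdbscan, and whether it should be moved to C++ for speed, would need to be checked against the package internals.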
Initially I thought this could be covered by some bigmemory backend or even ALTREP, but there are probably smarter ways. I should take a closer look at the mutual reachability matrix calculation (https://github.com/mhahsler/dbscan/blob/master/src/mrd.cpp#L6) before I can offer any ideas.
I'm trying out top2vec, an algorithm for clustering texts implemented by @michalovadek. This algorithm first applies doc2vec to the texts to get document embeddings, then reduces these embeddings to a lower-dimensional space using `uwot::umap`, after which `dbscan::hdbscan` is applied to find clusters. When I try this on a corpus of approximately 50000 documents, it fails in the call to `dist` inside the call to `hdbscan` when passing a 2D matrix. A reproducible example with some fake data is shown below. Is there a way for hdbscan to handle more rows to cluster (possibly related to issue https://github.com/mhahsler/dbscan/issues/35)?
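For scale, here is a rough back-of-the-envelope estimate (not the reproducible example itself) of what a full `dist` object needs for 50000 rows, assuming one 8-byte double per pair of points. This may explain why the call fails on a typical machine.

```r
n       <- 50000
n_pairs <- n * (n - 1) / 2   # pairwise distances stored by dist()
bytes   <- n_pairs * 8       # one double (8 bytes) per distance
gib     <- bytes / 1024^3    # roughly 9.3 GiB
gib
```

The mutual reachability matrix derived from it would need a comparable amount again.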