mhahsler / dbscan

Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package
GNU General Public License v3.0

hdbscan, distance matrix #35

Open kmzapp opened 5 years ago

kmzapp commented 5 years ago

Currently, the hdbscan function computes the complete distance matrix. Would it be possible to compute parts of it and use them sequentially for the mutual reachability distance, so that the intermediate results can be stored in smaller objects? I currently get an error message about a vector size that is too large when I run the function on a large dataset.
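To put a rough number on the problem, here is a back-of-the-envelope estimate. It assumes the pairwise distances are held as a single vector of n(n-1)/2 double-precision values (the layout of an R `dist` object); whether hdbscan allocates exactly this internally is an assumption, and `dist_vector_gib` is a hypothetical helper name used only for illustration.

```python
def dist_vector_gib(n, bytes_per_value=8):
    """Approximate size in GiB of a vector holding all n*(n-1)/2 pairwise distances."""
    pairs = n * (n - 1) // 2          # one entry per unordered pair of points
    return pairs * bytes_per_value / 2**30


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        print(f"n = {n:>9,}: about {dist_vector_gib(n):,.1f} GiB")
```

Under these assumptions, 100,000 points already need roughly 37 GiB for the distances alone, which is why a quadratic-size object becomes the limiting factor long before the data itself does.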

mhahsler commented 5 years ago

I think this would be a nice feature to have. I will refer this to Matt.

peekxc commented 5 years ago

I would love to have this as well. One could probably precompute the core distances only, and then change the MST code to compute the mutual reachability distances on demand. I can't remember if there was a reason for not doing that in the first place.
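A minimal sketch of that idea, written in plain Python rather than the package's R/C++ code and not taken from the dbscan sources: it precomputes the core distances, then runs Prim's algorithm and evaluates each mutual reachability distance only when it is needed, so nothing of size n x n is ever stored. The function names are hypothetical, the core distance is assumed to be the distance to the `min_pts`-th nearest point counting the point itself (conventions differ), and the brute-force neighbor search stands in for the k-d tree a real implementation would use.

```python
import math


def core_distances(points, min_pts):
    """Distance from each point to its min_pts-th nearest point, counting itself.

    Brute force: O(n) extra memory per point, O(n^2 log n) time overall.
    Requires min_pts <= len(points).
    """
    core = []
    for p in points:
        d = sorted(math.dist(p, q) for q in points)  # d[0] == 0.0 is the point itself
        core.append(d[min_pts - 1])
    return core


def mst_mutual_reachability(points, min_pts):
    """Prim's MST over the mutual reachability graph using O(n) working memory.

    Returns a list of (from_index, to_index, weight) edges.
    """
    n = len(points)
    core = core_distances(points, min_pts)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge from the tree to each point
    parent = [-1] * n       # tree endpoint of that cheapest edge
    edges = []
    current = 0
    in_tree[current] = True
    for _ in range(n - 1):
        nxt = -1
        for j in range(n):
            if in_tree[j]:
                continue
            # Mutual reachability distance, computed on demand and then discarded.
            d = max(core[current], core[j], math.dist(points[current], points[j]))
            if d < best[j]:
                best[j] = d
                parent[j] = current
            if nxt == -1 or best[j] < best[nxt]:
                nxt = j
        edges.append((parent[nxt], nxt, best[nxt]))
        in_tree[nxt] = True
        current = nxt
    return edges


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
    for a, b, w in mst_mutual_reachability(pts, min_pts=2):
        print(a, b, round(w, 3))
```

The trade-off in this sketch is time for memory: every distance is recomputed when Prim's algorithm visits it instead of being looked up, but the only arrays kept are of length n.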

But I'm open to suggestions; there's probably a better way. @kmzapp, did you have any other ideas on how to actually achieve this algorithmically?

kmzapp commented 5 years ago

Thank you for the quick reply. I was thinking it might be possible either to compute the distances on demand or to store them differently, so that no single object becomes too large. But I do not have a precise idea of how to achieve that.