scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.78k stars 497 forks source link

HDBSCAN Mutual Reachability MST(ext) nondeterministic? #64

Open peekxc opened 8 years ago

peekxc commented 8 years ago

I've been reading over the paper to try to gain an understanding of the algorithm, but I've run into a bit of a problem. After exporting the mutual reachability distance matrix from HDBSCAN, running other MST implementations creates graphs that are not completely equivalent to the one produced by scikit here. After some investigation, I found that there were in fact non-unique distances in the reachability graph, which it seems prevents MST algorithms from arriving at a unique solution. How does HDBSCAN handle this?

@lmcinnes

lmcinnes commented 8 years ago

There are actually potentially a lot of non-unqiue distances, so the MST is certainly not guaranteed unique. The different MSTs should provide equivalent single linkage trees however (I would have to sit down and work through the details to be sure, but I believe this is the case), and thus result in identical clusterings.