scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.8k stars 503 forks source link

Algorithm parameter sparsely documented #173

Open ixxie opened 6 years ago

ixxie commented 6 years ago

HDBSCAN's documentation is amazing, really some of the best we have seen and (besides the amazing performance) one of the major factors for us choosing to utilize this library. Thank you for this great library, and its great documentation.

However, likely because HDBSCAN by default uses a heuristic to select the best algorithm for robust single linkage, the differences between these algorithms is not clear. In particular, it is often interesting to coerce an algorithm which permits parallelization to improve performance, but it is hard to know what the tradeoffs are.

A glance at the source code suggests the Boruvka variants are the one's supporting parallelization; do these variants employ some approximations? The source and https://github.com/scikit-learn-contrib/hdbscan/issues/156 also suggests a general preference to KDTree variants over BallTree variants. Otherwise it seems the heuristic chooses based on the metric choice, the sparsity of the input data, and its size. Would this be a correct assessment?

lmcinnes commented 6 years ago

I have to say that RobustSingleLinkage could really use some documentation love. I haven't done anywhere near enough work on that (and should really do an overhaul of some of the code too).

With regard to tradeoffs in the algorithm parameter -- those are reasonably easy to describe, but it would take a little while to write something sensible documentation-wise that makes a good case for the various choices. I'll add it to my queue, but would gladly accept doc PRs if you had the time.

ixxie commented 6 years ago

@lmcinnes - I would totally submit a PR but I have no idea how far off the mark I am with my assertions.

lmcinnes commented 6 years ago

That's fine. I'll see if I can get something documented soon. Thanks for the suggestion, I agree it is certainly something that deserves better documentation.

ixxie commented 6 years ago

Thank you! For now I am coercing HDBSCAN to use Boruvka KDTree in order to ensure concurrent execution; we haven't really noticed any quality drop so I am curious if we would run into issues in the future. One thing we noticed is that it still running only on four cores, although we specify eight are available (although this is a different issue).

lmcinnes commented 6 years ago

I think it is currently capped at 4 cores (there are some seriously diminishing returns for the portion that is parallelizeable). I think issue #160 covers that, with a suggested fix that I haven't had the time to implement yet