scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.78k stars, 497 forks

Add branch detection functionality #648

Closed. JelmerBot closed this 1 month ago

JelmerBot commented 2 months ago

Dear maintainers,

This pull request adds the branch detection functionality from our preprint (https://arxiv.org/abs/2311.15887). Our main goal was to detect branch hierarchies within already detected clusters. These hierarchies describe cluster shapes, which can reveal subgroups not expressed in the density profile.

I have tried to add the functionality in a way that minimises its impact on the codebase. I settled on the pattern used by the prediction module. The main usage pattern now looks like this:

from hdbscan import HDBSCAN, BranchDetector

clusterer = HDBSCAN(branch_detection_data=True).fit(data)
branch_detector = BranchDetector().fit(clusterer)

The BranchDetector class mimics the HDBSCAN class and provides access to labels, membership probabilities, the detected hierarchies, and more. This way, end users who just want clusters do not have to interact with the branch detection functionality at all.
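For anyone who wants to try the pattern on toy data, here is a minimal sketch. The helpers `make_y_shaped_data` and `fit_branches` are hypothetical names made up for this illustration and are not part of the PR. The sketch assumes this PR's branch is installed, so that `BranchDetector` and the `branch_detection_data` argument are available. Reading `labels_` from the branch detector is an assumption based on the statement above that it mimics the HDBSCAN class.

```python
import numpy as np


def make_y_shaped_data(n_per_arm=200, noise=0.05, seed=0):
    """Sample a single Y-shaped 2D point cloud (hypothetical helper).

    Three arms leave the origin at 90, 210 and 330 degrees. Such a shape
    tends to form one density cluster whose arms are candidate branches.
    """
    rng = np.random.default_rng(seed)
    angles = np.deg2rad([90.0, 210.0, 330.0])
    arms = []
    for angle in angles:
        # Positions along the arm, measured from the origin.
        t = rng.uniform(0.0, 1.0, size=n_per_arm)
        arm = np.column_stack((t * np.cos(angle), t * np.sin(angle)))
        # Isotropic Gaussian jitter gives the arm some width.
        arms.append(arm + rng.normal(scale=noise, size=arm.shape))
    return np.vstack(arms)


def fit_branches(data):
    """Run the two-step pattern from this PR (hypothetical helper).

    The import lives inside the function so that the data helper above
    still works when this PR's version of hdbscan is not installed.
    """
    from hdbscan import HDBSCAN, BranchDetector

    # Step 1: ordinary clustering, keeping the extra data that branch
    # detection needs.
    clusterer = HDBSCAN(branch_detection_data=True).fit(data)
    # Step 2: detect branches within the clusters found in step 1.
    branch_detector = BranchDetector().fit(clusterer)
    return clusterer, branch_detector
```

Calling `clusterer, branch_detector = fit_branches(make_y_shaped_data())` and then comparing `clusterer.labels_` with `branch_detector.labels_` (the latter attribute name is assumed) should show whether the arms are separated into branch-level labels. No expected output is given here because the result depends on the parameters and on the installed version.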

I needed to make a couple of unrelated changes in Cython code to make all tests pass on my machine. I will try to mark these changes with review comments in the PR. Please advise on whether I should remove these changes from the PR or keep them in.

I hope you will consider merging this PR. Let me know if things need to be fixed / changed to better match your vision for the project.

Kind regards,

Jelmer Bot

review-notebook-app[bot] commented 2 months ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

