scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Producing parent clusters from HDBSCAN clustering #545

Open artmatsak opened 2 years ago

artmatsak commented 2 years ago

Given an HDBSCAN clustering, we'd like to merge some of the clusters to produce parent clusters. The ultimate goal is two-level clustering. A promising approach would be to iteratively merge the clusters that are closest together. The condensed tree looks like a good starting point. The documentation states:

The question now is what does the cluster hierarchy look like – which clusters are near each other, or could perhaps be merged, and which are far apart.

That's exactly what we need but I have the following questions:

  1. Given the condensed tree, how do we identify the two clusters that are closest together and that should be merged at the next step?
  2. To kick off the agglomeration process, how do we identify the clusters ultimately selected by HDBSCAN in the condensed tree? (Is CondensedTree._select_clusters() the way?)

Thank you!

EquinoxElahin commented 2 years ago

Bumping this up. For point 2 especially, did you find out how to do it?

lmcinnes commented 2 years ago

Yes, CondensedTree._select_clusters() selects out the clusters -- it returns the ids in the tree of the clusters that would get selected. One caveat: if you are using fancier selection approaches such as cluster selection epsilon, it will not account for that; it only understands leaf and eom cluster selection at the moment.

If you want to map those selected cluster ids to the labels in the clusterer.labels_ it is simply a matter of sorting them numerically -- cluster label n refers to the nth cluster_id in the sorted list.
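The sorting step can be sketched like this (the cluster ids here are made-up values standing in for whatever `_select_clusters()` returns):

```python
# Hypothetical tree ids returned by _select_clusters(); the values are
# illustrative. Label n in clusterer.labels_ corresponds to the nth id
# in the numerically sorted list.
selected = [212, 205, 209]  # assumed _select_clusters() output

id_to_label = {cid: n for n, cid in enumerate(sorted(selected))}
print(id_to_label)  # {205: 0, 209: 1, 212: 2}
```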