scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

How to get sub-clusters or super-cluster of a cluster? #401

Open JunFang-NWPU opened 4 years ago

JunFang-NWPU commented 4 years ago

Hi,

I obtained some clusters using hdbscan. Some clusters contain too many points, and some contain too few. I know I can find the sub-clusters or super-cluster of a cluster from its hierarchy (e.g., the condensed tree), but it does not seem to be an easy task.

Can someone show how to achieve this? Given a cluster label, find its sub-clusters and super-cluster.

Thanks.

lmcinnes commented 4 years ago

You need to get a mapping from the cluster labels in the output to node ids in the condensed tree. From there it is just a matter of following the tree (finding the parent that has this node as a child, or looking at the children of this node). If you look through the code of get_clusters in hdbscan.plots you can see one approach to getting this sort of label mapping.
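A minimal sketch of the tree-following step, assuming the condensed tree is available as rows of (parent, child, lambda_val, child_size) and that ids below the number of data points are points while larger ids are cluster nodes. The toy tree and the helper names `super_cluster`, `sub_clusters` and `ancestors` are made up for illustration; they are not part of hdbscan.

```python
# Toy condensed tree as (parent, child, lambda_val, child_size) rows.
# The values are invented for illustration; a real tree would come from
# a fitted clusterer's condensed_tree_ attribute.
N_POINTS = 8  # ids 0..7 are data points, ids >= 8 are cluster nodes

ROWS = [
    (8, 7, 0.5, 1),
    (8, 9, 1.0, 3), (8, 10, 1.0, 4),
    (9, 0, 2.0, 1), (9, 1, 2.0, 1), (9, 2, 2.5, 1),
    (10, 11, 1.5, 2), (10, 12, 1.5, 2),
    (11, 3, 3.0, 1), (11, 4, 3.0, 1),
    (12, 5, 3.0, 1), (12, 6, 3.0, 1),
]


def super_cluster(rows, node):
    """Return the parent of `node`, or None if `node` is the root."""
    for parent, child, _lambda_val, _size in rows:
        if child == node:
            return parent
    return None


def sub_clusters(rows, node, n_points):
    """Return the child cluster nodes of `node` (point children are skipped)."""
    return [child for parent, child, _lambda_val, _size in rows
            if parent == node and child >= n_points]


def ancestors(rows, node):
    """Return every super-cluster of `node`, nearest first, ending at the root."""
    chain = []
    parent = super_cluster(rows, node)
    while parent is not None:
        chain.append(parent)
        parent = super_cluster(rows, parent)
    return chain


print(super_cluster(ROWS, 11))           # 10
print(sub_clusters(ROWS, 10, N_POINTS))  # [11, 12]
print(ancestors(ROWS, 12))               # [10, 8]
```

With a real clusterer the same lookups could be run on the rows of the condensed tree once the cluster label has been mapped to its node id.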

JunFang-NWPU commented 4 years ago

Yes, the key is the mapping from labels to ids in the condensed tree. Thanks. I will give it a try.

salman1993 commented 3 years ago

@lmcinnes would you please be able to share a small snippet that does this? it would be super helpful! 🙏🏽

there are a few open issues regarding this cluster mapping and sub cluster topic - https://github.com/scikit-learn-contrib/hdbscan/issues/451, https://github.com/scikit-learn-contrib/hdbscan/issues/442

eamag commented 2 months ago

Maybe this snippet will be helpful:

import numpy as np

def find_all_subclusters(clusterer, cluster_labels):
    tree = clusterer.condensed_tree_
    tree_df = tree.to_pandas()

    def get_subclusters(node):
        children = tree_df[tree_df['parent'] == node]

        # If there are no children or only leaf children, return the node itself
        if children.empty or all(children['child'] < len(cluster_labels)):
            return {node: list(children[children['child'] < len(cluster_labels)]['child'])}

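        # Recursively get subclusters for non-leaf children.
        # Iterate over the 'child' column directly: iterrows() would turn
        # every id into a float because of the float lambda_val column.
        subclusters = {}
        for child_id in children['child']:
            child_id = int(child_id)
            if child_id >= len(cluster_labels):
                subclusters.update(get_subclusters(child_id))
            else:
                subclusters[node] = subclusters.get(node, []) + [child_id]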

        return subclusters

    all_subclusters = {}
    unique_labels = np.unique(cluster_labels)

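    # Lookup from every child id (point or cluster node) to its parent node
    parent_of = dict(zip(tree_df['child'], tree_df['parent']))

    for label in unique_labels:
        if label != -1:  # Exclude noise points
            cluster_points = np.where(cluster_labels == label)[0]
            # Matching on child_size is ambiguous when two nodes have the same
            # size, so find the node as the common ancestor of the cluster's
            # points instead. This relies on child ids being larger than their
            # parent's id: replacing the largest id with its parent until one
            # node is left ends at the common ancestor.
            nodes = {parent_of[p] for p in cluster_points}
            while len(nodes) > 1:
                largest = max(nodes)
                nodes.remove(largest)
                nodes.add(parent_of[largest])
            cluster_node = nodes.pop()
            all_subclusters[label] = get_subclusters(cluster_node)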

    return all_subclusters

# Assuming you have already run HDBSCAN and have cluster_labels
# clusterer = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=15)
# cluster_labels = clusterer.fit_predict(data)

all_subclusters = find_all_subclusters(clusterer, cluster_labels)

for cluster_label, subclusters in all_subclusters.items():
    print(f"\nCluster {cluster_label}:")
    total_points = 0
    for subcluster, points in subclusters.items():
        print(f"  Subcluster {subcluster}: {len(points)} points")
        total_points += len(points)
    print(f"Total points in all subclusters: {total_points}")
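
The per-node lists returned by find_all_subclusters hold only the points attached directly to each node. If the full membership of one sub-cluster is wanted, the points of all its descendants have to be gathered as well. A rough sketch of that step on a made-up tree of (parent, child, lambda_val, child_size) rows; `points_under` is a hypothetical helper, not an hdbscan function:

```python
# Toy condensed tree; ids 0..7 are data points, ids >= 8 are cluster nodes.
# The values are invented for illustration only.
N_POINTS = 8

ROWS = [
    (8, 7, 0.5, 1),
    (8, 9, 1.0, 3), (8, 10, 1.0, 4),
    (9, 0, 2.0, 1), (9, 1, 2.0, 1), (9, 2, 2.5, 1),
    (10, 11, 1.5, 2), (10, 12, 1.5, 2),
    (11, 3, 3.0, 1), (11, 4, 3.0, 1),
    (12, 5, 3.0, 1), (12, 6, 3.0, 1),
]


def points_under(rows, node, n_points):
    """Collect every data point in the subtree rooted at `node`."""
    points, stack = [], [node]
    while stack:
        current = stack.pop()
        for parent, child, _lambda_val, _size in rows:
            if parent != current:
                continue
            if child < n_points:
                points.append(child)   # a data point attached to this node
            else:
                stack.append(child)    # a child cluster, visit it later
    return sorted(points)


print(points_under(ROWS, 10, N_POINTS))  # [3, 4, 5, 6]
print(points_under(ROWS, 11, N_POINTS))  # [3, 4]
```

The scan over all rows for every visited node is quadratic in the worst case; for large trees, grouping the rows by parent once up front would be the obvious improvement.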