Open JunFang-NWPU opened 4 years ago
You need a mapping from the cluster labels in the output to node ids in the condensed tree. From there it is just a matter of following the tree (finding the parent that has this node as a child, or looking at the children of this node). If you look through the code of get_clusters in hdbscan.plots you can see one approach to getting this sort of label mapping.
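A minimal sketch of the "following the tree" step, not taken from the library itself. It assumes each row has the fields `parent`, `child`, `lambda_val` and `child_size` (the columns of `clusterer.condensed_tree_.to_pandas()`), and that cluster nodes are the rows with `child_size > 1`. The toy tree and the helper names `child_clusters`, `parent_cluster` and `points_under` are hypothetical, made up for illustration.

```python
from collections import namedtuple

# Same field names as the columns of clusterer.condensed_tree_.to_pandas().
Row = namedtuple("Row", ["parent", "child", "lambda_val", "child_size"])

# Hypothetical toy tree for 8 points (ids 0-7). Node 8 is the root;
# ids >= 8 are cluster nodes, recognisable here by child_size > 1.
TREE = [
    Row(8, 9, 0.5, 4), Row(8, 10, 0.5, 4),
    Row(9, 0, 1.0, 1), Row(9, 1, 1.0, 1), Row(9, 2, 1.2, 1), Row(9, 3, 1.2, 1),
    Row(10, 11, 0.8, 2), Row(10, 12, 0.8, 2),
    Row(11, 4, 2.0, 1), Row(11, 5, 2.0, 1),
    Row(12, 6, 1.5, 1), Row(12, 7, 1.5, 1),
]


def child_clusters(tree, node):
    """Cluster nodes directly below `node` (its sub-clusters)."""
    return sorted(r.child for r in tree if r.parent == node and r.child_size > 1)


def parent_cluster(tree, node):
    """Node directly above `node` (its super-cluster), or None for the root."""
    for r in tree:
        if r.child == node:
            return r.parent
    return None


def points_under(tree, node):
    """All data points attached to `node` or to any cluster below it."""
    points, stack = [], [node]
    while stack:
        current = stack.pop()
        for r in tree:
            if r.parent == current:
                # Cluster children are explored later; point children are collected.
                (stack if r.child_size > 1 else points).append(r.child)
    return sorted(points)


print(child_clusters(TREE, 10))  # [11, 12]
print(parent_cluster(TREE, 11))  # 10
print(points_under(TREE, 10))    # [4, 5, 6, 7]
```

With a fitted clusterer, something like `list(clusterer.condensed_tree_.to_pandas().itertuples(index=False))` should give rows of the same shape to pass in as `tree`.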
Yes, the key is the mapping from labels to ids in the condensed tree. Thanks, I will give it a try.
@lmcinnes would you please be able to share a small snippet that does this? It would be super helpful! 🙏🏽
There are a few open issues regarding this cluster mapping and sub-cluster topic: https://github.com/scikit-learn-contrib/hdbscan/issues/451, https://github.com/scikit-learn-contrib/hdbscan/issues/442
Maybe this snippet will be helpful:
import numpy as np

def find_all_subclusters(clusterer, cluster_labels):
    tree = clusterer.condensed_tree_
    tree_df = tree.to_pandas()
    n_points = len(cluster_labels)  # ids below n_points are data points, the rest are cluster nodes

    def get_subclusters(node):
        children = tree_df[tree_df['parent'] == node]
        # If there are no children or only leaf children, return the node itself
        if children.empty or all(children['child'] < n_points):
            return {node: [int(c) for c in children['child']]}
        # Recursively get subclusters for non-leaf children
        subclusters = {}
        for _, child in children.iterrows():
            child_id = int(child['child'])  # iterrows() yields floats, so cast back to int
            if child_id >= n_points:
                subclusters.update(get_subclusters(child_id))
            else:
                subclusters[node] = subclusters.get(node, []) + [child_id]
        return subclusters

    all_subclusters = {}
    unique_labels = np.unique(cluster_labels)
    for label in unique_labels:
        if label != -1:  # Exclude noise points
            cluster_points = np.where(cluster_labels == label)[0]
            # Heuristic: take the first tree node whose size equals the number of
            # points with this label. This is ambiguous if two nodes share a size,
            # and .iloc[0] raises IndexError if no node has that size.
            cluster_node = tree_df[tree_df['child_size'] == len(cluster_points)]['child'].iloc[0]
            all_subclusters[label] = get_subclusters(cluster_node)
    return all_subclusters

# Assuming you have already run HDBSCAN and have cluster_labels
# clusterer = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=15)
# cluster_labels = clusterer.fit_predict(data)
all_subclusters = find_all_subclusters(clusterer, cluster_labels)
for cluster_label, subclusters in all_subclusters.items():
    print(f"\nCluster {cluster_label}:")
    total_points = 0
    for subcluster, points in subclusters.items():
        print(f"  Subcluster {subcluster}: {len(points)} points")
        total_points += len(points)
    print(f"Total points in all subclusters: {total_points}")
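The size-based lookup in the snippet above can pick the wrong node when two nodes have the same size. One possible alternative, sketched here under assumptions rather than taken from the library: start at one point carrying the label and walk up the tree until the node covers exactly the points with that label. This assumes every point under a selected cluster node receives that cluster's label. The toy tree, the labels and the helper names (`points_under`, `node_for_label`) are hypothetical.

```python
from collections import namedtuple

# Same field names as the columns of clusterer.condensed_tree_.to_pandas().
Row = namedtuple("Row", ["parent", "child", "lambda_val", "child_size"])

# Hypothetical toy tree for 8 points (ids 0-7); node 8 is the root.
TREE = [
    Row(8, 9, 0.5, 4), Row(8, 10, 0.5, 4),
    Row(9, 0, 1.0, 1), Row(9, 1, 1.0, 1), Row(9, 2, 1.2, 1), Row(9, 3, 1.2, 1),
    Row(10, 11, 0.8, 2), Row(10, 12, 0.8, 2),
    Row(11, 4, 2.0, 1), Row(11, 5, 2.0, 1),
    Row(12, 6, 1.5, 1), Row(12, 7, 1.5, 1),
]
# Hypothetical labels: nodes 9, 11 and 12 were selected as clusters 0, 1 and 2.
LABELS = [0, 0, 0, 0, 1, 1, 2, 2]


def points_under(tree, node):
    """All data points attached to `node` or to any cluster below it."""
    points, stack = set(), [node]
    while stack:
        current = stack.pop()
        for r in tree:
            if r.parent == current:
                if r.child_size > 1:
                    stack.append(r.child)
                else:
                    points.add(r.child)
    return points


def node_for_label(tree, labels, label):
    """Tree node covering exactly the points with `label`, or None if there is none."""
    members = {i for i, lab in enumerate(labels) if lab == label}
    parent_of = {r.child: r.parent for r in tree}
    node = parent_of.get(min(members)) if members else None
    while node is not None:
        covered = points_under(tree, node)
        if covered == members:
            return node
        if not covered <= members:
            return None  # overshot: this node also holds points with other labels
        node = parent_of.get(node)
    return None


parent_of = {r.child: r.parent for r in TREE}
mapping = {label: node_for_label(TREE, LABELS, label) for label in sorted(set(LABELS))}
print(mapping)                # {0: 9, 1: 11, 2: 12}
print(parent_of[mapping[1]])  # super-cluster of cluster 1: 10
```

Once a label is mapped to a node, the super-cluster is that node's parent and the sub-clusters are its children with `child_size > 1`.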
Hi,
I obtained some clusters using hdbscan. Some clusters contain too many points, and some contain too few. I know I can find the sub-clusters or super-cluster of a cluster from its hierarchy (e.g., condensed_tree_), but it does not seem to be an easy task.
Can someone show how to achieve this? That is, given a cluster label, find its sub-clusters and super-cluster.
Thanks.