scikit-learn-contrib / hdbscan

A high-performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Soft Clustering with Leaf Method #360

Open avn3r-dn opened 4 years ago

avn3r-dn commented 4 years ago

When I use cluster_selection_method='leaf' with cluster_selection_epsilon=eps > 0.0, hdbscan.all_points_membership_vectors(clusterer) produces incorrect soft clusters.

For example, clusterer.labels_ yielded 14 clusters plus noise, so I expected my soft clustering output to have shape (N, 14), but it has shape (N, 90), because the soft clustering is computed as if epsilon=0. I can confirm this only happens when I use the 'leaf' method, which I need.
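A minimal sketch of the reported behavior on synthetic data (the dataset, epsilon value, and cluster counts here are illustrative, not taken from the original report):

```python
import hdbscan
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real data; numbers are illustrative.
X, _ = make_blobs(n_samples=2000, centers=14, random_state=0)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=4,
    cluster_selection_method='leaf',
    cluster_selection_epsilon=0.5,  # any eps > 0.0
    prediction_data=True,           # required for soft clustering
).fit(X)

n_clusters = clusterer.labels_.max() + 1
soft = hdbscan.all_points_membership_vectors(clusterer)

# Reported bug: soft.shape[1] reflects the epsilon=0 leaf clusters,
# so it can be much larger than n_clusters.
print(n_clusters, soft.shape)
```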

lmcinnes commented 4 years ago

I don't think cluster_selection_epsilon is properly integrated with the soft cluster membership right now -- and realistically I don't expect it to be anytime soon. Sorry.

avn3r-dn commented 4 years ago

No problem

Thanks for the reply @lmcinnes. If it's not too much trouble: I'm trying to cluster CNN image features. I have lots of images and lots of classes, around 1M+ images and 1K+ classes, but the classes vary widely in size (min_cluster_size=4, max_cluster_size=1000). I actually want to partition the data rather than cluster it per se, but I don't know K and can't afford to loop over candidate values of K. So I need a partitioning algorithm that doesn't require K, handles varying density, and copes with many classes and huge differences in cluster sizes. I can't get hdbscan to give me reasonable results for this type of problem: noise tends to be 25-50% of the data and/or most of the data ends up in one big cluster.

Any suggestions are appreciated. I've tried changing all the methods, metrics, leaf selection, epsilon, ...

lmcinnes commented 4 years ago

The best I can offer is to use leaf clustering with cluster_selection_epsilon as you were, and then try computing centroids for the clusters (there is a recently added weighted_cluster_centroid method that you can pass a cluster_id to, which does this somewhat intelligently). Given those, you can partition K-means style around the centroids. This isn't great, but it is likely the best you can manage given all the constraints.
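A rough sketch of this workaround, assuming the weighted_cluster_centroid method on fitted HDBSCAN objects referenced above; the data and parameter values are illustrative:

```python
import numpy as np
import hdbscan
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

# Illustrative data; substitute the real feature matrix here.
X, _ = make_blobs(n_samples=5000, centers=20, random_state=0)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=4,
    cluster_selection_method='leaf',
    cluster_selection_epsilon=0.5,  # illustrative value
).fit(X)

# One centroid per selected cluster; noise (label -1) has no centroid.
n_clusters = clusterer.labels_.max() + 1
centroids = np.vstack([
    clusterer.weighted_cluster_centroid(cluster_id)
    for cluster_id in range(n_clusters)
])

# Partition K-means style: assign every point, noise included,
# to its nearest cluster centroid.
partition = pairwise_distances_argmin(X, centroids)
```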

shadowk29 commented 4 years ago

I am having the same issue, or at least a related one. In my case the soft clustering output is generally inconsistent with the clusters actually assigned: as noted above, it often reports a nonzero probability of membership in a cluster that does not exist in labels_ when cluster_selection_epsilon is explicitly set. Even when that argument is not used, the argmax of the probabilities produced by the soft clustering is often not the same as the cluster that is actually assigned.

Is this something that I could reasonably help fix? That is, do you know what the issue is and it's just a matter of finding time to fix it, or is the underlying cause still mysterious?
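A small check for the inconsistency described above; a sketch only, assuming a clusterer fit with prediction_data=True on synthetic, illustrative data:

```python
import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=10, random_state=0)
clusterer = hdbscan.HDBSCAN(min_cluster_size=15,
                            prediction_data=True).fit(X)

soft = hdbscan.all_points_membership_vectors(clusterer)
hard = clusterer.labels_

# Compare each point's hard label with the argmax of its soft
# membership vector, ignoring noise points (label -1).
clustered = hard >= 0
mismatch = np.flatnonzero(clustered & (soft.argmax(axis=1) != hard))
print(f"{mismatch.size} of {clustered.sum()} clustered points disagree")
```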

apcamargo commented 4 years ago

@avn3r-dn You mentioned that you only observed this issue when using cluster_selection_method='leaf'; however, I'm experiencing the same thing even with cluster_selection_method='eom'. Were you using some other non-default parameters?

sabarish-akridata commented 4 years ago

Have a look at #398 for one solution to this issue.