Open avn3r-dn opened 4 years ago
I don't think cluster_selection_epsilon is properly integrated with the soft cluster membership right now, and realistically I don't expect it to be anytime soon. Sorry.
No problem
Thanks for the reply @lmcinnes. If it's not too much trouble: I am trying to cluster CNN image features. I have lots of images and lots of classes, around 1M+ images and 1k+ classes, but the classes vary in size (min_cluster_size=4, max_cluster_size=1000). I actually want to partition the data, not cluster it per se, but I don't know K and I can't afford a for loop to find K. So I need a partitioning algorithm that doesn't require K, handles varying density, and can handle lots of classes and huge differences in cluster sizes. I can't get hdbscan to give me reasonable results for this type of problem: noise tends to be 25-50% of the data and/or most of the data ends up in one big cluster.
Any suggestions appreciated. I have tried changing all the methods, metrics, leaf, epsilon, ...
The best I can offer is to use the leaf clustering with the cluster_selection_epsilon as you were and then try computing centroids for the clusters (there is a recently added weighted_centroid method that you can pass a cluster_id to that does this somewhat intelligently). Given that, you can partition K-means style around those centroids. This isn't great, but it is likely the best you can manage given all the constraints.
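A rough sketch of that centroid-then-partition idea, in case it helps. It deliberately uses only NumPy so it does not depend on any particular hdbscan version: `weighted_centroids` and `partition_by_nearest_centroid` are hypothetical helper names, and weighting each point by its `probabilities_` value is an assumption about what "somewhat intelligently" means here, not a description of the library's internals. In real use, `labels` and `probs` would come from `clusterer.labels_` and `clusterer.probabilities_`.

```python
import numpy as np

def weighted_centroids(X, labels, weights):
    """Weighted mean of each non-noise cluster (label -1 is skipped)."""
    cluster_ids = np.unique(labels[labels >= 0])
    centroids = np.vstack([
        np.average(X[labels == c], axis=0, weights=weights[labels == c])
        for c in cluster_ids
    ])
    return cluster_ids, centroids

def partition_by_nearest_centroid(X, cluster_ids, centroids, chunk_size=10_000):
    """Assign every row of X, noise included, to its nearest centroid."""
    out = np.empty(len(X), dtype=cluster_ids.dtype)
    c_sq = (centroids ** 2).sum(axis=1)
    for start in range(0, len(X), chunk_size):
        chunk = X[start:start + chunk_size]
        # Squared Euclidean distance via |x|^2 - 2 x.c + |c|^2, which avoids
        # materialising a (chunk, k, d) array when k and d are large.
        d2 = (chunk ** 2).sum(axis=1)[:, None] - 2.0 * chunk @ centroids.T + c_sq[None, :]
        out[start:start + chunk_size] = cluster_ids[d2.argmin(axis=1)]
    return out

# Toy stand-ins for the data, clusterer.labels_ and clusterer.probabilities_.
X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0],
              [5.2, 5.0], [0.4, 0.1], [4.0, 4.5]])
labels = np.array([0, 0, 1, 1, -1, -1])
probs = np.array([1.0, 0.5, 1.0, 1.0, 0.0, 0.0])

ids, cents = weighted_centroids(X, labels, probs)
partition = partition_by_nearest_centroid(X, ids, cents)
```

The result is a hard partition with no noise label: the two points hdbscan marked as noise are folded into whichever centroid is closest. Euclidean distance is also an assumption; a different metric would need a different distance computation.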
I am having the same issue, or at least a related one. Only in my case it appears that the soft_clusters output is generally inconsistent with the clusters being assigned and, as noted above, often indicates a nonzero probability of being a member of a cluster that does not exist in labels_ when cluster_selection_epsilon is explicitly set. Even when that arg is not used, the argmax of the probabilities generated by soft_clusters is often not the same as the cluster that is actually assigned.
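For anyone who wants to quantify the mismatch on their own data, here is a small NumPy-only check. `check_soft_vs_hard` is a hypothetical helper, and it assumes that column j of the membership matrix corresponds to label j, which only makes sense when the number of columns equals the number of hard clusters. In practice `membership` would be the output of `hdbscan.all_points_membership_vectors(clusterer)`, which, as far as I know, requires fitting with `prediction_data=True`.

```python
import numpy as np

def check_soft_vs_hard(labels, membership):
    """Compare hard labels with the argmax of a soft membership matrix."""
    labels = np.asarray(labels)
    membership = np.asarray(membership)

    n_hard = len(np.unique(labels[labels >= 0]))  # noise (-1) is not a cluster
    n_soft = membership.shape[1]
    report = {
        "n_hard_clusters": n_hard,
        "n_soft_columns": n_soft,
        "shapes_match": n_hard == n_soft,
        "argmax_agreement": None,
    }

    assigned = labels >= 0
    # Assumption: column j <-> label j. Skip the comparison when the column
    # count already disagrees, since the mapping is then undefined.
    if report["shapes_match"] and assigned.any():
        soft_labels = membership[assigned].argmax(axis=1)
        report["argmax_agreement"] = float((soft_labels == labels[assigned]).mean())
    return report

# Toy example: three assigned points and one noise point.
labels = np.array([0, 0, 1, -1])
membership = np.array([[0.9, 0.1],
                       [0.4, 0.6],
                       [0.2, 0.8],
                       [0.5, 0.5]])
report = check_soft_vs_hard(labels, membership)

# Same labels, but a membership matrix with an extra column.
wide_report = check_soft_vs_hard(labels, np.ones((4, 3)) / 3)
```

An `argmax_agreement` well below 1.0 reproduces the argmax inconsistency described here, and `shapes_match` being false reproduces the column-count mismatch reported in the original post.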
Is this something that I could reasonably help fix? I.e. do you know the issue and it's just a matter of finding time to fix it, or is the underlying issue still mysterious?
@avn3r-dn You mentioned that you only observed this issue when using cluster_selection_method='leaf'; however, I'm experiencing the same thing even when using cluster_selection_method='eom'. Were you using some other non-default parameters?
Have a look at #398 for one solution to this issue.
When I use cluster_selection_method='leaf' and cluster_selection_epsilon=eps > 0.0, hdbscan.all_points_membership_vectors(clusterer) produces wrong soft clusters. For example, clusterer.labels_ resulted in 14 clusters + noise, so I expect my soft clustering to have shape (N, 14), but it has shape (N, 90) because it's doing the soft clustering assuming epsilon=0. I can confirm it only happens when I use the 'leaf' method, which I need.