scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Closest clusters are not consistent with the cluster labels #123

Open AndrewNg opened 7 years ago

AndrewNg commented 7 years ago

It seems that the closest_clusters are labeled with a different numbering system compared with the cluster labels (clusterer.labels_). I ran the example code on some data and expected the closest cluster to match up with the label whenever a given data point had a label (i.e., the data point was not noise and was not -1). However, the labels did not match up. [screenshot of the mismatched labels omitted]

The expectation is that a data point with label 4 will also have closest cluster 4. After going through the rest of the data, label 4 and closest cluster 3 map to the same cluster, but they are numbered inconsistently.

AndrewNg commented 7 years ago

After some more testing, it looks like the label numbers are reversed. Here's a mapping of the closest_cluster label to label:

defaultdict(None, {3: 4, 5: 2, 6: 1, 2: 5, 0: 7, 4: 3, 1: 6, 7: 0})
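
For reference, a mapping like this can be built directly from the two label arrays. A minimal sketch, assuming closest_clusters is the per-point argmax of hdbscan.all_points_membership_vectors (as in the soft-clustering docs) and substituting toy blob data for the original:

import hdbscan
import numpy as np
from collections import defaultdict
from sklearn.datasets import make_blobs

# Toy stand-in for the original data
data, _ = make_blobs(n_samples=500, centers=8, random_state=0)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(data)
soft = hdbscan.all_points_membership_vectors(clusterer)
closest_clusters = soft.argmax(axis=1)

# Build {closest_cluster -> hard label} over non-noise points only;
# an identity mapping means the two numbering systems agree.
mapping = defaultdict(None)
for hard, close in zip(clusterer.labels_, closest_clusters):
    if hard != -1:
        mapping[close] = hard
print(mapping)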

lmcinnes commented 7 years ago

That's a little disconcerting. I'll see if I can track down where things are getting reversed -- I suspect there is some duplicated/copy-paste code on my part that got updated in one place and not the other. Sorry about that.

CoffeRobot commented 7 years ago

I noticed something similar when using the soft clustering. Say the algorithm finds 2 clusters: if you print the probabilities computed with hdbscan.all_points_membership_vectors(hdbscan_clusterer), there are cases where the probabilities are inverted relative to the labels. For instance, I noticed something similar to the following:

datapoint_id  label  probs
1             1      [0.8, 0.2]
2             0      [0.3, 0.7]
3             1      [0.7, 0.3]
4             0      [0.1, 0.9]

Working with my data I noticed something else that might help in finding the problem. If I print out the cluster ids using hdbscan_clusterer.condensed_tree_._select_clusters(), the ids are not sorted when I have the problem, whereas when the labels correspond to the right probabilities the _select_clusters method outputs a sorted array of ids. I hope this helps.

I can also save some toy data I have to a CSV to reproduce this problem, if you like.
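
If it helps anyone reproduce the diagnosis, here is a minimal sketch of the sortedness check described above, on toy blob data; it relies on the private condensed_tree_._select_clusters() method, so it may break between versions:

import hdbscan
import numpy as np
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=300, centers=3, random_state=1)
hdbscan_clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(data)

# Private API referenced above: the cluster ids chosen during tree selection
selected = np.asarray(hdbscan_clusterer.condensed_tree_._select_clusters())
print(selected)
print("sorted:", bool(np.all(np.diff(selected) >= 0)))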

vivekbharadwaj commented 7 years ago

Hey @lmcinnes @AndrewNg. Any updates on this issue? I'm facing a similar problem on my real-world dataset, with no apparent relation between probabilities and cluster labels (the probability indices neither match the cluster labels nor are simply inverted). Is there a workaround to derive a cluster label mapping from the soft clustering probabilities?

However, I couldn't reproduce the issue with the example code (the digits dataset in the link above). The cluster index from the membership vector (via numpy.argmax over all_points_membership_vectors) now equals the cluster label for all non-noise data points in the digits dataset, so I assume some work has been done on this? Puzzling!?!
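
One workaround for the mapping question above is to align the soft-cluster column indices with the hard labels by majority vote. A minimal sketch; remap_soft_to_hard is a hypothetical helper, not part of the library:

import numpy as np

def remap_soft_to_hard(labels, membership_vectors):
    # Map each soft-cluster column index to the hard label carried by
    # the majority of its non-noise points (hypothetical helper).
    closest = membership_vectors.argmax(axis=1)
    remap = {}
    for col in np.unique(closest):
        hard = labels[(closest == col) & (labels != -1)]
        if hard.size:
            remap[col] = int(np.bincount(hard).argmax())
    return np.array([remap.get(c, -1) for c in closest])

# Usage: aligned = remap_soft_to_hard(clusterer.labels_, soft_clusters)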

vivekbharadwaj commented 7 years ago

It's worth mentioning that when I ran the clustering algorithm on a different market segment, the cluster labels did match the index of the membership vector probabilities.

After some more analysis on the previous dataset with the mismatch problem, I discovered a cyclic pattern in the ordering: the indices of the highest probabilities in the membership vectors start at 29 instead of 0. [screenshot omitted]

Happy to pm you a pickle if it helps you reproduce the problem...

lmcinnes commented 7 years ago

Sorry, I have been very busy with a number of other projects, and this was relatively low on the priority list (I was hoping to significantly overhaul the soft clustering at some point, and get to this then). I probably won't have time to get to this that soon either. I believe the problem should be a relatively easy fix -- one needs to compare the cluster selection code from _hdbscan_tree.pyx with the soft clustering code and make sure they actually align. I would be more than happy to accept a PR, but can't promise to get to this myself for a little while.

vivekbharadwaj commented 7 years ago

Thanks for your prompt reply Leland. I'm unable to work on it since I'm travelling until next week. Might give it a go once I'm back.

gilgtc commented 6 years ago

@lmcinnes @AndrewNg Any progress on this? I just ran into this issue as well...

lmcinnes commented 6 years ago

I believe this did get fixed actually, but due to some other patches elsewhere that intersected with this. What version of hdbscan are you running?
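
For anyone following along, the installed version can be checked from the standard library (Python 3.8+):

from importlib.metadata import version
print(version("hdbscan"))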

gilgtc commented 6 years ago

I have the latest version, I believe: hdbscan-0.8.13.

I just ran it again to make sure, and it still shows the same behavior as described by the OP.

lmcinnes commented 6 years ago

Hmm, let me take a look again.

gilgtc commented 6 years ago

@lmcinnes thanks for taking a look at this Leland. I am very eager to use this functionality and appreciate your time and effort.

lmcinnes commented 6 years ago

I have a proposed fix -- let me know if the current master resolves the issue for you.

gilgtc commented 6 years ago

@lmcinnes I only had a short time to try it, but it seems that I still get the same behavior. I will try it again tonight on a simpler case and let you know. In the meantime, if anyone else (@AndrewNg?) could try it as well, that would be helpful.

gilgtc commented 6 years ago

@lmcinnes @AndrewNg

I ran the soft clustering example and still got some mixed results. Most of the cluster labels from clusterer.labels_ match the index of the top probability in hdbscan.all_points_membership_vectors(clusterer), but there are still a few which don't. Specifically, out of 814 data points, 798 are correctly identified but 16 are incorrect which is a bit disconcerting. See full example below:

import hdbscan
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.manifold import TSNE

# Shared styling for the scatter plots
plot_kwds = {'alpha': 0.25, 's': 50, 'linewidths': 0}

digits = datasets.load_digits()
data = digits.data
projection = TSNE().fit_transform(data)
plt.scatter(*projection.T, **plot_kwds)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(data)
color_palette = sns.color_palette('Paired', 12)
cluster_colors = [color_palette[x] if x >= 0 else (0.5, 0.5, 0.5) for x in clusterer.labels_]

cluster_member_colors = [sns.desaturate(x, p) for x, p in zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)

soft_clusters = hdbscan.all_points_membership_vectors(clusterer)
color_palette = sns.color_palette('Paired', 12)
cluster_colors = [color_palette[np.argmax(x)] for x in soft_clusters]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_colors, alpha=0.25)

# Compare each hard label against the argmax of its membership vector
num_wrong = 0
num_right = 0
for c, sc in zip(clusterer.labels_, soft_clusters):
    if c > 0:  # skip noise (-1); note this also skips cluster 0
        if (c - np.argmax(sc)) != 0:
            num_wrong += 1
            print('(%d, %d)' % (c, np.argmax(sc)))
        else:
            num_right += 1

print('num_right = %d, num_wrong = %d' % (num_right, num_wrong))

The result is as follows (it only shows the pairs that didn't match, and at the end the total counts of correct and incorrect):

(8, 7) (1, 3) (5, 11) (1, 3) (1, 6) (1, 6) (3, 8) (1, 6) (10, 11) (6, 10) (4, 2) (3, 6) (4, 9) (9, 10) (4, 9) (1, 6)
num_right = 798, num_wrong = 16

lmcinnes commented 6 years ago

Thanks for the example. Unfortunately it looks like I'm not going to have time to dig into this until Tuesday. Hopefully it can wait until then, at which point I'll try to get into this properly and see if I can figure out what on earth is going astray.

gilgtc commented 6 years ago

@lmcinnes no worries, thanks for taking a look.

lmcinnes commented 6 years ago

Digging into this, I think the answer (unfortunately?) is that this is "just how it works". The soft clustering considers the distance from exemplars, and the merge height in the tree between the point and each of the clusters. The points that end up "wrong" are points that sit on a split in the tree -- they have the same merge height to their own cluster as to the neighboring one (perhaps that is a bug; I'll look into it further). That means tree-wise we don't distinguish them, and in terms of pure ambient distance to exemplars they are closer to the "wrong" cluster, and so get misclassified. This is a little weird, but the soft clustering is ultimately a little different from the hard clustering, so corner cases like this can theoretically occur.
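
To make that concrete, here is a purely illustrative toy calculation (not the library's actual code) of the failure mode described above: the tree-based merge heights tie for a point sitting on a split, so the exemplar distances break the tie, and they can break it toward the "wrong" cluster:

import numpy as np

# Toy numbers for one boundary point and two clusters.
# Tree component: ties because the point sits exactly on the split.
tree_component = np.array([0.5, 0.5])

# Distance component: the point happens to lie nearer the exemplars
# of cluster 1 in the ambient space.
exemplar_dists = np.array([2.0, 1.5])
dist_component = 1.0 / exemplar_dists
dist_component /= dist_component.sum()

membership = tree_component * dist_component
membership /= membership.sum()

hard_label = 0                # what the hard clustering assigned
print(membership)             # approx [0.43, 0.57]
print(membership.argmax())    # 1 -> disagrees with hard_label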

gilgtc commented 6 years ago

@lmcinnes Thanks for looking at it, that makes sense. I'll keep an eye on this thread but unfortunately, at least as it is now, I don't think I will be able to use it because in my data set the number of "wrong" clusters is pretty high.

lmcinnes commented 6 years ago

I understand. I have plans for a different clustering algorithm that is more amenable to producing soft clustering via something along these lines, but likely rather more robustly. Sorry I couldn't be of more help at this time.

gilgtc commented 6 years ago

@lmcinnes Cool, I look forward to that. Best of luck.

ricsinaruto commented 5 years ago

Any updates on this algorithm, @lmcinnes? Is the inconsistent labeling still an issue? Thanks!

lmcinnes commented 5 years ago

Unfortunately my time has been rapidly soaked up by other projects (largely UMAP), so I haven't had the opportunity to sit down and code up the new algorithm as I would like it to be yet. I believe some fixes were put in place that *should* address the inconsistent labelling, but I haven't actually checked, so I can't make any promises.

mik1904 commented 4 years ago

Hello, any updates on the labelling issue in soft clustering? @lmcinnes @gilgtc @AndrewNg Thank you!

lmcinnes commented 4 years ago

Not as yet, sorry.

fgg1991 commented 3 years ago

I ran into a dataset where nearly 90% of the data points have different soft and hard cluster labels... Do we have any update since last year?

irvintim commented 3 years ago

Since it appears this method isn't being worked on, is there another method people are using to determine the next-best cluster match for HDBSCAN data?
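
Absent a fix, one route people could try, assuming the membership vectors are trusted for ranking: take the second-highest entry of each point's membership vector as the next-best cluster. A minimal sketch on toy data:

import hdbscan
import numpy as np
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=500, centers=4, random_state=0)
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(data)
soft = hdbscan.all_points_membership_vectors(clusterer)

# Rank clusters per point by membership strength: column 0 is the best
# match, column 1 the next best.
ranked = np.argsort(soft, axis=1)[:, ::-1]
best, next_best = ranked[:, 0], ranked[:, 1]
print(next_best[:10])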