scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Sum of membership_vector not equal to 1 #246

Open codata-hg opened 5 years ago

codata-hg commented 5 years ago

Hi, I have a clusterer trained with many clusters identified. I used

hdbscan.prediction.membership_vector(clusterer, points_to_predict)

to get the probability distribution of the points over all clusters. I expected the membership scores in each vector to sum to one, but they don't. Why is that?

Thanks

lmcinnes commented 5 years ago

The sum will be the probability that the point is in any cluster. Since HDBSCAN considers some points "noise" you can think of this as one minus the probability that the point is noise. Hopefully that is helpful.
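
A minimal sketch of that interpretation, using NumPy only. The helper names `noise_probability` and `conditional_membership` are hypothetical (they are not part of hdbscan), the clipping to [0, 1] is an added assumption, and the numbers are made up for illustration. The second helper shows one possible way to get rows that sum to one, by conditioning on the point not being noise.

```python
import numpy as np


def noise_probability(membership):
    """Return 1 - row sum for an (n_points, n_clusters) membership matrix.

    Each row sum is read as the probability of belonging to any cluster.
    Clipping to [0, 1] is an assumption added for this sketch.
    """
    membership = np.asarray(membership, dtype=float)
    in_any_cluster = membership.sum(axis=1)
    return np.clip(1.0 - in_any_cluster, 0.0, 1.0)


def conditional_membership(membership):
    """Rescale each row to sum to 1, i.e. condition on 'not noise'.

    Rows that sum to zero are returned as all zeros.
    """
    membership = np.asarray(membership, dtype=float)
    totals = membership.sum(axis=1, keepdims=True)
    return np.divide(
        membership, totals, out=np.zeros_like(membership), where=totals > 0
    )


# Made-up membership vectors for three points over three clusters.
example = np.array([
    [0.70, 0.10, 0.05],
    [0.02, 0.03, 0.01],
    [0.00, 0.00, 0.00],
])
noise = noise_probability(example)
rescaled = conditional_membership(example)
```

With a fitted clusterer, `example` would be replaced by the array returned from hdbscan.prediction.membership_vector(clusterer, points_to_predict).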

codata-hg commented 5 years ago

Thanks for your timely response. It makes sense! But I also made some weird observations. Some samples in the core of a cluster have probabilities_=1, but when I pass them to membership_vector() for prediction, I sometimes get zero probability for the cluster the sample belongs to and non-zero probability for the others.

I also ran the same test on samples labeled as noise. Some behave as expected, with a low sum of probabilities, which means the sample is dissimilar from all clusters. But other noise samples give a sum of probabilities close to 1, like [0.3, 0.3, 0.3]. Any thoughts on this?
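
One way to locate both kinds of rows described above is sketched here with NumPy only. The helper `flag_suspicious` and its `noise_sum_threshold` parameter are hypothetical, the data are made up, and the code assumes that column j of the membership matrix corresponds to cluster label j.

```python
import numpy as np


def flag_suspicious(labels, membership, noise_sum_threshold=0.8):
    """Return indices of rows matching the two observations.

    labels: hard labels, with -1 meaning noise.
    membership: (n_points, n_clusters) soft membership matrix.
    Assumes column j of `membership` corresponds to cluster label j.
    """
    labels = np.asarray(labels)
    membership = np.asarray(membership, dtype=float)
    rows = np.arange(len(labels))
    clustered = labels >= 0

    # Membership each clustered point has in its own hard-assigned cluster.
    own = np.zeros(len(labels))
    own[clustered] = membership[rows[clustered], labels[clustered]]
    zero_in_own_cluster = rows[clustered & (own == 0.0)]

    # Noise points whose total membership is nevertheless high.
    row_sums = membership.sum(axis=1)
    noisy_but_confident = rows[~clustered & (row_sums >= noise_sum_threshold)]
    return zero_in_own_cluster, noisy_but_confident


# Made-up data: two clustered points followed by two noise points.
labels = np.array([0, 1, -1, -1])
membership = np.array([
    [0.00, 0.40, 0.30],  # labeled 0, but zero membership in cluster 0
    [0.10, 0.80, 0.00],
    [0.30, 0.30, 0.30],  # noise, yet the row sums to about 0.9
    [0.01, 0.02, 0.00],
])
zero_own, noisy_high = flag_suspicious(labels, membership)
```

On real data, `labels` would come from clusterer.labels_ and `membership` from the soft clustering output for the same points.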

lmcinnes commented 5 years ago

Sadly there are some bugs in the soft cluster membership. It works fine for some datasets, but can get messed up badly at times. I have plans for a grand re-write at some point, so haven't really tracked exactly what is astray. Sorry.

brusberg commented 2 years ago

Hello @lmcinnes, I just wanted to check whether the bug in soft cluster membership that would affect membership_vector has been fixed?

djaym7 commented 2 years ago

+1, checking in again to see if it's fixed; if not, adding a #featureRequest for it.

simonpedrogonzalez commented 2 years ago

I think I have the same issue. I expected 1 - probabilities_ + all_points_membership_vectors.sum(axis=1) == 1. For some reason this is not always the case: from time to time I get values significantly greater than 1, like 1.07. By the way, I love your work, @lmcinnes
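
A small sketch of that check, using NumPy only. The identity is the expectation stated in this comment, not a documented invariant of hdbscan. The helper names and the tolerance are hypothetical, and the numbers are made up.

```python
import numpy as np


def expected_unit_total(probabilities, membership):
    """Return (1 - probabilities) + membership row sums, one value per point."""
    probabilities = np.asarray(probabilities, dtype=float)
    membership = np.asarray(membership, dtype=float)
    return (1.0 - probabilities) + membership.sum(axis=1)


def rows_off_by(probabilities, membership, tol=0.05):
    """Return indices of points whose total deviates from 1 by more than tol."""
    total = expected_unit_total(probabilities, membership)
    return np.flatnonzero(np.abs(total - 1.0) > tol)


# Made-up values for three points over two clusters.
probabilities = np.array([1.0, 0.9, 0.5])
membership = np.array([
    [1.00, 0.00],
    [0.85, 0.12],  # total comes out at 1.07
    [0.30, 0.20],
])
totals = expected_unit_total(probabilities, membership)
offenders = rows_off_by(probabilities, membership)
```

On real data, `probabilities` would be clusterer.probabilities_ and `membership` the array returned by hdbscan.all_points_membership_vectors(clusterer), which requires fitting with prediction_data=True.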

lmcinnes commented 2 years ago

I'm no longer maintaining the soft clustering, as I have too many other things on my plate. PRs are welcome, however.