scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

all_points_membership_vectors membership values not summing to 1.0 for some data points #568

Closed vdet closed 2 years ago

vdet commented 2 years ago

Dear all,

I first noticed this issue with the cuml implementation of HDBSCAN, but the original CPU version behaves the same way.

import hdbscan

ary = [[1.0, 4.0, 4.0], [2.0, 4.0, 4.0], [2.0, 4.1, 4.0], [2.0, 4.0, 4.1], [5.0, 1.0, 1.0], [5.0, 1.1, 1.1], [5.0, 1.1, 1.0], [5.0, 1.0, 1.0], [5.0, 1.0, 1.0]]
hdbscan_float = hdbscan.HDBSCAN(min_samples=2, min_cluster_size=2, prediction_data=True)
hdbscan_float.fit(ary)
hdbscan_float.labels_

produces, as expected

array([0, 0, 0, 0, 1, 1, 1, 1, 1])

Now

hdbscan.all_points_membership_vectors(hdbscan_float)

returns

array([[1.20133644e-001, 2.05858648e-002],
       [1.00000000e+000, 1.48205576e-309],
       [1.00000000e+000, 1.46559246e-309],
       [1.00000000e+000, 1.46559246e-309],
       [            nan,             nan],
       [0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000],
       [            nan,             nan],
       [            nan,             nan]])

All data points are assigned to either cluster 0 or 1. None is assigned to '-1', i.e. noise. Yet the memberships of the first data point sum to about 0.14, not 1.0.

Should 1.0-0.14070807 [= 1 - sum(cluster memberships of the first point)] be interpreted as the probability of the first point being a 'noise point'? If so, that probability is higher than its membership in either cluster 0 or 1, and it seems that the point should be labelled '-1'.
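For reference, the row sums can be checked directly from the output above. This is a minimal sketch in plain Python (hdbscan is not needed to run it). `row_summary` is a hypothetical helper, and reading `1 - sum` as a noise probability is only the interpretation being asked about here, not something the library confirms.

```python
import math

nan = float("nan")

# Membership vectors copied from the all_points_membership_vectors output above.
membership = [
    [1.20133644e-01, 2.05858648e-02],
    [1.0, 1.48205576e-309],
    [1.0, 1.46559246e-309],
    [1.0, 1.46559246e-309],
    [nan, nan],
    [0.0, 0.0],
    [0.0, 0.0],
    [nan, nan],
    [nan, nan],
]


def row_summary(vectors):
    """Return (row_sum, 1 - row_sum) per row, or None for rows containing NaN."""
    summary = []
    for row in vectors:
        if any(math.isnan(value) for value in row):
            summary.append(None)  # a sum over NaN entries is not meaningful
        else:
            total = sum(row)
            summary.append((total, 1.0 - total))
    return summary


for index, item in enumerate(row_summary(membership)):
    if item is None:
        print(f"point {index}: contains NaN")
    else:
        total, remainder = item
        print(f"point {index}: sum={total:.6f}, 1 - sum={remainder:.6f}")
```

The check also makes it easy to see that only rows 1 to 3 sum to 1.0, while the points labelled as cluster 1 are either all zero or NaN.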

The documentation states that 'The return value is a two-dimensional numpy array. Each point of the input data is assigned a vector of probabilities of being in a cluster.', which does not seem to be the case here. Am I missing something?

All the best,

Vincent

aholovenko commented 2 years ago

@vdet I think https://hdbscan.readthedocs.io/en/latest/soft_clustering_explanation.html#converting-a-conditional-probability might answer your question about the sum-to-1 condition.
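If the linked section is read as saying that the returned vector is a joint probability (membership in a given cluster combined with the probability of belonging to any cluster at all), then dividing each row by its own sum should give a distribution over clusters that does sum to 1. This is a rough sketch under that assumption, in plain Python. `normalize_rows` is a hypothetical helper and not part of hdbscan, and rows that are all zero or contain NaN are left as `None` because they cannot be normalized.

```python
import math


def normalize_rows(vectors):
    """Rescale each membership row so it sums to 1.

    Rows that contain NaN or sum to zero cannot be rescaled and map to None.
    """
    normalized = []
    for row in vectors:
        if any(math.isnan(value) for value in row):
            normalized.append(None)
            continue
        total = sum(row)
        if total == 0.0:
            normalized.append(None)
            continue
        normalized.append([value / total for value in row])
    return normalized


# First three rows of the output reported in this issue, plus a zero row and a NaN row.
sample = [
    [1.20133644e-01, 2.05858648e-02],
    [1.0, 1.48205576e-309],
    [0.0, 0.0],
    [float("nan"), float("nan")],
]

for index, row in enumerate(normalize_rows(sample)):
    print(index, row)
```

Under this reading, the first point is mostly attributed to cluster 0 once it is assumed to belong to some cluster, which matches its label of 0.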

vdet commented 2 years ago

Thanks!