Relationships between labels_, outlier_scores_ and probabilities_

michaelaye commented 7 years ago

On a simple `make_blobs` test dataset I'm getting these different arrays for `outlier_scores_`, `labels_`, and `probabilities_`, and I cannot find anything in the docs that specifies the relationship between them:

screenshot 2017-01-08 18 06 16

Interestingly, index 9 has been flagged as an outlier, yet it has not received a negative label (which I thought was the marker for not belonging to a cluster) and in fact has a probability of 0.8, which I find highly confusing. How can it be an outlier and at the same time belong to a cluster?

My other question: it looks like negative labels are assigned to anything with a probability lower than 0.1. Is that true, or a coincidence?

michaelaye commented 7 years ago

OK, I figured out the relationship between probabilities and labels: when points are considered noise, they always get both label = -1 and probability = 0.0. Correct?

lmcinnes commented 7 years ago

They each have somewhat different meanings. The outlier score is an implementation of GLOSH, and it is important to note that it catches local outliers as well as points that are simply far away from everything (indeed, this is an important and powerful aspect of it). Thus a point can be "in" a cluster, and have a label, but be sufficiently far from an otherwise very dense core that it is anomalous in that local region of space (i.e. it is weird to have a point there when almost everything else is far more tightly grouped).
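To make that concrete, here is a minimal sketch of how one might list the points that carry a cluster label but still score as local outliers. The helper name `labelled_outliers` and the 0.5 threshold are assumptions for illustration, not part of hdbscan; the inputs are meant to be the `labels_` and `outlier_scores_` arrays of a fitted clusterer.

```python
def labelled_outliers(labels, outlier_scores, threshold=0.5):
    """Return indices of points that belong to a cluster (label >= 0)
    but whose outlier score exceeds `threshold` (an arbitrary cut-off)."""
    return [
        i
        for i, (label, score) in enumerate(zip(labels, outlier_scores))
        if label >= 0 and score > threshold
    ]


# Toy values for illustration only (not the data from the screenshot).
labels = [0, 0, 1, -1, 1]
outlier_scores = [0.05, 0.80, 0.10, 0.95, 0.20]

# Index 1 is in cluster 0 yet has a high outlier score; index 3 is
# plain noise (label -1), so it is deliberately not reported here.
print(labelled_outliers(labels, outlier_scores))
```

A point such as index 9 in the question would show up in this list: labelled, but unusual relative to the dense core of its own cluster.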

The probabilities are slightly misnamed, and I really should change that. Each value is essentially a "cluster membership score": if the point is in a cluster, how well tied to that cluster is it, relatively speaking? It is effectively the ratio between the distance scale at which this point fell out of the cluster and the distance scale of the core of the cluster. In due course I would like to replace this with the upcoming soft clustering probabilities.
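A rough sketch of that ratio, under stated assumptions: the scale is expressed as a density level `lambda = 1 / distance` (larger means denser), and the score is clipped so it never exceeds 1. The function `membership_score` and its argument names are hypothetical and only illustrate the idea; this is not the library's code.

```python
def membership_score(lambda_point, lambda_cluster_max):
    """Ratio of the density level at which a point left its cluster to the
    density level reached by the cluster's core.

    Assumes lambda = 1 / distance, so a point that drops out early (at a
    large distance) has a small lambda and therefore a low score.
    """
    if lambda_cluster_max == 0.0:
        # Guard against division by zero; treating this case as full
        # membership is an assumption made for this sketch.
        return 1.0
    return min(lambda_point, lambda_cluster_max) / lambda_cluster_max


# A point that fell out at distance 0.5, in a cluster whose core
# persists down to distance 0.1.
print(membership_score(1 / 0.5, 1 / 0.1))

# A point that stays until the core scale gets the maximum score.
print(membership_score(1 / 0.1, 1 / 0.1))
```

Read this way, a value like 0.8 says the point held on to its cluster for most of the cluster's density range, which is a different statement from whether it is locally anomalous.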

And yes, any noise points are assigned a probability of 0, since the value is the "membership score" for the cluster a point belongs to, and noise points are not members of any cluster.
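If you want to check this on your own results, a small sanity check along these lines should do. The helper `noise_is_consistent` is a hypothetical name, and the inputs are assumed to be the `labels_` and `probabilities_` arrays of a fitted clusterer.

```python
def noise_is_consistent(labels, probabilities):
    """Return True if every noise point (label == -1) has probability 0.0."""
    return all(
        prob == 0.0
        for label, prob in zip(labels, probabilities)
        if label == -1
    )


# Toy values for illustration only.
labels = [0, -1, 1, -1, 0]
probabilities = [1.0, 0.0, 0.8, 0.0, 0.3]
print(noise_is_consistent(labels, probabilities))
```

Note that the check runs in one direction only: noise implies a score of 0, but this thread does not establish that a low nonzero score implies noise.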

scikit-learn-contrib / hdbscan

Relationships between labels_, outlier_scores_ and probabilities_ #80