scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Simple 3 blobs not separatable #68

Closed michaelaye closed 8 years ago

michaelaye commented 8 years ago

I'm surprised that I cannot seem to separate these blobs easily into 3 clusters.

[Image: example_blobs]

I looped over all reasonable min_cluster_size values (up to 30, with 90 samples across these 3 clusters), and only min_cluster_size=3 with min_samples=1 produced more than 2 clusters, but it produced 11, which is far too many.

Here's the example notebook for debugging:

https://gist.github.com/9bf06c2b796b93771da85b57785009b5

michaelaye commented 8 years ago

FYI: My data often looks like this, with around 20 measurements per location and potentially large scatter (citizen science measurements). If there's anything else I could tweak to make this kind of data separate into three clusters, I'd be happy to know! Thanks for your package!

lmcinnes commented 8 years ago

As far as HDBSCAN, or indeed any density-based clustering algorithm, is concerned, there are only two clusters there: the grey and white blobs overlap enough that they can't be distinguished, since there is no region of notably lower density between them. With more samples creating denser cores, so that there is an appreciable difference in density between the centers of the blobs and their area of overlap, you could probably separate the clusters.
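To make the "no lower-density region between them" point concrete, here is a minimal sketch that compares the density at the midpoint between two blobs with the density at a blob center. It assumes an idealized case that is not taken from the data in this issue: two equal-weight, one-dimensional Gaussians with the same spread. The function names are made up for illustration, and this is the underlying density, not what HDBSCAN estimates from a finite sample.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_density(x, centers, sigma):
    """Equal-weight mixture of Gaussians sharing the same sigma."""
    return sum(gaussian_pdf(x, c, sigma) for c in centers) / len(centers)


def dip_ratio(separation, sigma=1.0):
    """Density at the midpoint divided by density at one blob center.

    A ratio at or above 1 means there is no valley between the blobs;
    a ratio well below 1 means a clear low-density gap exists.
    """
    centers = (0.0, separation)
    midpoint = mixture_density(separation / 2.0, centers, sigma)
    at_center = mixture_density(0.0, centers, sigma)
    return midpoint / at_center


# Separation is expressed in units of sigma (hypothetical values).
for separation in (1.0, 2.0, 3.0, 4.0, 6.0):
    ratio = dip_ratio(separation)
    print(f"separation {separation:.1f} sigma: midpoint/center ratio {ratio:.3f}")
```

In this idealized setting, blobs whose centers are no more than about two standard deviations apart have no density valley at all. With few samples, even a moderate valley is likely to be hidden by sampling noise.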

I think the short answer is that, given your constraints (not that many samples, potential for overlap on the fringes), this isn't the right clustering algorithm for you. In a sense there isn't quite enough data for the density to separate things, so you really need to supply some other information. If you have some prior knowledge of the distribution of your clusters (are they, say, expected to be Gaussian?), then that might suffice, and I would recommend Dirichlet process Gaussian mixture models: http://scikit-learn.org/stable/modules/generated/sklearn.mixture.DPGMM.html
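For readers trying this today, a rough sketch of the mixture-model suggestion follows. As far as I know, the DPGMM class linked above has since been removed from scikit-learn, and BayesianGaussianMixture with a Dirichlet process prior is the closest current equivalent; that substitution is my assumption, not something stated in this thread. The data is synthetic (generated with make_blobs), not the data from this issue, and the blob centers and the upper bound of 10 components are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Synthetic stand-in data (an assumption, not the data from this issue):
# 90 points drawn from 3 Gaussian blobs, two of which sit close together.
X, _ = make_blobs(
    n_samples=90,
    centers=[[0.0, 0.0], [3.0, 0.0], [8.0, 6.0]],
    cluster_std=1.0,
    random_state=0,
)

# n_components is only an upper bound; the Dirichlet process prior can
# shrink the weights of components the data does not support.
model = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
)
labels = model.fit_predict(X)

n_used = len(np.unique(labels))
print("components actually used:", n_used)
print("mixture weights:", np.round(model.weights_, 3))
```

How many components end up being used depends on the data and on the prior settings, so the result is worth checking against what you know about the measurement locations.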

Sorry I can't be more help.

michaelaye commented 7 years ago

So, I tried the DPGMM algorithms, but they just fail, I guess because 5-10 samples per location are just not enough to distinguish a cluster from noise. I actually do have more information for clustering. Imagine these dots to be only the center points of more complex drawings, each of which has, for example, an angle (= orientation) in the field and two radius values (ellipses). But what if I would like to allow less precision on the radii than on the location of the center? Is it possible to control that somehow?

lmcinnes commented 7 years ago

I believe the short answer is that you simply don't have enough data to "cluster". At best you can partition the data, but you really need to know how many parts to break it into, and even then noise can potentially dominate/cause issues with so small a dataset.

I think you need to find a way to make use of the greater amount of information you have. If each data point is really the center of some potentially Gaussian distribution (ellipsoidal via the covariance), you might want to consider a kernel density estimate: approximate the Gaussians (use the radii and the ellipse information to estimate each covariance matrix) and sum them all up. That might give you something to cluster, or at least to threshold in a semi-sensible way.
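A minimal sketch of that idea, under stated assumptions: each marking is treated as a 2-D Gaussian whose covariance is built from its two radii and its angle, with the radii interpreted as one-standard-deviation semi-axes (that interpretation is a guess and may need rescaling). The function names and the example markings are hypothetical and are not taken from the data in this issue.

```python
import numpy as np


def ellipse_covariance(radius_major, radius_minor, angle_rad):
    """Covariance matrix of a Gaussian whose 1-sigma contour is the ellipse.

    Assumes the radii are 1-sigma semi-axes and angle_rad is the rotation
    of the major axis from the x-axis.
    """
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rotation = np.array([[c, -s], [s, c]])
    return rotation @ np.diag([radius_major**2, radius_minor**2]) @ rotation.T


def summed_gaussian_density(query_points, centers, covariances):
    """Average of the per-marking Gaussian densities at each query point."""
    query_points = np.atleast_2d(query_points)
    density = np.zeros(len(query_points))
    for center, cov in zip(centers, covariances):
        diff = query_points - center
        inv_cov = np.linalg.inv(cov)
        # Squared Mahalanobis distance of every query point to this center.
        mahalanobis_sq = np.einsum("ni,ij,nj->n", diff, inv_cov, diff)
        norm = 2.0 * np.pi * np.sqrt(np.linalg.det(cov))
        density += np.exp(-0.5 * mahalanobis_sq) / norm
    return density / len(centers)


# Hypothetical markings: center (x, y), two radii, and an angle each.
centers = np.array([[0.0, 0.0], [0.5, 0.2], [0.3, -0.4], [5.0, 5.0]])
radii_major = np.array([1.0, 1.2, 0.9, 1.0])
radii_minor = np.array([0.5, 0.6, 0.4, 0.5])
angles = np.deg2rad([0.0, 20.0, -10.0, 45.0])

covariances = [
    ellipse_covariance(a, b, t)
    for a, b, t in zip(radii_major, radii_minor, angles)
]

density_near = summed_gaussian_density([0.2, 0.0], centers, covariances)[0]
density_far = summed_gaussian_density([2.5, 2.5], centers, covariances)[0]
print("density near the group of markings:", density_near)
print("density between the groups:", density_far)
```

Evaluating the same function on a regular grid would give a density map that can then be thresholded or passed to a clustering step. Inflating the radii before building the covariances is one way to express lower confidence in a marking's shape.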