scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

hdbscan how to fix this scene ? #168

Open Garfiled opened 6 years ago

Garfiled commented 6 years ago

import hdbscan

points = []

# (close point1 point2 point3) (point4) (close point5 point6 point7)
points.append([116.286932, 40.055431])
points.append([116.286905, 40.055411])
points.append([116.286905, 40.055421])

# Not used: the reported results contain only the point below.
# points.append([116.289789, 40.055859])
points.append([116.289789, 40.055865])

points.append([116.291487, 40.056122])
points.append([116.291549, 40.056191])
points.append([116.291567, 40.056177])

clusterer = hdbscan.HDBSCAN(min_cluster_size=2, metric='haversine')
cluster_labels = clusterer.fit_predict(points)

# Format each point as "x,y" for printing.
points = [str(p[0]) + "," + str(p[1]) for p in points]

print(";".join(points))
print(cluster_labels)

The results:

116.286932,40.055431;116.286905,40.055411;116.286905,40.055421;116.289789,40.055865;116.291487,40.056122;116.291549,40.056191;116.291567,40.056177

[0 0 0 1 1 1 1]

I just want to ask why hdbscan assigns the labels [0 0 0 1 1 1 1] rather than [0 0 0 -1 1 1 1] or [0 0 0 1 2 2 2].

lmcinnes commented 6 years ago

Ultimately it is because the two major clusters split before the singleton point split off (i.e. the singleton point was closer to one of the clusters than the clusters were to each other). Things can go a little oddly with density-based clustering on so little data, since it gives very little clear indication of the actual densities.
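
A rough way to check the "closer to a cluster than the clusters were to each other" point is to compare the smallest gaps between the three groups from the question. The sketch below is only an illustration under simplifying assumptions: it uses plain Euclidean distance on the raw coordinate values, not the haversine metric or the mutual reachability distance that HDBSCAN works with, and the names `left`, `single`, `right` and `gap` are made up for this example.

```python
import math
from itertools import product

# Coordinates from the question, grouped as the poster described them.
left = [(116.286932, 40.055431), (116.286905, 40.055411), (116.286905, 40.055421)]
single = [(116.289789, 40.055865)]
right = [(116.291487, 40.056122), (116.291549, 40.056191), (116.291567, 40.056177)]


def gap(group_a, group_b):
    """Smallest pairwise distance between two groups (a single-linkage gap).

    Assumption: plain Euclidean distance on the raw values, which is a
    simplification of what HDBSCAN computes internally.
    """
    return min(math.dist(p, q) for p, q in product(group_a, group_b))


gaps = {
    "left-single": gap(left, single),
    "single-right": gap(single, right),
    "left-right": gap(left, right),
}

# Print the gaps from smallest to largest.
for name, value in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.6f}")
```

Under these assumptions the gap between the single point and the right-hand group is the smallest of the three, and the gap between the two groups is the largest, which matches the explanation above: the widest gap separates first, while the single point stays attached to the right-hand group. To look at the hierarchy HDBSCAN itself built, the fitted `clusterer` exposes `single_linkage_tree_` and `condensed_tree_`.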