scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Counter-intuitive noise points #577

Open erooke opened 1 year ago

erooke commented 1 year ago

I have been playing with hdbscan to try to build an intuition for what it is doing, and I am running into counter-intuitive behavior on synthetic data. In particular, I have been running hdbscan on points sampled evenly from a circle. My understanding of the algorithm suggests it should return a single cluster, similar to what dbscan would do with the proper epsilon setting. However, hdbscan instead identifies a single cluster plus a collection of noise points.

If I reduce the minimum number of points needed for a cluster below 4, the noise points vanish. Given the symmetry of the data, I don't see why this parameter should make much difference to the clustering. I'm curious whether my intuition is way off or whether there is an issue with how I am invoking hdbscan.

Code:

```python
from hdbscan import HDBSCAN
import matplotlib.pyplot as plt
import numpy as np

min_cluster_size = 4  # Minimum size to cause an issue
samples = 100

# Points sampled evenly around the unit circle
theta = np.linspace(-np.pi, np.pi, samples, endpoint=False)
data = np.zeros((samples, 2))
data[:, 0] = np.cos(theta)
data[:, 1] = np.sin(theta)

clusterer = HDBSCAN(min_cluster_size=min_cluster_size, allow_single_cluster=True)
clusterer.fit(data)

# Plot each label (including noise, label -1) in its own color
labels = set(clusterer.labels_)
for label in labels:
    cluster = data[clusterer.labels_ == label]
    plt.scatter(cluster[:, 0], cluster[:, 1])

plt.axis("equal")
plt.show()
```

Expected output: [image]

Actual output: [image] (note the orange noise points in the lower right)

System Information:
- Python version: 3.10.8
- hdbscan version: 0.8.29

lucetka commented 1 year ago

That's interesting; I'd also like an explanation. Setting min_samples to 3 leads to the expected result: [image]

I personally always set min_samples explicitly. That is especially important when using a large min_cluster_size, because the default value (i.e. a value equal to min_cluster_size) leads to ultra-conservative clustering. However, that's not the case here.

Perhaps it has something to do with floating-point accuracy causing the intended equidistant steps to not be truly equidistant? With theta = np.linspace(-np.pi, np.pi, samples, endpoint=False): [image]

With theta = np.linspace(-3.14, 3.14, samples, endpoint=False): [image]
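One way to probe that hypothesis is to look directly at the spread of the spacings between consecutive sample points; this is just a quick sketch I put together, not something from the original post:

```python
import numpy as np

samples = 100
theta = np.linspace(-np.pi, np.pi, samples, endpoint=False)
data = np.column_stack((np.cos(theta), np.sin(theta)))

# Distance from each point to the next, wrapping around the circle
diffs = np.linalg.norm(data - np.roll(data, -1, axis=0), axis=1)

# Ideally every spacing equals the chord length 2*sin(pi/samples);
# any spread here comes purely from floating-point rounding
print(diffs.max() - diffs.min())
```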