Open erooke opened 1 year ago
That's interesting; I'd also like an explanation. Setting min_samples to 3 gives the expected result. I personally always set min_samples explicitly, which matters especially with a large min_cluster_size, because the default value (min_samples equal to min_cluster_size) would lead to ultra-conservative clustering. However, that's not the case here.
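To make the "more conservative" remark concrete, here is a minimal sketch (not the hdbscan library's code) of the core distance that min_samples controls. HDBSCAN works on the mutual reachability distance max(core(a), core(b), d(a, b)), so a larger core distance pushes points apart. The helper names, the 100-point circle, and the convention of excluding the point itself when counting neighbours are all assumptions for illustration; the library may count neighbours differently.

```python
import math

def circle_points(n):
    """n points at equal angular steps on the unit circle."""
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]

def core_distance(points, i, k):
    """Distance from points[i] to its k-th nearest other point.

    Excluding the point itself is an assumed convention for this sketch.
    """
    dists = sorted(math.dist(points[i], p)
                   for j, p in enumerate(points) if j != i)
    return dists[k - 1]

points = circle_points(100)  # sample count is an assumption
cores_by_k = {}
for k in (3, 5, 20):  # candidate min_samples values
    cores_by_k[k] = [core_distance(points, i, k) for i in range(len(points))]
    print(k, min(cores_by_k[k]), max(cores_by_k[k]))
```

On evenly spaced data every point gets essentially the same core distance for a given k, and that distance grows as k grows, which is the sense in which a large default min_samples is conservative.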
Perhaps it has something to do with floating-point accuracy causing the intended equidistant steps to not be truly equidistant?
With theta = np.linspace(-np.pi, np.pi, samples, endpoint=False):
With theta = np.linspace(-3.14, 3.14, samples, endpoint=False):
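The spacing in these two samplings can be checked directly, without running hdbscan. The sketch below is only a measurement of neighbour distances: `linspace_open` is a hypothetical pure-Python stand-in for np.linspace(..., endpoint=False), and samples = 100 is an assumed value since the original count is not shown. It does not demonstrate that these differences are what causes the noise points.

```python
import math

def linspace_open(start, stop, n):
    """Pure-Python stand-in for np.linspace(start, stop, n, endpoint=False)."""
    step = (stop - start) / n
    return [start + i * step for i in range(n)]

def neighbour_gaps(thetas):
    """Chord lengths between consecutive points on the unit circle,
    including the wrap-around pair (last point back to the first)."""
    pts = [(math.cos(t), math.sin(t)) for t in thetas]
    return [math.dist(pts[i], pts[(i + 1) % len(pts)])
            for i in range(len(pts))]

samples = 100  # assumed; the original value is not shown
spread = {}
for label, lo, hi in (("pi", -math.pi, math.pi), ("3.14", -3.14, 3.14)):
    gaps = neighbour_gaps(linspace_open(lo, hi, samples))
    spread[label] = max(gaps) - min(gaps)
    print(label, "spread:", spread[label], "distinct gaps:", len(set(gaps)))
```

With ±np.pi the gaps agree up to rounding error, so any differences are at the level of the last few bits. With ±3.14 the circle is not quite closed, so the wrap-around gap is larger than the others by roughly 2π − 6.28 ≈ 0.003, which is a real geometric asymmetry rather than a rounding effect.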
I have been playing with hdbscan to try to build an intuition for what it is doing. I am currently running into counter-intuitive behavior when running it on synthetic data. In particular, I have been running hdbscan on data sampled evenly from a circle. My understanding of the algorithm suggests it should return a single cluster, similar to what dbscan would return with a suitable epsilon setting. However, hdbscan instead identifies a single cluster and a collection of noise points. If I reduce the minimum number of points needed for a cluster below 4, the noise points vanish. Given the symmetry of the data, I don't see why this parameter should make much of a difference to how the clustering works. I'm curious whether my intuition is way off or whether there is an issue with how I am invoking hdbscan.
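The dbscan intuition above can be sketched as follows. This is not the code from this issue and not scikit-learn's DBSCAN; it only models the connectivity part (link any two points within eps and count connected components) and ignores the core-point rule. The function name `eps_components` and the choice of 100 points are assumptions for illustration.

```python
import math

def eps_components(points, eps):
    """Count connected components when points within eps of each other are linked."""
    n = len(points)
    seen = [False] * n
    components = 0
    for start in range(n):
        if seen[start]:
            continue
        components += 1
        seen[start] = True
        stack = [start]
        while stack:  # depth-first walk over the eps-neighbourhood graph
            i = stack.pop()
            for j in range(n):
                if not seen[j] and math.dist(points[i], points[j]) <= eps:
                    seen[j] = True
                    stack.append(j)
    return components

n = 100  # assumed sample count
points = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
          for i in range(n)]
step = 2 * math.sin(math.pi / n)  # chord length between adjacent points

print(eps_components(points, 1.5 * step))  # 1: the whole ring is linked
print(eps_components(points, 0.5 * step))  # 100: no two points are linked
```

For evenly spaced points there is no epsilon at which only part of the ring is connected, which is why a single cluster with no noise is the natural expectation.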
Code:
Expected output:
Actual output: (note the orange noise points in the lower right)
System Information:
python version: 3.10.8
hdbscan version: 0.8.29