min_cluster_size is actually "max noise size"

Phlya commented 8 years ago

Hi,

Thank you for this amazing clustering algorithm and such an easy-to-use library. However, I think I've found a minor bug: the min_cluster_size keyword actually acts as the maximum size of a group that is not considered a cluster. See the example below:

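import numpy as np
import hdbscan

# `data` is assumed to be a pandas DataFrame with coordinate columns 'x', 'y' and 'z'
data['Cluster2'] = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(data[['x', 'y', 'z']])
data['Cluster3'] = hdbscan.HDBSCAN(min_cluster_size=3).fit_predict(data[['x', 'y', 'z']])

# Smallest group size found with min_cluster_size=2
gb2 = data.groupby('Cluster2')
l = np.nan
for n, cluster in gb2:
    l = np.nanmin([l, cluster.shape[0]])
print(l)

# Smallest group size found with min_cluster_size=3
gb3 = data.groupby('Cluster3')
l = np.nan
for n, cluster in gb3:
    l = np.nanmin([l, cluster.shape[0]])
print(l)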

This prints 3.0 and 4.0: in both cases the smallest group is one point larger than min_cluster_size.

(I'm using the latest version available through pip.)
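
For anyone who wants to check this on their own data, here is a minimal sketch of a size check that leaves out noise points. fit_predict labels noise as -1, and a plain groupby counts that label as a group too. The helper name smallest_cluster_size and the example labels are made up for illustration; they are not part of hdbscan.

```python
from collections import Counter

NOISE_LABEL = -1  # fit_predict marks noise points with the label -1


def smallest_cluster_size(labels, noise_label=NOISE_LABEL):
    """Return the size of the smallest non-noise cluster, or None if there is none."""
    counts = Counter(int(label) for label in labels if int(label) != noise_label)
    return min(counts.values()) if counts else None


# Hypothetical labels standing in for the output of fit_predict:
# cluster 0 has 3 points, cluster 1 has 4 points, and 2 points are noise.
example_labels = [0, 0, 0, 1, 1, -1, -1, 1, 1]
print(smallest_cluster_size(example_labels))  # 3
```

Passing the array returned by fit_predict to this helper gives the smallest real cluster, which can then be compared directly with the min_cluster_size that was requested.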

lmcinnes commented 8 years ago

Good catch; it's an off-by-one error in the condense tree process. I'll get the fix committed soon, and the next version on pip will have it rolled in.

Thanks.
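
As a toy illustration only (this is not the library's condense tree code, and both function names are hypothetical), an off-by-one of this kind can come from nothing more than a strict comparison where an inclusive one was intended. The sketch below assumes that is the shape of the mistake, which would match the symptom reported above.

```python
def is_cluster_strict(size, min_cluster_size):
    # Strict comparison: a group of exactly min_cluster_size points is rejected,
    # so the smallest accepted cluster has min_cluster_size + 1 points.
    return size > min_cluster_size


def is_cluster_inclusive(size, min_cluster_size):
    # Inclusive comparison: a group of exactly min_cluster_size points is accepted.
    return size >= min_cluster_size


# Compare the two checks around the boundary for min_cluster_size=3
for size in (2, 3, 4):
    print(size, is_cluster_strict(size, 3), is_cluster_inclusive(size, 3))
```

The two checks only disagree at size == min_cluster_size, which is exactly the case the report above runs into.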

Phlya commented 8 years ago

Nice, thanks!

scikit-learn-contrib / hdbscan

min_cluster_size is actually "max noise size" #10