allow_single_cluster=True, running time is doubled when exactly one cluster is segmented

scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.

http://hdbscan.readthedocs.io/en/latest/

BSD 3-Clause "New" or "Revised" License


allow_single_cluster=True, running time is doubled when exactly one cluster is segmented #596

Open TrolletTrygve opened 1 year ago

TrolletTrygve commented 1 year ago

With allow_single_cluster=True, running time doubles on samples where exactly one cluster is segmented. The comparison is against samples of similar size where that is not the case, and against runs with allow_single_cluster=False.

The input is a set of 2D points, clustered with the following configuration:

import hdbscan

# min_cluster_size_ is defined elsewhere in the reporter's code (value not given).
hdbscan.HDBSCAN(
    algorithm='best',
    metric='euclidean',
    min_cluster_size=min_cluster_size_,
    allow_single_cluster=True,
    p=None,
    cluster_selection_method='eom',
    min_samples=1,
    cluster_selection_epsilon=0,
    core_dist_n_jobs=1,
    approx_min_span_tree=True,
)
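
To reproduce the comparison, a timing harness along these lines might help. This is a rough sketch, not the reporter's script. The helpers `make_blobs_2d`, `median_fit_time`, and `run_benchmark` are hypothetical names, and the synthetic blobs, the point count of 5000, and `min_cluster_size=50` are assumptions, since the issue gives neither the data nor that value. Whether the single-blob dataset really yields exactly one cluster should be checked from the printed cluster count.

```python
import random
import statistics
import time


def make_blobs_2d(n_points, centers, spread=0.05, seed=0):
    """Return n_points 2D points drawn from Gaussian blobs around `centers`."""
    rng = random.Random(seed)
    points = []
    for i in range(n_points):
        cx, cy = centers[i % len(centers)]
        points.append([rng.gauss(cx, spread), rng.gauss(cy, spread)])
    return points


def median_fit_time(make_clusterer, points, repeats=5):
    """Fit a fresh clusterer `repeats` times.

    Returns (median wall-clock seconds, last fitted clusterer).
    """
    timings = []
    clusterer = None
    for _ in range(repeats):
        clusterer = make_clusterer()
        start = time.perf_counter()
        clusterer.fit(points)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), clusterer


def run_benchmark(n_points=5000, min_cluster_size=50):
    """Time HDBSCAN on one-blob and three-blob data, with the flag on and off."""
    # Imported lazily so the helpers above work without these packages.
    import numpy as np
    import hdbscan

    datasets = {
        "one blob": make_blobs_2d(n_points, [(0.0, 0.0)]),
        "three blobs": make_blobs_2d(
            n_points, [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
        ),
    }
    for name, points in datasets.items():
        X = np.array(points)
        for allow in (True, False):
            def factory(allow=allow):
                # Same settings as in the issue, except the assumed
                # min_cluster_size and the toggled allow_single_cluster.
                return hdbscan.HDBSCAN(
                    algorithm='best',
                    metric='euclidean',
                    min_cluster_size=min_cluster_size,
                    allow_single_cluster=allow,
                    cluster_selection_method='eom',
                    min_samples=1,
                    core_dist_n_jobs=1,
                    approx_min_span_tree=True,
                )

            seconds, fitted = median_fit_time(factory, X)
            # Label -1 marks noise, so it is not counted as a cluster.
            n_clusters = len(set(fitted.labels_) - {-1})
            print(f"{name:12s} allow_single_cluster={allow!s:5s} "
                  f"clusters={n_clusters} median={seconds:.3f}s")
```

Calling `run_benchmark()` prints one line per dataset and flag setting. The median over several runs is used so that one slow run does not skew the comparison.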

[Attached image: chart(5)]