scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Clustering with only instances of 'one' data point #305

Open hammadmazhar1 opened 5 years ago

hammadmazhar1 commented 5 years ago

While clustering a large amount of network data, I have hit a case where the set to cluster consists of essentially the same data point repeated many times (600 points or so). In practice this should yield a single cluster, since the distance between every pair of points is zero (unless I am severely misunderstanding the principle HDBSCAN works on). However, fitting on this data with the allow_single_cluster=True option returns the warning: "Clusterer does not have any defined clusters, new data will be automatically predicted as noise." I plan to use the clusterer to classify new data, so this is obviously not the right outcome for me.

Any suggestions? I'm currently building the clusterer with: clusterer = hdbscan.HDBSCAN(algorithm='boruvka_balltree', memory=mem_cache, core_dist_n_jobs=5, metric='manhattan', min_cluster_size=min_clust_size, min_samples=min_samp, prediction_data=True, allow_single_cluster=True, cluster_selection_method='leaf')
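One possible workaround for the fully degenerate case described above is to detect it before fitting and handle it without the clusterer at all. The following is only a minimal sketch under stated assumptions: the helper names (manhattan, is_degenerate, predict_degenerate), the tol parameter, and the sample data are hypothetical and are not part of the hdbscan API. It uses the Manhattan metric to match the setup above, and labels a new point 0 if it coincides with the repeated point and -1 (noise) otherwise.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def is_degenerate(points, tol=0.0):
    """Return True when every row lies within `tol` of the first row."""
    first = points[0]
    return all(manhattan(first, p) <= tol for p in points)


def predict_degenerate(reference, new_points, tol=0.0):
    """Label a new point 0 if it matches the repeated point, else -1 (noise)."""
    return [0 if manhattan(reference, p) <= tol else -1 for p in new_points]


# Hypothetical data: the same 3-feature point repeated 600 times.
train = [[1.0, 2.0, 3.0]] * 600
new = [[1.0, 2.0, 3.0], [1.0, 2.0, 9.0]]

if is_degenerate(train):
    # Bypass the clusterer: there is exactly one distinct point.
    labels = predict_degenerate(train[0], new)
else:
    labels = None  # fall through to the usual HDBSCAN fit/predict path

print(labels)  # [0, -1]
```

A non-zero tol would treat near-duplicates as the same point; whether that is appropriate depends on the data.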

hammadmazhar1 commented 5 years ago

I think this might have happened due to the cluster selection method. I switched to eom, which works fine (that is, it gives me a single cluster, "0"). But I still receive the "Clusterer does not have any defined clusters, new data will be automatically predicted as noise" warning.

Amarnath-17 commented 3 years ago

Hi, I'm also getting the warning below when using the approximate_predict() function: "UserWarning: Clusterer does not have any defined clusters, new data will be automatically predicted as noise."

My fitted data has only one cluster (cluster 0) plus outliers (-1). I build the clusterer with: clusterer = hdbscan.HDBSCAN(metric='euclidean', cluster_selection_method='eom', allow_single_cluster=True, prediction_data=True).fit(x). I'm trying to use HDBSCAN to find outliers, then save the model and predict new data points with approximate_predict().

Is there any way to do it?

Thanks.
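For the one-cluster-plus-outliers case, a rough stand-in for prediction is a nearest-member rule applied to the labels obtained from fitting. This is a sketch of a different, simpler technique, not a reimplementation of approximate_predict(): the function name nearest_member_predict, the max_dist threshold, and the sample data are all hypothetical, and the threshold would have to be chosen for the data at hand. A new point receives the label of its nearest clustered training point if that point is within max_dist, and -1 otherwise.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_member_predict(train, labels, new_points, max_dist):
    """Assign each new point the label of its nearest non-noise training
    point, or -1 if that point is farther away than `max_dist`."""
    members = [(p, lab) for p, lab in zip(train, labels) if lab != -1]
    predictions = []
    for q in new_points:
        best_label, best_dist = -1, float("inf")
        for p, lab in members:
            d = euclidean(p, q)
            if d < best_dist:
                best_label, best_dist = lab, d
        predictions.append(best_label if best_dist <= max_dist else -1)
    return predictions


# Hypothetical fit result: three points in cluster 0 and one outlier.
train = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]]
labels = [0, 0, 0, -1]

new = [[0.5, 0.5], [10.0, 10.0]]
print(nearest_member_predict(train, labels, new, max_dist=1.0))  # [0, -1]
```

If the fit produced no clusters at all, every new point comes back as -1, which mirrors the behaviour the warning describes.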