scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

HDBSCAN AttributeError #287

Closed rick77777 closed 5 years ago

rick77777 commented 5 years ago

I am trying to use HDBSCAN for a project. After reading up on it, I tried to run a basic clustering on generated data, but I got an error:

AttributeError: 'HDBSCAN' object has no attribute 'labels_'

Here is the code I wrote:

blobs,labels= make_blobs(n_samples=2000, n_features=17)
clusterer = hd.HDBSCAN()
clusterer.fit(blobs)
clusterer = hd.HDBSCAN(algorithm='best', alpha=1.0, approx_min_span_tree=True,
gen_min_span_tree=False, leaf_size=40, metric='euclidean', min_cluster_size=5,
min_samples=None, p=None)
clusterer.labels_

So, I can't get the cluster labels. Please suggest where I went wrong. Thanks in advance!!

lmcinnes commented 5 years ago

I think you want something more like

blobs,labels= make_blobs(n_samples=2000, n_features=17)
clusterer = hd.HDBSCAN(algorithm='best', alpha=1.0, approx_min_span_tree=True,
gen_min_span_tree=False, leaf_size=40, metric='euclidean', min_cluster_size=5,
min_samples=None, p=None)
clusterer.fit(blobs)
clusterer.labels_

You have to have all the parameters set before calling fit, and then the labels will be available after fit has been called. In your example code you re-assigned clusterer to a new object with various parameters set before trying to access the labels. The new object had not been fit to any data yet.
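To illustrate why the re-assigned object has no labels, here is a minimal sketch using a toy stand-in class. `ToyClusterer` is hypothetical and not part of hdbscan; it only mimics the convention that attributes ending in an underscore are created inside `fit`, and its "clustering" is a placeholder that puts every point in cluster 0.

```python
class ToyClusterer:
    """Hypothetical stand-in for an estimator such as hdbscan.HDBSCAN."""

    def __init__(self, min_cluster_size=5):
        # The constructor only stores parameters; nothing is fitted yet.
        self.min_cluster_size = min_cluster_size

    def fit(self, data):
        # Fitted attributes (trailing underscore) are created here.
        # Placeholder result: every point is assigned to cluster 0.
        self.labels_ = [0 for _ in data]
        return self


data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]

clusterer = ToyClusterer()
clusterer.fit(data)
fitted_has_labels = hasattr(clusterer, "labels_")  # the fitted object has labels_

# Re-assigning the name creates a brand new, unfitted object,
# so accessing clusterer.labels_ here would raise AttributeError.
clusterer = ToyClusterer(min_cluster_size=10)
refit_needed = not hasattr(clusterer, "labels_")

# Fitting the new object makes labels_ available again.
clusterer.fit(data)
labels = clusterer.labels_
```

The same ordering applies to the real class: construct it with all parameters first, call `fit`, then read `labels_` from that same object.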

rick77777 commented 5 years ago

Thanks a lot, it worked for me. Now I understand what went wrong. Another question: when I passed the memory parameter I got a NameError, so I removed it from the parameter list since I don't know what it is used for. Can you please tell me what caused the error? And thanks a lot for helping out.

lmcinnes commented 5 years ago

I'm not sure what the error was, but that parameter is only really useful if you are planning to cache intermediate results to a temporary directory (which you would provide a path to as the value of the param) so you can re-run with many different min_cluster_size (and a single fixed min_samples) values quickly.
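A rough sketch of that caching workflow might look like the following. The helper name `sweep_min_cluster_size`, the default `min_samples=5`, and the use of a temporary directory are assumptions made for illustration; the only library behaviour relied on is that `memory` accepts a directory path.

```python
import tempfile


def sweep_min_cluster_size(data, sizes, min_samples=5, cache_dir=None):
    """Fit HDBSCAN once per min_cluster_size, sharing one cache directory.

    Hypothetical helper: min_samples stays fixed so that cached
    intermediate results can be reused between runs.
    """
    import hdbscan  # third-party; imported lazily so the module loads without it

    if cache_dir is None:
        # Assumption: a throwaway temporary directory is acceptable as the cache.
        cache_dir = tempfile.mkdtemp()

    results = {}
    for size in sizes:
        clusterer = hdbscan.HDBSCAN(
            min_cluster_size=size,
            min_samples=min_samples,
            memory=cache_dir,  # path to the directory used for caching
        )
        results[size] = clusterer.fit(data).labels_
    return results
```

Calling `sweep_min_cluster_size(blobs, [5, 10, 20])` would return a dict mapping each size to its label array. If you are not sweeping parameters like this, leaving `memory` out entirely is fine.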

rick77777 commented 5 years ago

Thanks for the explanation, Sir. It really helped a lot.