scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Hyperparameter tuning and intrinsic quality metric(s) #110

Open rw opened 7 years ago

rw commented 7 years ago

I'm very happy with this HDBSCAN library--it's fast, well-documented, and easy to use.

Currently, I'm trying to decide if changes in my upstream featurization pipeline are improving or harming my clustering results. As my data is unlabeled, I need to use a built-in HDBSCAN metric to evaluate quality.

Is there an intrinsic measure that I can use to compare clustering runs with different data? As far as I can tell, GLOSH could be useful but I'm not sure where to start. In general, I'm looking for something similar to perplexity as used in topic modeling with LDA.

lmcinnes commented 7 years ago

In principle there is DBCV (as per Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J., 2014. Density-Based Clustering Validation. In SDM (pp. 839-847).) which is implemented in hdbscan as the validity_index function; higher scores mean "better" clusterings. Personally I am a little hesitant about this metric; it gets the right idea but I'm not convinced it is as robust as one would like. For the purposes of tuning upstream featurization it will probably do the trick.

rw commented 7 years ago

Thank you!

rw commented 7 years ago

I noticed validity_index doesn't necessarily work out-of-the-box; the distance metric manhattan is supported in training but not in computing the validity index.

lmcinnes commented 7 years ago

That's a bug. It should support all the same metrics. Let me see if I can fix that for you.

lmcinnes commented 7 years ago

Can I ask what error you are getting? I'm having trouble seeing any reason why it shouldn't work with manhattan distance.

rw commented 7 years ago

According to https://github.com/scipy/scipy/blob/master/scipy/spatial/distance.py there is no def manhattan(...) and thus no Manhattan distance function. Maybe the HDBSCAN code can validate against the list of available distance functions?

lmcinnes commented 7 years ago

That makes a lot more sense -- thanks for finding that. Yes, I should have some validation code with sensible error messages that make this clearer.

duichwer commented 4 years ago

Well, there is a function in https://github.com/scipy/scipy/blob/master/scipy/spatial/distance.py that implements the Manhattan distance. The only problem is that this function is called cityblock.

Maybe there should be some sort of mapping to the scipy distance functions.
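A rough sketch of what such a mapping could look like, combined with the validation idea raised earlier in the thread. The alias table and the resolve_metric helper are hypothetical and are not taken from the hdbscan code base; only scipy.spatial.distance and its cityblock function are real.

```python
from scipy.spatial import distance

# Hypothetical alias table: common metric names -> scipy function names.
METRIC_ALIASES = {
    "manhattan": "cityblock",
    "l1": "cityblock",
    "l2": "euclidean",
}


def resolve_metric(name):
    """Hypothetical helper: map a metric name to (scipy name, scipy function)."""
    canonical = METRIC_ALIASES.get(name, name)
    func = getattr(distance, canonical, None)
    # Loose check: any callable in scipy.spatial.distance passes, including
    # helpers such as pdist. A stricter version would use an explicit allow-list.
    if not callable(func):
        raise ValueError(
            f"Unknown metric {name!r}: no function named {canonical!r} "
            "in scipy.spatial.distance"
        )
    return canonical, func


name, func = resolve_metric("manhattan")
print(name, func([0, 0], [1, 2]))
```

With something along these lines, an unsupported metric would fail early with a readable message instead of surfacing as a missing attribute deep inside the validity computation.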