Open rw opened 7 years ago
In principle there is DBCV (as per Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J., 2014. Density-Based Clustering Validation. In SDM (pp. 839-847)), which is implemented in hdbscan as the `validity_index` function; higher scores mean "better" clusterings. Personally, I am a little hesitant about this metric; it has the right idea, but I'm not convinced it is as robust as one would like. For the purpose of tuning upstream featurization, it will probably do the trick.
Thank you!
I noticed `validity_index` doesn't necessarily work out of the box; the distance metric `manhattan` is supported in training but not in computing the validity index.
That's a bug. It should support all the same metrics. Let me see if I can fix that for you.
Can I ask what error you are getting? I'm having trouble seeing any reason why it shouldn't work with Manhattan distance.
According to https://github.com/scipy/scipy/blob/master/scipy/spatial/distance.py there is no `def manhattan(...)`, and thus no Manhattan distance function. Maybe the HDBSCAN code could validate against the list of available distance functions?
That makes a lot more sense -- thanks for finding that. Yes, I should add some validation code with sensible error messages to make this clearer.
Well, there is a function in https://github.com/scipy/scipy/blob/master/scipy/spatial/distance.py which implements the Manhattan distance. The only problem is that this function is called `cityblock`. Maybe there should be some sort of mapping to the scipy distance function names.
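A mapping layer along those lines might look like this sketch. The names `SCIPY_METRIC_ALIASES` and `resolve_metric` are hypothetical (not part of hdbscan or scipy), and the list of known metrics is abbreviated for illustration.

```python
# Hypothetical alias map from user-facing metric names to the names used
# in scipy.spatial.distance.
SCIPY_METRIC_ALIASES = {
    "manhattan": "cityblock",  # scipy calls Manhattan distance "cityblock"
    "l1": "cityblock",
    "l2": "euclidean",
}

# Abbreviated set of function names that scipy.spatial.distance actually provides.
KNOWN_SCIPY_METRICS = {
    "cityblock", "euclidean", "chebyshev", "minkowski", "cosine", "hamming",
}

def resolve_metric(name):
    """Map a user-facing metric name to the scipy function name, raising a
    clear error when no scipy implementation exists."""
    canonical = SCIPY_METRIC_ALIASES.get(name.lower(), name.lower())
    if canonical not in KNOWN_SCIPY_METRICS:
        raise ValueError(
            f"Unknown metric {name!r}; supported metrics: "
            f"{sorted(KNOWN_SCIPY_METRICS)}"
        )
    return canonical
```

With this in place, `resolve_metric("manhattan")` returns `"cityblock"`, and an unsupported name fails early with an actionable message instead of a confusing error from deep inside scipy.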
I'm very happy with this HDBSCAN library: it's fast, well documented, and easy to use.
Currently, I'm trying to decide if changes in my upstream featurization pipeline are improving or harming my clustering results. As my data is unlabeled, I need to use a built-in HDBSCAN metric to evaluate quality.
Is there an intrinsic measure I can use to compare clustering runs on different data? As far as I can tell, GLOSH could be useful, but I'm not sure where to start. In general, I'm looking for something similar to perplexity as used in topic modeling with LDA.