scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

The Proper Way of Calculating Performance Metrics #283

Closed · psygo closed this issue 5 years ago

psygo commented 5 years ago

I didn't find this in the documentation (which is awesome, in my opinion), so I figured I could maybe ask it here.

Can you simply use common performance metrics with HDBSCAN? For example, one thing I tried with the dataset I'm working on is to remove the data considered noise (the results with the noise weren't that much different) and apply the Silhouette Index, the Calinski-Harabasz Index, and the Davies-Bouldin Index to what remains. However, I have no way of knowing whether this is correct. CHI and DBI might be OK, but the SIL is really weird: when I calculate it for K-Means I get close to 0.8, while for HDBSCAN I get a horrible 0 (actually, -0.03). So what is the proper way of using these performance metrics (or any others) with HDBSCAN?
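For reference, this is roughly the kind of thing I'm doing (a sketch on synthetic data rather than my real dataset; `min_cluster_size` is just a placeholder):

```python
# Sketch of the comparison described above: fit HDBSCAN, drop the points
# labelled as noise (-1), then score the remaining clustered points.
import hdbscan
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X)
mask = labels != -1  # keep only points assigned to a cluster

print("silhouette:       ", silhouette_score(X[mask], labels[mask]))
print("calinski-harabasz:", calinski_harabasz_score(X[mask], labels[mask]))
print("davies-bouldin:   ", davies_bouldin_score(X[mask], labels[mask]))
```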

lmcinnes commented 5 years ago

Performance metrics often match up with clustering algorithms -- in that there is almost always an algorithm that optimizes the same objective the metric measures (or, less charitably, there is almost always a clustering metric that makes a given clustering algorithm look best). A lot of metrics are not geared toward density based clustering and tend to assume spherical clusters. You might consider DBCV, which is a metric designed for density based clustering. There is an implementation in validation.py in the hdbscan package.
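Something along these lines should work (an untested sketch; check the module for the exact function name and signature):

```python
# Untested sketch: score an HDBSCAN clustering with DBCV using the
# implementation shipped in the hdbscan package.
import hdbscan
from hdbscan.validity import validity_index
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X)

# DBCV ranges over [-1, 1]; higher is better. I believe noise points
# (label -1) are accounted for by the implementation itself.
score = validity_index(X.astype("float64"), labels)
print("DBCV:", score)
```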

psygo commented 5 years ago

Thanks for the reply. (I think the file is called validity.py?)

Well, validating this yielded worse results than I expected T.T (-0.7).

On a side note -- it seems you're way, way beyond me when it comes to clustering -- would you recommend scaling the data (which I did) before applying a density-based clustering algorithm?

lmcinnes commented 5 years ago

I think ultimately it is the comparison among hyperparameters and with other algorithms that will be most telling.

As to scaling data -- are you scaling features independently? If so, then that can be either sensible or a bad idea, very much depending on context. Ultimately what matters for most clustering applications is the distance between points. Rescaling features (or the data in general) is an attempt to make easy distance measures, like Euclidean distance, meaningful for your data. If you have features on different scales that are of roughly equal importance then you'll want to rescale them to all have similar ranges. On the other hand, if you have, say, pixel values in an image then there is an inherent scale, and rescaling based on the range of values in the data would effectively over-weight the importance of pixels that are more likely to take on low values. The question is really about what distance should mean for your data, and what would make the most sense for your specific domain. That I can't answer for you, unfortunately.
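For the equal-importance case, the standard recipe would be something like the following sketch (standardize each feature, then cluster; whether this is appropriate is exactly the domain question above):

```python
# Sketch: per-feature standardization before density-based clustering.
# Only sensible when the features really are of roughly equal importance.
import hdbscan
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X_scaled)
```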

psygo commented 5 years ago

After thinking a little bit more about the problem of having multiple indexes that could be offering mixed opinions on which algorithm is best, I'm wondering about simply approximating the pdfs of the found cluster distributions and calculating the area under the curve of these functions' product (or maybe their difference?). The closer the area gets to zero, the better the clustering.

However, it would be difficult to analyze (graphically) high-dimensional pdfs in those terms; visualizations would yield partially incorrect views of the data, since there might be interactions between the variables. Thus, I would suggest using PCA (or maybe kernel PCA if the data is highly non-linear) to orthogonalize the data before running the clustering algorithm, which would "guarantee" that there are no feature interactions.
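Concretely, I'm picturing something like this (a rough sketch; the dataset and parameters are placeholders, and the preprocessing choice is the part I'm unsure about):

```python
# Rough sketch of the pipeline I have in mind: decorrelate the features
# with (kernel) PCA, then run HDBSCAN on the transformed data.
import hdbscan
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Linear case; swap in KernelPCA(kernel="rbf") if the data is highly non-linear.
X_pca = PCA(n_components=0.95, svd_solver="full").fit_transform(X)

labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X_pca)
```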

What do you think about my idea? Do you see any mistakes (or, hopefully, possible improvements)?

lmcinnes commented 5 years ago

What you describe is essentially what hdbscan does. First it attempts to approximate the pdf; then, given an approximate pdf, it attempts to essentially optimize the area of the pdf captured by a clustering (where the clustering is derived from the tree representation of the pdf). Of course all of this is largely computationally intractable, so it all comes down to how one performs these approximations. You can try reading section 2.1, "Statistically Motivated HDBSCAN*", of our paper on the topic and see if it aligns with what you have in mind.

psygo commented 5 years ago

It's clear that I need to study more. In my defense, I wouldn't say it's laziness, but rather the feeling that I didn't know what would be enough to solve my problems. Anyway, before I close this issue, do you mind addressing some final thoughts? :[]

So, HDBSCAN optimizes directly for the index I had mentioned... is there a way of extracting that index from HDBSCAN's fit? And wouldn't this index be useful for evaluating any clustering algorithms?

Which books would you recommend to someone starting out in this area? Since the documentation of HDBSCAN has an educational feel to it, it would be nice to have a section on what (and in what order) to read to fully understand what you're doing.

lmcinnes commented 5 years ago

HDBSCAN optimizes for a computable approximation of the metric you are suggesting, where the approximation involves cutting significant corners. Which corners to cut, where, and how, make up most of the difference between various density based clustering algorithms. There isn't necessarily an obvious right answer.

On the other side of things, there isn't a correct metric for clustering either. Ultimately there is no ground truth, no single true objective measure, for unsupervised learning tasks. That is kind of the point -- for whatever measure you divine, there is ostensibly an algorithm to optimize against that measure, or some acceptable approximation thereof. This is more of a philosophical argument, but I would recommend Christian Hennig's "What are the true clusters?" as a good resource on this.