scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Outlier Detection - Help to understand the right parameter configuration. #116

Open dlop3469 opened 7 years ago

dlop3469 commented 7 years ago

Hi,

First of all, thanks for the library; I am really enjoying it. Sorry for the long thread, but I think this could also be informative for others in a similar situation, hence all the plots attached at the end.

I am currently working on a problem to understand the purchasing patterns of a given user so that I can spot anomalies in them, and I am struggling to find the min_cluster_size that will give me the best outcome.

After following the tutorial on outlier detection, I am getting different results for different configurations, but I don't know what the right process for choosing those parameters would be.

In this example I am clustering ~800 records of 5 variables each; I am thresholding the outlier scores at the 95th percentile and projecting the results to 2D with PCA for plotting (a sketch of my setup follows the questions below). Here are the questions that I hope you can help me with:

  1. Which of the condensed trees below, one per min_cluster_size, looks healthiest?
  2. The number of points above the 95th percentile of the outlier scores is very similar in all cases (~40 points). Should I treat all of them as equally anomalous, or could I focus on, say, just the top 10?
  3. Is there an easy way to figure out whether I am introducing variables that add noise and potentially affect the end result?
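
For reference, here is roughly the pipeline I am using (a simplified sketch; random data stands in for my real ~800 x 5 feature matrix):

```python
import numpy as np
import hdbscan
from sklearn.decomposition import PCA

# Stand-in for my real data: an ~800 x 5 feature matrix.
X = np.random.RandomState(42).randn(800, 5)

# Fit HDBSCAN; min_cluster_size is the parameter I am varying below.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5).fit(X)

# GLOSH outlier scores; filter non-finite values before thresholding.
scores = clusterer.outlier_scores_
finite = np.isfinite(scores)
threshold = np.percentile(scores[finite], 95)
outliers = np.where(finite & (scores >= threshold))[0]
print(f"{len(outliers)} points at or above the 95th percentile")

# 2D PCA projection used for the scatterplots.
X_2d = PCA(n_components=2).fit_transform(X)

# Condensed tree plots like the ones attached (requires matplotlib):
# clusterer.condensed_tree_.plot()
```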

Here is the outcome that I can see for various cluster sizes:

Scatterplot on PCA_2D

[screenshot]

HDBSCAN, min_cluster_size = 2

[5 screenshots]

HDBSCAN, min_cluster_size = 5

[5 screenshots]

HDBSCAN, min_cluster_size = 10

[5 screenshots]

HDBSCAN, min_cluster_size = 15

[5 screenshots]

HDBSCAN, min_cluster_size = 30

[5 screenshots]

Thanks for your help.

lmcinnes commented 7 years ago

Hi, thanks for all the detail! I have to admit that I know less about (and have less confidence in) the GLOSH outlier detection in hdbscan -- I implemented it according to the paper by Campello et al. and haven't studied it in the same depth that I have the clustering (for which I have some improvements coming -- stay tuned). What you have here looks like an excellent empirical study of what the outlier detection is doing. If you would be interested in writing up some of this when you are done, I would be more than happy to add it as documentation -- let me know if you're willing.

Now, as for what may be going on, and some intuitions ...

You can view the clustering as providing an empirical, "noise tolerant", smoothed, non-parametric estimate of the probability density from which the data samples were drawn. In other words, we are trying to build an estimate that can assign each point a probability of having been drawn, given the data we have. Obviously, doing that accurately with limited data is very hard (particularly as we are not assuming a known parameterised probability distribution, or mixture thereof), so what we get out isn't really a probability density, just a ballpark estimate. The outlier detection part then seeks points that are locally unlikely -- that is, given the probability density near them, they have a "surprisingly" low probability.
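
As a quick empirical illustration of that "surprisingly low probability" idea (a sketch only; clusterer is a fitted HDBSCAN object, as in your snippet above), the distribution of GLOSH scores should pile up near zero, with the outliers out in the right tail:

```python
import numpy as np
import matplotlib.pyplot as plt

# Histogram of GLOSH outlier scores from a fitted clusterer: the bulk of
# the data sits near 0; locally unlikely points live in the right tail.
finite = clusterer.outlier_scores_[np.isfinite(clusterer.outlier_scores_)]
plt.hist(finite, bins=50)
plt.axvline(np.percentile(finite, 95), color="red", linestyle="--",
            label="95th percentile cutoff")
plt.xlabel("GLOSH outlier score")
plt.ylabel("count")
plt.legend()
plt.show()
```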

How do the parameters affect this? First, note that there are ultimately two parameters that matter: min_cluster_size and min_samples. When min_samples isn't specified, it is simply set to whatever min_cluster_size was set to. In terms of intuition, you can think of min_cluster_size as controlling the "smoothing" of the probability density estimate, and min_samples as controlling the "noise tolerance". The higher you set min_cluster_size, the more we "smooth" the density among nearby points; the higher you set min_samples, the more the algorithm disregards noisy points when building its density estimate.
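
Concretely, a short sketch of how the default behaves versus setting the two parameters independently:

```python
import hdbscan

# When min_samples is omitted it defaults to min_cluster_size,
# so these two clusterers are configured identically:
a = hdbscan.HDBSCAN(min_cluster_size=15)
b = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=15)

# Setting them separately decouples density "smoothing"
# (min_cluster_size) from "noise tolerance" (min_samples):
c = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=5)
```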

If you just set min_cluster_size you'll be changing both parameters at the same time. While this is a useful default, it muddies things a little when you are studying what happens under varying parameters, as you are here. I would suggest specifying both parameters and fixing one while varying the other (and vice versa); a sketch of this follows. Hopefully that will give slightly more interpretable results (although, as I said, I have not worked with the GLOSH outlier detection nearly as much, so I honestly can't speak to it that well).
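
Something along these lines (a sketch; X stands in for your feature matrix) would hold min_samples fixed while varying min_cluster_size, and then the reverse:

```python
import numpy as np
import hdbscan

def outlier_count(X, mcs, ms, pct=95):
    """Fit HDBSCAN; count points at or above the pct-th percentile of GLOSH scores."""
    scores = hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit(X).outlier_scores_
    finite = np.isfinite(scores)
    threshold = np.percentile(scores[finite], pct)
    return int(np.sum(finite & (scores >= threshold)))

# Vary min_cluster_size with min_samples fixed ...
for mcs in (5, 10, 15, 30):
    print(f"mcs={mcs:>2}, ms=5  -> {outlier_count(X, mcs, 5)} outliers")

# ... then vary min_samples with min_cluster_size fixed.
for ms in (2, 5, 10, 15):
    print(f"mcs=15, ms={ms:>2} -> {outlier_count(X, 15, ms)} outliers")
```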

Hopefully this helps a little, but please feel free to ask more questions (and post more pictures) and hopefully we can work this out together.

dlop3469 commented 7 years ago

Hi, thanks for your answer!

Sure, once I finish the project (and make some sense of it) I will come back to you about documenting it.

Regarding the problem: after reading about the difference between those two parameters, the example in the link you mentioned makes more sense.

So, I have re-run all the permutations of the values I posted before and combined the plots so that they are easier to visualise. I can attach any of them separately if anyone wants more detail, but if you click on each image the resolution should be good enough to read through the data.

Thanks again. I hope this helps us better understand what the right values to use would be, and explain the variations in the condensed tree for each pair of values.

Here are the results:

[6 montage images]

lmcinnes commented 7 years ago

Thanks, it will take me a while to have time to pick through all this, but I certainly welcome any analysis and commentary you have to share as you look at how this all works.

dlop3469 commented 7 years ago

Hi,

Thanks for your help. This is what I can see so far:

What do you think -- can you spot a configuration that looks better or worse, based on your experience?

Thanks for your time again, I really appreciate your help.

lmcinnes commented 7 years ago

Sorry for not replying -- I've been pretty busy with a few other things and haven't had time to read, think about, and respond. Just letting you know that I appreciate the work, and I will endeavour to get back to you as soon as I have enough time to read and digest it.

vroon5 commented 4 years ago

@dlop3469: Hi David, thank you for sharing your issue. It was really informative for me, as I am starting to move from DBSCAN to HDBSCAN for outlier detection.

I was wondering whether the code you used to create these visualizations is in an open repository? Any link to the source would be appreciated.

Once again, thank you for sharing your thoughts about this algorithm.