scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Outlier Detection - Help to understand the right parameter configuration. #116

Open dlop3469 opened 7 years ago

dlop3469 commented 7 years ago

Hi,

First of all, thanks for the library; I am really enjoying it. Sorry for the long thread, but I think this could also be informative for others in a similar situation, hence all the plots attached at the end.

I am currently working on a problem to understand the purchasing patterns of a given user so that I can spot anomalies in them, and I am struggling to find the min_cluster_size that will give me the best outcome.

After following the tutorial on outlier detection, I am getting different results for different configurations, but I don't know what the right process for choosing those parameters would be.

In this example I am clustering ~800 records of 5 variables each; I am thresholding the outlier scores at the 95th percentile and projecting the results to 2D with PCA for plotting (a sketch of my setup follows the questions below). Here are the questions that I hope you can help me with:

  1. Which of the condensed trees below, one per min_cluster_size, looks healthiest?
  2. The number of points above the 95th percentile of the outlier scores is very similar in all cases (~40 points). Should I treat all of them as equally anomalous, or could I focus on, say, just the top 10?
  3. Is there an easy way to figure out whether I am introducing variables that add noise and potentially affect the end result?
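
For reference, here is roughly the pipeline I am using (a simplified sketch; random data stands in for my real ~800 x 5 feature matrix):

```python
import numpy as np
import hdbscan
from sklearn.decomposition import PCA

# Stand-in for my real data: an ~800 x 5 feature matrix.
X = np.random.RandomState(42).randn(800, 5)

# Fit HDBSCAN; min_cluster_size is the parameter I am varying below.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5).fit(X)

# GLOSH outlier scores; filter non-finite values before thresholding.
scores = clusterer.outlier_scores_
finite = np.isfinite(scores)
threshold = np.percentile(scores[finite], 95)
outliers = np.where(finite & (scores >= threshold))[0]
print(f"{len(outliers)} points at or above the 95th percentile")

# 2D PCA projection used for the scatterplots.
X_2d = PCA(n_components=2).fit_transform(X)

# Condensed tree plots like the ones attached (requires matplotlib):
# clusterer.condensed_tree_.plot()
```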

Here is the outcome that I can see for various cluster sizes:

Scatterplot on PCA_2D

[screenshot]

HDBSCAN, min_cluster_size = 2

[5 screenshots]

HDBSCAN, min_cluster_size = 5

[5 screenshots]

HDBSCAN, min_cluster_size = 10

[5 screenshots]

HDBSCAN, min_cluster_size = 15

[5 screenshots]

HDBSCAN, min_cluster_size = 30

[5 screenshots]

Thanks for your help.

lmcinnes commented 7 years ago

Hi, thanks for all the detail! I have to admit that I know less about (and have less confidence in) the GLOSH outlier detection in hdbscan -- I implemented it according to the paper by Campello et al. and haven't studied it in the same depth that I have the clustering (for which I have some improvements coming -- stay tuned). What you have here looks like an excellent empirical study of what the outlier detection is doing. If you would be interested in writing up some of this when you are done, I would be more than happy to add it as documentation -- let me know if you're willing.

Now, as for what may be going on, and some intuitions ...

You can view the clustering as providing an empirical, "noise tolerant", smoothed, non-parametric estimate of the probability density from which the data samples were drawn. In other words, we are trying to build an estimate that can assign each point a probability of having been drawn, given the data we have. Obviously, doing that accurately with limited data is very hard (particularly as we are not assuming a known parameterised probability distribution, or mixture thereof), so what we get out isn't really a probability density, just a ballpark estimate. The outlier detection part then seeks points that are locally unlikely -- that is, given the probability density near them, they have a "surprisingly" low probability.
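
As a quick empirical illustration of that "surprisingly low probability" idea (a sketch only; clusterer is a fitted HDBSCAN object, as in your snippet above), the distribution of GLOSH scores should pile up near zero, with the outliers out in the right tail:

```python
import numpy as np
import matplotlib.pyplot as plt

# Histogram of GLOSH outlier scores from a fitted clusterer: the bulk of
# the data sits near 0; locally unlikely points live in the right tail.
finite = clusterer.outlier_scores_[np.isfinite(clusterer.outlier_scores_)]
plt.hist(finite, bins=50)
plt.axvline(np.percentile(finite, 95), color="red", linestyle="--",
            label="95th percentile cutoff")
plt.xlabel("GLOSH outlier score")
plt.ylabel("count")
plt.legend()
plt.show()
```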

How do the parameters affect this? First, note that there are ultimately two parameters that matter: min_cluster_size and min_samples. When min_samples isn't specified, it is simply set to whatever min_cluster_size was set to. In terms of intuition, you can think of min_cluster_size as controlling the "smoothing" of the probability density estimate, and min_samples as controlling the "noise tolerance". The higher you set min_cluster_size, the more we "smooth" the density among nearby points; the higher you set min_samples, the more the algorithm disregards noisy points when building its density estimate.
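
Concretely, a short sketch of how the default behaves versus setting the two parameters independently:

```python
import hdbscan

# When min_samples is omitted it defaults to min_cluster_size,
# so these two clusterers are configured identically:
a = hdbscan.HDBSCAN(min_cluster_size=15)
b = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=15)

# Setting them separately decouples density "smoothing"
# (min_cluster_size) from "noise tolerance" (min_samples):
c = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=5)
```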

If you just set min_cluster_size you'll be changing both parameters at the same time. While this is a useful default, it muddies things a little when you are studying what happens under varying parameters, as you are here. I would suggest specifying both parameters and fixing one while varying the other (and vice versa); a sketch of this follows. Hopefully that will give slightly more interpretable results (although, as I said, I have not worked with the GLOSH outlier detection nearly as much, so I honestly can't speak to it that well).
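
Something along these lines (a sketch; X stands in for your feature matrix) would hold min_samples fixed while varying min_cluster_size, and then the reverse:

```python
import numpy as np
import hdbscan

def outlier_count(X, mcs, ms, pct=95):
    """Fit HDBSCAN; count points at or above the pct-th percentile of GLOSH scores."""
    scores = hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit(X).outlier_scores_
    finite = np.isfinite(scores)
    threshold = np.percentile(scores[finite], pct)
    return int(np.sum(finite & (scores >= threshold)))

# Vary min_cluster_size with min_samples fixed ...
for mcs in (5, 10, 15, 30):
    print(f"mcs={mcs:>2}, ms=5  -> {outlier_count(X, mcs, 5)} outliers")

# ... then vary min_samples with min_cluster_size fixed.
for ms in (2, 5, 10, 15):
    print(f"mcs=15, ms={ms:>2} -> {outlier_count(X, 15, ms)} outliers")
```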

Hopefully this helps a little, but please feel free to ask more questions (and post more pictures) and hopefully we can work this out together.

dlop3469 commented 7 years ago

Hi, thanks for your answer!

Sure, once I finish the project (and make some sense of it) I will come back to you about documenting it.

Regarding the problem: after reading about the difference between those two parameters, the example in the link you mentioned makes more sense.

So, I have re-run all the permutations of the values I posted before and combined the plots so that they are easier to visualise. I can attach any of them separately if anyone wants more detail, but if you click on each image the resolution should be good enough to read through the data.

Thanks again. I hope this helps us better understand what the right values to use would be, and explain the variations in the condensed tree for each pair of values.

Here are the results:

[6 montage images]

lmcinnes commented 7 years ago

Thanks, it will take me a while to have time to pick through all this, but I certainly welcome any analysis and commentary you have to share as you look at how this all works.

dlop3469 commented 7 years ago

Hi,

Thanks for your help. This is what I can see so far:

What do you think -- can you spot a configuration that looks better or worse, based on your experience?

Thanks for your time again, I really appreciate your help.

lmcinnes commented 7 years ago

Sorry for not replying -- I've been pretty busy with a few other things and haven't had time to read, think about, and respond. Just letting you know that I appreciate the work, and I will endeavour to get back to you as soon as I have enough time to read and digest it.

vroon5 commented 4 years ago

@dlop3469: Hi David, thank you for sharing your issue. It was really informative for me, as I am starting to move from DBSCAN to HDBSCAN for outlier detection.

I was wondering whether the code you used to create these visualizations is in an open repository? Any link to the source would be appreciated.

Once again, thank you for sharing your thoughts about this algorithm.