scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Outlier detection question #89

Closed dlop3469 closed 7 years ago

dlop3469 commented 7 years ago

Hi,

First of all, thanks for the implementation, I think you are doing an amazing job on this project.

My question is regarding outlier/anomaly detection. I am currently building an application to identify anomalies from a set of records and sort them based on their "importance" (so they can be reviewed later), and the GLOSH implementation is helping a lot since it has the notion of scoring.

As documented, I can just focus on the points above the 90th percentile and list the results in descending order of their `outlier_scores_` values.

However, since the algorithm also labels some data as "noise" (-1), I am confused about how to sort those results: should a noise point (a global outlier) take precedence over a point with the highest outlier score?

Could you give me an explanation of what would be the right approach to sort those results?

Thanks

lmcinnes commented 7 years ago

GLOSH provides a local notion of outlier, so even a point that is not labelled as noise will still get a very high outlier score if it is sufficiently far from a dense area. If you are interested primarily in anomalies, meaning things that are somewhat different from what one might expect (which may mean "similar, but just different enough from something where most instances are almost identical"), then you want to focus on GLOSH scores. If you are attempting to filter out "noise" that is in the background, then you should worry more about the points that the clustering labels as noise (-1 labels). Depending on your use case, both are valuable. Since you sound more interested in ranking anomalies than in finding points that are background noise, I would recommend focusing on GLOSH, but obviously you know and understand your application better than I do.
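To make the distinction concrete, here is a minimal sketch of ranking by GLOSH score while looking at the noise labels separately. It is only an illustration: the helper name `rank_anomalies` and the toy arrays are hypothetical, and the arrays stand in for the `outlier_scores_` and `labels_` attributes of a fitted `hdbscan.HDBSCAN` clusterer.

```python
import numpy as np


def rank_anomalies(outlier_scores, percentile=90.0):
    """Return (indices, threshold): points scoring above the given
    percentile of all scores, ordered from highest score to lowest.

    Hypothetical helper for illustration; not part of the hdbscan API.
    """
    scores = np.asarray(outlier_scores, dtype=float)
    threshold = np.percentile(scores, percentile)
    candidates = np.flatnonzero(scores > threshold)
    # Stable sort on negated scores gives descending order.
    order = np.argsort(-scores[candidates], kind="stable")
    return candidates[order], threshold


# Toy stand-ins (assumed values) for clusterer.outlier_scores_ and
# clusterer.labels_ after fitting hdbscan.HDBSCAN on real data.
scores = np.array([0.05, 0.10, 0.95, 0.20, 0.15, 0.80, 0.02, 0.30, 0.12, 0.35])
labels = np.array([0, 0, -1, 0, 1, 1, 1, 1, 0, -1])

# Local view: rank by GLOSH score, regardless of cluster label.
ranked, threshold = rank_anomalies(scores, percentile=80)

# Global view: points the clustering itself labelled as noise.
noise = np.flatnonzero(labels == -1)

print("ranked anomalies:", ranked.tolist())
print("noise points:", noise.tolist())
```

In this toy data, point 5 belongs to a cluster but still ranks as an anomaly, while point 9 is noise yet falls below the score threshold, which is why the two views are worth keeping apart.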

dlop3469 commented 7 years ago

Thanks, that answered my question. From your answer it seems that I am indeed more interested in the anomalies themselves, since I want to spot records that diverge from a pattern. I think I will still report background noise as a separate result, so that it can be manually assessed as important or not.
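A rough sketch of that two-list report, assuming the inputs are the `labels_` and `outlier_scores_` arrays of a fitted `hdbscan.HDBSCAN` clusterer. The function name `build_review_lists`, the dictionary keys, and the default 90th-percentile cutoff are choices made for this example only.

```python
import numpy as np


def build_review_lists(labels, outlier_scores, percentile=90.0):
    """Split points into ranked anomalies and remaining noise points.

    Hypothetical helper for illustration; not part of the hdbscan API.
    """
    labels = np.asarray(labels)
    scores = np.asarray(outlier_scores, dtype=float)

    threshold = np.percentile(scores, percentile)
    is_anomaly = scores > threshold  # high GLOSH score (local outlier)
    is_noise = labels == -1          # labelled noise by the clustering

    def by_score(mask):
        # Indices selected by the mask, highest outlier score first.
        idx = np.flatnonzero(mask)
        return idx[np.argsort(-scores[idx], kind="stable")].tolist()

    return {
        # Main result: everything above the score threshold.
        "anomalies": by_score(is_anomaly),
        # Separate result: noise points not already listed above.
        "noise_only": by_score(is_noise & ~is_anomaly),
    }
```

Noise points that also exceed the score threshold appear only in the anomaly list, so a reviewer does not see the same record twice.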