lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

detection of outliers (anomaly detection) using umap - robust dimension reduction #42

Open den-run-ai opened 6 years ago

den-run-ai commented 6 years ago

Can I use UMAP for anomaly detection? Is the dimensionality reduction tolerant of outliers in the dataset, or do they totally screw up the results?

More generally, I'm looking for a generalization of robust PCA to nonlinear cases:

https://en.wikipedia.org/wiki/Robust_principal_component_analysis
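
For reference, the standard convex formulation of robust PCA (principal component pursuit) splits a data matrix $M$ into a low-rank part $L$ and a sparse outlier part $S$. The symbols and the weight $\lambda$ follow the usual convention for that method; none of them are stated in this thread:

```latex
\min_{L,\,S} \; \|L\|_{*} + \lambda \,\|S\|_{1}
\quad \text{subject to} \quad L + S = M
```

Here $\|L\|_{*}$ is the nuclear norm (the sum of singular values) and $\|S\|_{1}$ is the entrywise $\ell_1$ norm. Loosely, a nonlinear analogue would swap the low-rank term for a manifold model, which is the open question being asked.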

lmcinnes commented 6 years ago

UMAP will tend to pull outliers in. It will find extreme outliers, but this is probably not the approach you want. I think the 'outlier' notions in this gist are more what you are after. Ultimately this is a sort of co-UMAP (reverse the arrows) for clustering, and dual co-UMAP for outlier detection. I haven't written code to do all of this efficiently yet, but it is on my todo list.

den-run-ai commented 6 years ago

@lmcinnes so essentially first pre-filter with hdbscan and then apply umap?

lmcinnes commented 6 years ago

I think it really depends on what you are trying to do, but yes, something like that would bear similarities to Robust PCA. Then again, I think you really want some sort of regularized UMAP to do that properly. I would have to think about what that would mean/look like -- certainly an intriguing problem. Thanks for the ideas!

den-run-ai commented 6 years ago

Sometimes the outliers are so bad that it is hard to regularize them; simply excluding them is easier. That is why, for example, I prefer RANSAC regression to regularizers for linear regression.
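
To make the "exclude rather than regularize" point concrete, here is a minimal pure-Python sketch of the RANSAC idea for a straight-line fit. It is not scikit-learn's `RANSACRegressor`; the function names, the residual tolerance `tol`, and the trial count are hypothetical choices for illustration.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def ransac_line(points, n_trials=200, tol=0.5, seed=0):
    """Fit a line using only the largest set of points that agree with it."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_trials):
        # Propose a candidate line through two randomly chosen points.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical candidate, cannot be written as y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # Points within tol of the candidate form its consensus set.
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        raise ValueError("no usable consensus set found")
    # Refit on the consensus set only: outliers are dropped, not down-weighted.
    return fit_line(best_inliers), best_inliers


# Twenty points on y = 2x + 1 plus three gross outliers (made-up data).
points = [(float(x), 2.0 * x + 1.0) for x in range(20)]
points += [(3.0, 100.0), (10.0, -80.0), (15.0, 60.0)]
(slope, intercept), kept = ransac_line(points)
```

On this toy data the plain least-squares fit from `fit_line(points)` is dragged away from the true slope by the three bad points, whereas the RANSAC refit never sees them.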

lmcinnes commented 6 years ago

That makes a lot of sense -- it does certainly depend on the data and your use case. In that case, filtering things out with hdbscan would probably work well.
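
The pre-filtering workflow discussed above could be sketched roughly as follows, assuming the `hdbscan` and `umap-learn` packages are installed. The helper names and the `max_score` cutoff of 0.9 are hypothetical and would need tuning per dataset. This is only one reading of "filter with hdbscan, then apply umap", not a method endorsed in the thread beyond that description.

```python
def inlier_mask(labels, outlier_scores, max_score=0.9):
    """True for points HDBSCAN assigned to a cluster and scored at or below max_score."""
    return [
        label != -1 and score <= max_score
        for label, score in zip(labels, outlier_scores)
    ]


def prefilter_then_embed(X, min_cluster_size=15, max_score=0.9, n_neighbors=15):
    """Drop points HDBSCAN flags as outliers, then embed the rest with UMAP."""
    # Imported lazily: all three are third-party, and umap is slow to import.
    import numpy as np
    import hdbscan
    import umap

    X = np.asarray(X)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit(X)
    # labels_ is -1 for noise points; outlier_scores_ holds GLOSH scores,
    # where higher means more outlier-like.
    keep = np.array(
        inlier_mask(clusterer.labels_, clusterer.outlier_scores_, max_score)
    )
    # Only the retained rows are embedded, so outliers cannot distort the layout.
    embedding = umap.UMAP(n_neighbors=n_neighbors).fit_transform(X[keep])
    return keep, embedding
```

The function returns the boolean mask alongside the embedding so that the excluded rows, `X[~keep]`, can be inspected separately as the candidate anomalies.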