Open den-run-ai opened 6 years ago
UMAP will tend to pull outliers in. It will find extreme outliers, but this is probably not the approach you want. I think the 'outlier' notions in this gist are closer to what you are after. Ultimately this is a sort of co-UMAP (reverse the arrows) for clustering, and dual co-UMAP for outlier detection. I haven't written code to do all of this efficiently yet, but it is on my to-do list.
@lmcinnes so essentially first pre-filter with hdbscan and then apply umap?
I think it really depends on what you are trying to do, but yes, something like that would be similar in spirit to Robust PCA. Then again, I think you really want some sort of regularized UMAP to do that properly. I would have to think about what that would mean or look like -- certainly an intriguing problem. Thanks for the ideas!
Sometimes the outliers are so bad that it is hard to regularize them away, and simply excluding them is easier. That's why, for example, I prefer RANSAC regression to regularizers for linear regression.
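To make the RANSAC point concrete, here is a minimal sketch in plain Python. It is a simplified illustration, not a reference implementation: the function names, the tolerance `tol`, and the toy data are all assumptions made for this example, and a fuller RANSAC would usually refit the model on the final inlier set.

```python
import random


def fit_line(p, q):
    """Slope and intercept of the line through two points."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


def ransac_line(points, n_iter=200, tol=0.5, seed=0):
    """Keep the two-point line that agrees with the most points.

    Points further than `tol` from the candidate line are ignored
    entirely instead of being down-weighted by a penalty term.
    """
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iter):
        p, q = rng.sample(points, 2)
        if p[0] == q[0]:
            continue  # a vertical pair cannot define y = a*x + b
        a, b = fit_line(p, q)
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers


# Toy data (made up): ten points on y = 2x + 1 plus two gross outliers.
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
points += [(3.0, 40.0), (7.0, -30.0)]
model, inliers = ransac_line(points)
```

The two outliers never enter the final fit, which is the "just exclude them" behaviour described above.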
That makes a lot of sense -- it certainly depends on the data and your use case. In that case, filtering things out with hdbscan would probably work well.
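A rough sketch of the "pre-filter with hdbscan, then apply umap" idea could look like the following. It assumes the `hdbscan`, `umap-learn`, and `numpy` packages are installed, and uses the `outlier_scores_` attribute of a fitted `hdbscan.HDBSCAN` as the filtering criterion. The helper names, the `keep_fraction` parameter, and its default of 0.9 are choices made for this illustration and are not anything prescribed in this thread.

```python
def inlier_mask(outlier_scores, keep_fraction=0.9):
    """Boolean mask keeping the `keep_fraction` of points with the lowest scores."""
    n = len(outlier_scores)
    n_keep = max(1, round(keep_fraction * n))
    order = sorted(range(n), key=lambda i: outlier_scores[i])
    kept = set(order[:n_keep])
    return [i in kept for i in range(n)]


def filter_then_embed(X, min_cluster_size=15, keep_fraction=0.9, **umap_kwargs):
    """Drop the highest-scoring outliers with HDBSCAN, then embed the rest with UMAP."""
    # Third-party packages, imported here so that inlier_mask stays usable
    # without them.
    import numpy as np
    import hdbscan
    import umap

    X = np.asarray(X)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit(X)
    keep = np.array(inlier_mask(clusterer.outlier_scores_.tolist(), keep_fraction))
    embedding = umap.UMAP(**umap_kwargs).fit_transform(X[keep])
    return embedding, keep
```

Calling `filter_then_embed(X, n_neighbors=15)` on a NumPy array returns the embedding of the retained points together with the mask, so the excluded rows can still be inspected as anomaly candidates. Another option is to drop only the points HDBSCAN labels as noise (`clusterer.labels_ == -1`); which criterion works better will depend on the data.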
Can I use umap for anomaly detection? Is the dimensionality reduction tolerant of outliers in the dataset, or do they totally screw up the results?
More generally, I'm looking for a generalization of robust PCA to nonlinear cases:
https://en.wikipedia.org/wiki/Robust_principal_component_analysis