@ first point: I don't think it's necessary to do so. LLE and t-SNE are generally better algorithms for dimensionality reduction, since they can capture nonlinear dependencies and are more "sophisticated".
@ second point: true, I don't see an easy way to create anomalies there.
@ third point: I agree. There's a script for that already.
I realized that t-SNE is the best algorithm for visualizing the outliers in the MNIST dataset. The 'mnist_zero_one' dataset consists of approx. 7000 ones ('1') and approx. 70 zeros ('0'). t-SNE produces the following plots when reducing the data to 2 or 3 dimensions.
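For reference, a minimal sketch of the kind of code that produces these plots (assuming `X` is the flattened `(n_samples, 784)` image matrix and `y` the integer digit labels loaded from 'mnist_zero_one'):

```python
# Sketch: t-SNE on the imbalanced 0/1 MNIST subset, with '0' as the rare
# (outlier) class. Use n_components=3 for the 3-D variant of the plot.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

X_2d = TSNE(n_components=2, random_state=42).fit_transform(X)

plt.scatter(X_2d[y == 1, 0], X_2d[y == 1, 1], s=5, label="inliers ('1')")
plt.scatter(X_2d[y == 0, 0], X_2d[y == 0, 1], s=20, c="red", label="outliers ('0')")
plt.legend()
plt.show()
```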
It seems that the MNIST dataset is a common example people use to demonstrate the "power" of t-SNE: https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b
It looks like UMAP works better than t-SNE and preserves the global structure in most cases:
https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668
https://towardsdatascience.com/tsne-vs-umap-global-structure-4d8045acba17
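If we want to try it, the umap-learn package exposes a scikit-learn-style API (a sketch, not part of the project yet; same `X` assumption as in the t-SNE snippet above):

```python
# Sketch using umap-learn (an extra dependency we'd have to add).
import umap

X_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(X)
```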
Right now we can already use PCA, t-SNE, LLE, and Kernel PCA to reduce high-dimensional data to 2/3 dimensions and visualize the outliers. Nevertheless, we still need some "good" labeled anomaly CSV data. The idea of the project is to investigate which of these algorithms works best for anomaly visualization.
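A side-by-side comparison could look roughly like this (a sketch; `X` and `y` stand for the features and inlier/outlier labels from whatever labeled CSV we end up with):

```python
# Sketch: compare the four reducers we already support on one labeled dataset.
# y is assumed to be 0 = inlier, 1 = outlier.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import TSNE, LocallyLinearEmbedding

reducers = {
    "PCA": PCA(n_components=2),
    "Kernel PCA": KernelPCA(n_components=2, kernel="rbf"),
    "t-SNE": TSNE(n_components=2, random_state=42),
    "LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10),
}

fig, axes = plt.subplots(1, len(reducers), figsize=(16, 4))
for ax, (name, reducer) in zip(axes, reducers.items()):
    X_2d = reducer.fit_transform(X)
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="coolwarm", s=5)
    ax.set_title(name)
plt.show()
```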
Someone should try to find, e.g., a labeled data frame ('outliers' and 'inliers') where PCA clearly works much better than LLE or t-SNE, so that the outliers are clearly separated from the inliers in the 2/3-dimensional space. The same for a dataset where LLE clearly works better than the others.
My guess is that PCA will fail on nonlinear datasets such as the Swiss roll. Maybe one could add some outliers to this dataset and see how LLE performs vs. PCA (LLE should perform better); see the sketch below.
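Something like this could serve as a quick test (a sketch; the number and range of the injected outliers are arbitrary placeholders):

```python
# Sketch: Swiss roll with a few injected off-manifold outliers, to test the
# guess that LLE separates them better than PCA.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.RandomState(0)
X, _ = make_swiss_roll(n_samples=2000, noise=0.05, random_state=0)
outliers = rng.uniform(X.min(), X.max(), size=(20, 3))  # random points off the manifold
X_all = np.vstack([X, outliers])
y = np.r_[np.zeros(len(X)), np.ones(len(outliers))]  # 1 = outlier

X_pca = PCA(n_components=2).fit_transform(X_all)
X_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=12).fit_transform(X_all)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for ax, X_2d, title in [(ax1, X_pca, "PCA"), (ax2, X_lle, "LLE")]:
    ax.scatter(X_2d[y == 0, 0], X_2d[y == 0, 1], s=5, label="inliers")
    ax.scatter(X_2d[y == 1, 0], X_2d[y == 1, 1], s=20, c="red", label="outliers")
    ax.set_title(title)
    ax.legend()
plt.show()
```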
We should also try high-dimensional data, e.g. images from the MNIST dataset. "Real data" such as images or something else would also be nice.