kenuxi / EVA

IoSL SS 2020

Anomalies labeled datasets #2

Closed alrod97 closed 4 years ago

alrod97 commented 4 years ago

Right now, we can already use PCA, TSNE, LLE, and Kernel PCA to reduce the dimensionality of high-dimensional data to visualize the outliers. Nevertheless, we still need some "good" labeled anomaly CSV datasets. The idea of the project is to investigate which of these algorithms works best for anomaly visualization.
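As a minimal sketch (not the project's actual code) of the setup described above, the four algorithms can all be run through scikit-learn on synthetic data with a few injected outliers:

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import TSNE, LocallyLinearEmbedding

rng = np.random.default_rng(0)
# 200 "normal" points in 10-D plus 5 injected outliers far from the cluster
X_normal = rng.normal(0, 1, size=(200, 10))
X_outlier = rng.normal(8, 1, size=(5, 10))
X = np.vstack([X_normal, X_outlier])
labels = np.array([0] * 200 + [1] * 5)  # 1 marks an anomaly

reducers = {
    "PCA": PCA(n_components=2),
    "KernelPCA": KernelPCA(n_components=2, kernel="rbf"),
    "LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10),
    "TSNE": TSNE(n_components=2, perplexity=30, random_state=0),
}
# Each reducer maps the (205, 10) data down to (205, 2) for plotting
embeddings = {name: red.fit_transform(X) for name, red in reducers.items()}
for name, emb in embeddings.items():
    print(name, emb.shape)
```

The 2-D embeddings can then be scatter-plotted with the anomaly labels as colors to compare how clearly each algorithm separates the outliers.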

romaresccoa commented 4 years ago

@ first point: I don't think it's necessary to do so. LLE and TSNE are, generally speaking, better algorithms for dimensionality reduction since they can capture non-linear dependencies and are more "sophisticated". @ second point: True, I don't see an easy way to create anomalies there. @ third point: I agree. There's a script for that already.

alrod97 commented 4 years ago

I realized that TSNE is the best algorithm for visualizing the outliers in the MNIST dataset. The 'mnist_zero_one' dataset consists of approx. 7000 ones ('1') and approx. 70 zeros ('0'). The TSNE algorithm produces the following plots when reducing the data to 2 or 3 dimensions.

[Screenshots, 2020-06-26: TSNE embeddings of the 'mnist_zero_one' dataset in 2 and 3 dimensions]

It seems that the MNIST dataset is a common example people use to demonstrate the "power" of TSNE: https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b
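The imbalanced two-class setup described above can be sketched as follows. This is a hedged stand-in, not the project's script: it uses scikit-learn's small built-in digits dataset instead of the real 'mnist_zero_one' file (which has ~7000 ones and ~70 zeros), subsampling the zeros to a similarly extreme ratio:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
ones = digits.data[digits.target == 1]       # majority class
zeros = digits.data[digits.target == 0][:5]  # keep only a few "anomalies"
X = np.vstack([ones, zeros])
y = np.array([1] * len(ones) + [0] * len(zeros))

# Reduce to 2-D; the rare zeros should separate from the cloud of ones
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)
```

Plotting `emb` colored by `y` reproduces the kind of 2-D outlier visualization shown in the screenshots (swap in `n_components=3` for the 3-D version).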

uttamdhakal commented 4 years ago

It looks like UMAP works better than TSNE and preserves the global structure better in most cases:

https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668

https://towardsdatascience.com/tsne-vs-umap-global-structure-4d8045acba17