kenuxi / EVA

IoSL SS 2020

EVA

Vision is arguably one of our greatest strengths as humans. Even the most complex ideas tend to become much easier to understand as soon as we find a way to visualize them. The flip side is that datasets with more than 3 dimensions cannot be plotted directly, so it can be challenging just to understand what’s going on inside of them.

Anomaly detection is a subfield of AI concerned with finding unusual cases in datasets. A machine learning algorithm learns the "patterns" that the majority of cases adhere to and then singles out the few cases that deviate from those normal patterns. Appropriate projections of high-dimensional datasets to two dimensions often help by isolating anomalies in the image of those projections. The anomalies can then easily be spotted by a human just by looking at the image. Many projection techniques exist, each with different strengths and weaknesses.

We have developed a modular, extensible anomaly visualization framework with a graphical user interface that allows users to evaluate different data visualization techniques on user-provided datasets.

Installing

Create a new Python 3 virtual environment:

python3 -m venv venv

and source it:

source venv/bin/activate

Install libraries:

pip install -r requirements.txt

Run the app:

python run.py

Go to this address in your web browser:

http://127.0.0.1:5000

Docker

Create docker image

docker build -t name:tag . (example: docker build -t eva:latest .; Docker image names must be lowercase)

Run the app

docker run -p 5000:port name:tag (example: docker run -p 5000:5000 eva:latest)

Documentation

Usage

After you run the server on http://127.0.0.1:5000, you will be greeted with this home screen.

home

Here, you can either choose one of the existing datasets in /data or upload your own. After you upload a dataset, it is listed in the dropdown, where you can select and submit it. Once you submit the dataset, a table is created so that you can have a peek into it. If the dataset is too large (> 5000 datapoints), only the head of the dataset is displayed.

table

For smaller datasets, however, you can navigate through the displayed table.

table1

Now you can apply a filter to your dataset and choose the label column. You can also choose 'None'.

filter

Now for the main part: you will have the opportunity to choose an algorithm to apply to the dataset as well as the visualisation technique you prefer.

For algorithm, you can choose between:

As visualisation methods, you can choose between:

pick

After you choose the algorithms and visualisation methods, you will be redirected to a page that shows the plots obtained from the chosen algorithms.

Dimensionality reduction algorithms

Dimensionality reduction algorithms help with understanding data through visualisation. The main idea is to embed high-dimensional points (D > 3) into a lower-dimensional space, usually 2 or 3 dimensions, so that the data can be plotted.

EVA implements the following dimensionality reduction algorithms:

PCA

Info

Principal component analysis (PCA) is a classic algorithm for dimensionality reduction. PCA transforms points from the original space into a space of uncorrelated features of the given dataset via eigendecomposition of the covariance matrix. PCA is linear by design, meaning that it cannot capture non-linear correlations.

https://en.wikipedia.org/wiki/Principal_component_analysis
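
As a rough illustration of the description above, the following sketch computes a 2D PCA embedding in plain Python. It is not EVA's implementation: to stay dependency-free it approximates the leading eigenvectors of the covariance matrix by power iteration with deflation instead of a full eigendecomposition, and the function names and the sample data are made up for this example. In practice a library routine such as scikit-learn's PCA would normally be used.

```python
import math
import random


def center(points):
    """Subtract the per-feature mean from every point."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    return [[p[j] - means[j] for j in range(d)] for p in points]


def covariance(centered):
    """Sample covariance matrix of already-centered points."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[a] * row[b] for row in centered) / (n - 1)
             for b in range(d)] for a in range(d)]


def leading_eigenpair(matrix, iterations=500, seed=0):
    """Approximate the largest eigenvalue and its eigenvector by power iteration."""
    d = len(matrix)
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(d)]
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in any direction
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * mv[i] for i in range(d))
    return eigenvalue, v


def pca(points, n_components=2):
    """Project points onto their first n_components principal directions."""
    centered = center(points)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        eigenvalue, v = leading_eigenpair(cov)
        components.append(v)
        # Deflation: remove the variance explained by the component just found.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(row[j] * comp[j] for j in range(d)) for comp in components]
            for row in centered]


# Hypothetical 3D sample: most of the variance lies along one direction.
sample = [(t, 2.0 * t + (-1) ** t, 0.5 * (t % 3)) for t in range(12)]
embedding = pca(sample)  # list of 12 two-dimensional points
```

Because every step is a linear operation on the data, curved structures such as the FishBowl sphere cannot be unrolled this way, which is the limitation mentioned above.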

Adjustable parameters

LLE

https://cs.nyu.edu/~roweis/lle/papers/lleintro.pdf

Adjustable parameters

TSNE

http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Adjustable parameters

UMAP

Info

Uniform Manifold Approximation and Projection (UMAP) is a general dimensionality reduction algorithm using topological tools. UMAP assumes that the data is uniformly distributed on a Riemannian manifold which is locally connected, and that the Riemannian metric is locally constant or can be approximated as such. UMAP models the data manifold with a fuzzy topological structure and embeds the data into a low-dimensional space by finding the closest possible equivalent topological structure.

https://arxiv.org/pdf/1802.03426.pdf
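
To make the "fuzzy topological structure" a little more concrete, the sketch below computes the fuzzy membership strengths that, as described in the UMAP paper, weight the edges from one point to its k nearest neighbours: the nearest neighbour gets weight 1, and a per-point scale sigma is chosen so that the weights sum to log2(k). It is a simplified illustration in plain Python, not the code that EVA or the umap-learn library runs; the function names and the sample distances are assumptions made for this example.

```python
import math


def membership_strengths(knn_distances, iterations=64):
    """Fuzzy edge weights from one point to its k nearest neighbours."""
    k = len(knn_distances)
    rho = min(d for d in knn_distances if d > 0)  # distance to the nearest neighbour
    target = math.log2(k)

    def total(sigma):
        return sum(math.exp(-max(0.0, d - rho) / sigma) for d in knn_distances)

    # total(sigma) grows with sigma, so sigma can be found by bisection.
    lo, hi = 0.0, 1.0
    while total(hi) < target:
        hi *= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if total(mid) < target:
            lo = mid
        else:
            hi = mid
    sigma = (lo + hi) / 2.0
    return [math.exp(-max(0.0, d - rho) / sigma) for d in knn_distances]


def fuzzy_union(a, b):
    """Combine the two directed weights of an edge into one symmetric weight."""
    return a + b - a * b


# Hypothetical distances from one point to its 5 nearest neighbours.
weights = membership_strengths([0.5, 0.8, 1.0, 1.5, 2.0])
```

Subtracting rho is what encodes the local-connectivity assumption: every point is connected to its nearest neighbour with full strength, no matter how far away that neighbour is.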

Adjustable parameters

ISOMAP

http://web.mit.edu/cocosci/Papers/sci_reprint.pdf

Adjustable parameters

k-MAPPER

Info

Kepler-MAPPER (k-MAPPER) is a Python library implementing the MAPPER algorithm from the field of topological data analysis. k-MAPPER uses an embedding created by another dimensionality reduction algorithm (e.g. t-SNE) and passes it to the MAPPER algorithm.

https://github.com/scikit-tda/kepler-mapper

Adjustable parameters

MDS

https://en.wikipedia.org/wiki/Multidimensional_scaling

Adjustable parameters

Extending EVA with a new dimensionality reduction algorithm

The structure of this project allows the dashboard to be extended with new dimensionality reduction algorithms that have custom, controllable parameters. To add a new algorithm, follow these steps:

  1. Extend the EvaData class in application/plotlydash/Dashboard.py with an apply_{name_of_your_alg} method, following the convention of the other apply methods.
  2. Extend the _getgraph and _getdropdowns methods of the DimRedDash class in application/plotlydash/dim_red_dshboards.py to support new plots and callbacks for your new algorithm.
  3. Add the new form options to the VisForm class in forms.py.
  4. Edit the form code in application/templates/home.html to include your new, updated form.
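
The naming convention from step 1 can be sketched as follows. Everything in this snippet is hypothetical: EvaDataSketch is a stand-in and not the real EvaData class, whose attributes and method signatures may differ, and the "firsttwo" reduction is only a placeholder for a real algorithm. It merely illustrates how an apply_{name} method could be looked up from the algorithm name chosen in the form.

```python
class EvaDataSketch:
    """Hypothetical stand-in for EvaData (the real class may look different)."""

    def __init__(self, rows):
        self.rows = rows      # list of numeric feature vectors
        self.embeddings = {}  # algorithm name -> low-dimensional points

    def apply_firsttwo(self, n_components=2):
        """Placeholder 'algorithm': keep only the first n_components features."""
        self.embeddings["firsttwo"] = [row[:n_components] for row in self.rows]

    def apply(self, name, **params):
        """Dispatch to apply_<name>, passing on user-controlled parameters."""
        method = getattr(self, f"apply_{name}", None)
        if method is None:
            raise ValueError(f"unknown algorithm: {name}")
        method(**params)
        return self.embeddings[name]


data = EvaDataSketch([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
result = data.apply("firsttwo", n_components=2)
```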

Results

To demonstrate the results, we selected 3 datasets: FMNIST, MNIST and FishBowl. For each dataset, we then plotted 2D scatter plots for the selected algorithms.

FMNIST

Fashion-MNIST (FMNIST) is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Its authors intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and the same structure of training and testing splits.

https://github.com/zalandoresearch/fashion-mnist

This is how the data looks (each class takes three rows):

fmnist

For the visualisations shown here, we applied the following filters:

PCA

pca

LLE

lle

TSNE

tsne

UMAP

umap

ISOMAP

isomap

In these visualisations, UMAP does the best job of isolating the outliers.

MNIST

The Modified National Institute of Standards and Technology (MNIST) database of handwritten digits has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.

http://yann.lecun.com/exdb/mnist/

Sample image for MNIST dataset:

MNIST

For the visualisations shown here, we applied the following filters:

PCA

pca

LLE

lle

UMAP

umap

In these visualisations, UMAP does the best job of isolating the outliers.

FishBowl

The FishBowl dataset consists of a sphere embedded in 3D whose top cap has been removed. In other words, it is a punctured sphere, which is sparsely sampled at the bottom and densely sampled at the top.
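
A dataset of this shape can be generated in a few lines. The sketch below is not the generator used for the plots in this section; the cap height of 0.75, the number of points and the way the sampling density grows towards the rim are all assumptions chosen for illustration.

```python
import math
import random


def make_fishbowl(n_points=500, cap_height=0.75, seed=0):
    """Sample a unit sphere with its top cap (z > cap_height) removed.

    Heights are drawn so that points get denser towards the top rim.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        # The square root of a uniform number is biased towards 1,
        # which pushes more samples towards the rim of the bowl.
        u = math.sqrt(rng.random())
        z = -1.0 + u * (cap_height + 1.0)
        radius = math.sqrt(max(0.0, 1.0 - z * z))  # radius of the circle at height z
        angle = rng.uniform(0.0, 2.0 * math.pi)
        points.append((radius * math.cos(angle), radius * math.sin(angle), z))
    return points


bowl = make_fishbowl()  # list of (x, y, z) tuples on the punctured sphere
```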

For the visualisations shown here, we applied the following filters:

PCA

pca

LLE

lle

TSNE

tsne

UMAP

umap

ISOMAP

isomap

MDS

mds

From the visualisation, we can say that LLE arguably does the best job of isolating the outliers.

Scalability

When reducing large, high-dimensional datasets to the 2 or 3 dimensions in which we can visualize them, the computational complexity of an algorithm plays a major role in how long it takes to reduce all the data. We mainly worked with the MNIST and Fashion-MNIST datasets. Both contain 60,000 training images of 28x28 pixels each, which corresponds to 784 features per image. The next points give an overview of what we learned from using the app and the different algorithms on these large datasets.

The good news is that, in our experience, one algorithm is about as fast as PCA and able to unfold non-linear data as well as TSNE: the UMAP algorithm, which is also the most recent of them all (2018).
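
To put the dataset sizes mentioned above into perspective, the following back-of-the-envelope calculation compares the memory needed for the raw data with that of a dense matrix of all pairwise distances. It assumes 8-byte floating point values and a dense n x n matrix; whether a given implementation actually materialises such a matrix depends on the library, so the numbers illustrate quadratic growth rather than measure anything.

```python
n_samples = 60_000
n_features = 28 * 28  # 784 pixels per image
bytes_per_value = 8   # assumption: 64-bit floats

# Memory grows linearly with the number of samples for the raw data ...
data_bytes = n_samples * n_features * bytes_per_value
# ... but quadratically for a dense matrix of all pairwise distances.
pairwise_bytes = n_samples * n_samples * bytes_per_value

print(f"features per image: {n_features}")
print(f"raw data: {data_bytes / 1e6:.0f} MB")
print(f"dense pairwise distance matrix: {pairwise_bytes / 1e9:.1f} GB")
```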

Last but not least, we did some research on the scalability of UMAP in comparison to other algorithms. The following two figures are taken from the original UMAP paper from 2018.

scalability

scalability_data