joefaith / targeted-projection-pursuit

Automatically exported from code.google.com/p/targeted-projection-pursuit

Targeted Projection Pursuit

Targeted Projection Pursuit (TPP) is a technique for visualising and exploring high-dimensional data, with applications in exploratory data analysis and feature selection.

Quick Start

Why Visualise Your Data?

Why bother visualising our data? After all, isn’t the whole point of machine learning to get machines to do the data analysis for us?

This classic example from the statistics literature, known as Anscombe's Quartet, shows why visualisation is important. It shows four data sets with nearly identical summary statistics: the same means and variances in each dimension, the same correlation, and the same fitted regression line. But when you graph them you can immediately see that they should be treated very differently, and that only two of them should be modelled with a linear regression at all.
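The shared statistics are easy to check numerically. The sketch below uses the published Anscombe values and plain Python; the helper name `summarise` is made up for this illustration and is not part of TPP.

```python
# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(xs, ys):
    """Return the usual summary statistics and least-squares line for one data set."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": sxx / (n - 1),      # sample variance
        "var_y": syy / (n - 1),
        "slope": slope,
        "intercept": my - slope * mx,
        "r": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
    }


for name, (xs, ys) in quartet.items():
    stats = summarise(xs, ys)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four rows come out essentially the same (means of about 9 and 7.5, a fitted line of roughly y = 3 + 0.5x, correlation near 0.82), even though the scatter plots look nothing alike.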

Similar issues arise in machine learning tasks such as classification. Suppose we had a classification problem and our model was consistently generalising at 90% accuracy. Is this good? What should we do next? It depends on the data.

Here are four binary classification problems (stars versus circles), with the performance of four classifiers shown: correctly classified examples are solid, and misclassified examples are unfilled. We would do something different in each of these four cases.
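To make the point concrete without a plot, here is a small sketch with two invented test sets (they are not the four cases pictured). Both give 90% accuracy, but the per-class breakdown tells two very different stories. The function name `accuracy_and_recall` and the data are assumptions made up for illustration.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return overall accuracy and the recall for each class label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    recall = {}
    for label in sorted(set(y_true)):
        # Predictions made for the examples that truly belong to this class.
        preds = [p for t, p in zip(y_true, y_pred) if t == label]
        recall[label] = sum(p == label for p in preds) / len(preds)
    return correct / len(y_true), recall


# Hypothetical case A: balanced classes, errors spread evenly over both.
true_a = ["star"] * 50 + ["circle"] * 50
pred_a = ["star"] * 45 + ["circle"] * 5 + ["circle"] * 45 + ["star"] * 5

# Hypothetical case B: 90/10 imbalance, classifier always predicts the majority class.
true_b = ["star"] * 90 + ["circle"] * 10
pred_b = ["star"] * 100

acc_a, recall_a = accuracy_and_recall(true_a, pred_a)
acc_b, recall_b = accuracy_and_recall(true_b, pred_b)
print("A:", acc_a, recall_a)
print("B:", acc_b, recall_b)
```

In case A the classifier is reasonably good at both classes; in case B it never finds a single circle. The headline accuracy alone cannot distinguish them, which is the kind of difference a picture of the data makes obvious.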

Why Targeted Projection Pursuit?

Visualising simple 2D cases is easy enough, but most of the data we deal with is far more complex, especially when dealing with latent representations. We need ways of visualising complex data in just two screen dimensions, and there are many ways to reduce the dimensionality of complex data so it can be visualised. The simplest is just to pick two or three of the dimensions and ignore the rest, but that makes it very hard to see what's going on in the data as a whole.

Other algorithms, such as projection onto principal components or t-SNE, try to squeeze many dimensions into just two or three, but this always results in the loss of potentially important information. No single view will show us everything we need. We need ways of 'rotating' the data, so we can see what it looks like from various angles.

Visualisations that use linear projections, such as PCA, have an advantage in that the 'angle' of the projection itself can provide useful information about the data, such as which dimensions are more important in classification.

Targeted Projection Pursuit is the higher-dimensional equivalent of rotating an object to explore it. The data is initially shown projected onto the first two principal components (X = PC1, Y = PC2), but you can then rotate the data to see it from other angles. You can do this by dragging and dropping specific axes, or by grabbing and dragging the data itself. The TPP algorithm will then try to find an angle of the data that matches your actions most closely. The easiest way to see this is to try it yourself!
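As a rough illustration of "finding an angle that matches your actions", the sketch below treats a drag as a set of target screen positions and fits a linear projection to them by gradient descent on the squared error. This is a simplified stand-in written for this README, not the Java/Weka implementation in this repository; the function names, the toy cube data, and the choice of a least-squares objective are all assumptions.

```python
from itertools import product


def project(X, P):
    """Project n x d data X through a d x 2 matrix P, giving n screen points."""
    return [[sum(x[k] * P[k][j] for k in range(len(P))) for j in range(2)] for x in X]


def fit_projection(X, target, P, lr=0.1, steps=200):
    """Gradient descent on the mean squared distance between the view and the target."""
    n, d = len(X), len(P)
    P = [row[:] for row in P]  # work on a copy of the starting projection
    for _ in range(steps):
        view = project(X, P)
        for k in range(d):
            for j in range(2):
                grad = 2 / n * sum(X[i][k] * (view[i][j] - target[i][j]) for i in range(n))
                P[k][j] -= lr * grad
    return P


# Toy data: the 8 corners of a cube; the class depends only on the third feature.
X = [list(corner) for corner in product([-1.0, 1.0], repeat=3)]
labels = [1 if x[2] > 0 else -1 for x in X]

# Starting view: the first two features (standing in for the initial PCA view).
P0 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

# Simulated user action: drag the two classes apart horizontally, heights unchanged.
start = project(X, P0)
target = [[float(label), y] for label, (_, y) in zip(labels, start)]

P = fit_projection(X, target, P0)
for feature, weights in enumerate(P):
    print("feature", feature, [round(w, 3) for w in weights])
```

On this toy data the fitted horizontal axis ends up weighted almost entirely on the third feature, the one that actually determines the class. That is the sense in which the angle of a linear projection can tell you which dimensions matter.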

Background

TPP was originally developed to visualise gene expression data, to help clinicians diagnose early-stage cancers. The code is mostly old (more than 10 years) Java, built on top of the Weka machine learning package, and some features have succumbed to bit rot, but if it's useful then I'll progressively resurrect it. Let me know. There's more information, including links to relevant papers, here: https://en.wikipedia.org/wiki/Targeted_projection_pursuit