Targeted Projection Pursuit

Targeted Projection Pursuit (TPP) is a technique for exploratory data analysis, visualisation and feature selection on high-dimensional data.

Quick Start

Check whether you already have Java installed by running java -version. If not, install it from java.com.
Download the executable jar file: https://github.com/joefaith/targeted-projection-pursuit/blob/master/dist/TPP.jar
Alternatively, clone the repo and build it yourself. You will need Apache Ant to build it. VS Code has a very nice extension for Ant: https://marketplace.visualstudio.com/items?itemName=nickheap.vscode-ant
Double-click the jar to run it, or run it from the command line with java -jar TPP.jar
Load a data set from a CSV file (including a header row) with File > Load CSV File.
The data is initially shown projected onto the first two principal components (X = PC1, Y = PC2). You can then:
- Drag data points or axes to explore the data and find more useful views
- Select which attributes to color or size the points by
- See the effect of clustering or common classification algorithms
- Remove attributes to see which attributes are most significant
- Tip: use the right button to resize the data to fit your window.
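
If you do not have a suitable file to hand, a small CSV with a header row can be generated as below. This is only a sketch: the column names, the values and the file name tpp_example.csv are made up for illustration, and nothing here says whether TPP expects a particular column order or how it treats text columns.

```python
import csv
import os
import tempfile

# Hypothetical example data: the first row is the header row,
# every following row is one data point.
rows = [
    ["height_cm", "weight_kg", "group"],
    [170.0, 65.0, "a"],
    [182.5, 80.2, "b"],
    [158.0, 52.3, "a"],
]

# Write the file to the system temp directory (the location is an arbitrary choice).
path = os.path.join(tempfile.gettempdir(), "tpp_example.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

print("Wrote", path)
```

The resulting file can then be opened with File > Load CSV File.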

Why Visualise Your Data?

Why bother visualising our data? After all, isn’t the whole point of machine learning to get machines to do the data analysis for us?

This classic example from the statistics literature, known as Anscombe's Quartet, shows why visualisation is important. It shows four data sets with almost identical aggregate statistics: the same means and standard deviations in each dimension, and nearly the same linear regression line and goodness of fit. But when you graph them you can immediately see they should be treated very differently, and only two of them should even be modelled with a linear regression at all.
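
You can check the numbers for yourself. The sketch below hard-codes the quartet's published values and computes the mean of each variable and the least-squares line for each data set. The helper names (mean, summarise) are just illustrative choices, not part of TPP.

```python
# Anscombe's Quartet: four data sets of 11 (x, y) points each.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets I, II and III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def summarise(xs, ys):
    """Return (mean x, mean y, slope, intercept) of the least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return mx, my, slope, intercept


summaries = {name: summarise(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (mx, my, slope, intercept) in summaries.items():
    print(f"{name}: mean x = {mx:.2f}, mean y = {my:.2f}, "
          f"fit: y = {intercept:.2f} + {slope:.2f}x")
```

All four rows agree to two decimal places, which is exactly why the summary statistics alone tell you nothing about how different the data sets look.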

Similar issues arise in machine learning tasks such as classification. Suppose we had a classification problem and our model was consistently generalising at 90% accuracy. Is this good? What should we do? It depends on the data.

Here are four binary classification problems (stars and circles), with the performance of four classifiers shown: the examples where the classifications are correctly inferred are solid, the incorrect examples are unfilled. We would do something different in each of these four cases.

Example A: The classifier should be getting these right. Looks like a bug in our algorithm.
Example B: Looks like the error cases were mis-labelled. Better check the labelling process.
Example C: These classes aren’t clearly separable. 90% generalisation is probably the best we could get. Any further training could lead to overfitting.
Example D: The error cases look like outliers. We should detect these and flag up to a human that our confidence in the classification is low in these cases.

Why Targeted Projection Pursuit?

Visualising simple 2D cases is easy enough, but most of the data we deal with is far more complex – especially when dealing with latent representations. We need ways of visualising complex data in just two screen dimensions, and there are many ways to reduce the dimensionality of complex data so it can be visualised. The simplest is just to pick two or three of the dimensions and ignore the rest, but it's very hard to see what's going on.

Other algorithms, such as projection onto principal components or t-SNE, try to squeeze multiple dimensions into just two or three, but this always results in the loss of potentially important information. No single view will show us everything we need. We need ways of 'rotating' the data, so we can see what it looks like from various angles.

Visualisations that use linear projections – such as PCA – have an advantage in that the 'angle' of the projection itself can provide useful information about the data, such as which dimensions are more important in classification. Targeted Projection Pursuit is the higher-dimensional equivalent of rotating an object to explore it. The data is initially shown projected onto the first two principal components (X = PC1, Y = PC2), but you can then rotate the data to see it from other angles. You can do this by dragging and dropping specific axes, or by grabbing and dragging the data itself. The TPP algorithm will then try to find an angle of the data that matches your actions most closely. The easiest way to see this is to try it yourself!
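
To make the idea of an 'angle' concrete, here is a minimal sketch of a linear projection in plain Python. It is not TPP's code or its search algorithm: the project function, the toy data and the two hand-picked views are all assumptions for illustration. Each attribute is given a 2D screen direction (its axis), and a point is drawn at the sum of its attribute values times those directions. Changing the axes changes the view.

```python
def project(rows, axes):
    """Linear projection of n-dimensional rows onto the 2D screen.

    axes[j] is the (x, y) screen direction assigned to attribute j,
    so each point lands at the weighted sum of the attribute axes.
    """
    return [
        (sum(v * ax[0] for v, ax in zip(row, axes)),
         sum(v * ax[1] for v, ax in zip(row, axes)))
        for row in rows
    ]


# Toy data with three attributes; the two classes differ only in the third.
rows = [
    [1.0, 2.0, 0.0],
    [2.0, 1.0, 0.1],
    [1.5, 1.5, 5.0],
    [1.2, 1.8, 5.2],
]
labels = ["a", "a", "b", "b"]

# View A: attribute 0 on X, attribute 1 on Y, attribute 2 ignored.
view_a = project(rows, [(1, 0), (0, 1), (0, 0)])

# View B: the axis of attribute 2 has been 'dragged' onto X instead.
view_b = project(rows, [(0, 0), (0, 1), (1, 0)])

for name, view in (("A", view_a), ("B", view_b)):
    print(f"View {name}:")
    for label, (x, y) in zip(labels, view):
        print(f"  class {label}: ({x:.1f}, {y:.1f})")
```

In view A the two classes overlap, while in view B they are cleanly separated along X. The axes that produce the separating view also tell you which attribute matters, which is the kind of information a linear projection keeps and a non-linear embedding does not.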

Background

TPP was originally developed to visualise gene expression data, to help clinicians diagnose early-stage cancers. The code is mostly old (more than 10 years) Java, built on top of the Weka machine learning package, and some features have succumbed to bit rot. If it’s useful then I’ll progressively resurrect it; let me know. There's more information here, including links to relevant papers: https://en.wikipedia.org/wiki/Targeted_projection_pursuit
