smarsland / pots


Meeting Notes and ToDos 12th May #23

Open smarsland opened 4 years ago

smarsland commented 4 years ago

\documentclass{article} \usepackage{SRM2}

\title{Morphometrics of Greek Vases}

\begin{document} \maketitle

Notes from a Skype meeting between Armand Leroi, Norman MacLeod, Arianna Salili, Benjamin Tenmann, and Stephen Marsland on Tuesday 12th May (NZ time). The purpose of these notes is to summarise what we are up to, and what is meant to happen next.

\section{Current Situation}

We have images of 325 pots, which Armand has put into 34 classes. A contour was extracted from one side of each pot image, ignoring any handles, by a combination of Arianna's code and Armand's corrections.

The 34 classes are Armand's classification. I'm not clear whether he used the shape only, or the whole image. Some classes might not be distinguishable by shape alone: classes 25 and 26, 14 and 15, and 29 and 30 seem the hardest pairs to separate.

Based on the one-sided curves, we have resampled them to 70 points evenly spaced by arc length, using linear interpolation. Linear interpolation did as well as a B-spline, and 70 points meant that there was virtually no visual difference between the interpolated curve and the original. The contours were closed by reflecting them and removing the two repeated endpoints, so there are 138 evenly spaced points on each closed curve (this resampling and closing is sketched below). They were then Procrustes aligned and scaled to lie in $[-1,1] \times [-1,1]$. Not all of the methods we consider below need Procrustes alignment, but it's better to be consistent. Hence, for the original set of pots to use, we have:

\texttt{GoodPots\_closed\_procrustes.csv}, which has 325 lines, one for each pot, laid out as $x_1, y_1, x_2, y_2, \ldots, x_{138}, y_{138}$. That's the only dataset in town. There is an index to go with this file, \texttt{PotIndex.txt}, which contains the metadata (unique ID, genus as text, species as number, original csv file).
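For reference, here is a minimal sketch of the resampling, closing, and loading steps described above. It assumes each extracted contour is an $(N, 2)$ array of $(x, y)$ points with the profile's axis of symmetry on the $y$-axis, and that the csv has no header line; the function names are placeholders, not names from the existing code.

\begin{verbatim}
import numpy as np

def resample_arclength(curve, n_points=70):
    """Resample an open curve to n_points evenly spaced by arc
    length, using linear interpolation."""
    seg = np.linalg.norm(np.diff(curve, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])   # cumulative arc length
    s_new = np.linspace(0.0, s[-1], n_points)     # evenly spaced targets
    return np.column_stack([np.interp(s_new, s, curve[:, 0]),
                            np.interp(s_new, s, curve[:, 1])])

def close_by_reflection(half):
    """Reflect the one-sided profile about the y-axis and drop the
    two duplicated endpoints: 70 points become 138 on the closed curve."""
    mirror = half[::-1].copy()
    mirror[:, 0] *= -1.0          # reflect x about the symmetry axis
    return np.vstack([half, mirror[1:-1]])

# Load the aligned dataset; each row is x1, y1, ..., x138, y138.
flat = np.loadtxt("GoodPots_closed_procrustes.csv", delimiter=",")
curves = flat.reshape(-1, 138, 2)   # (325 pots, 138 points, 2 coords)
\end{verbatim}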

We want to select a set of methods that will eventually enable us to classify thousands of pots, and also to construct a phylogenetic tree of them. There was some discussion about whether supervised or unsupervised methods were better. We resolved to try both, in the sense that something like $k$-means isn't really supervised, but a neural network definitely is.

\section{Morphometric Methods}

We have an embarrassment of methods, in four programming languages. I'm going to try to summarise each here. All of these methods are unsupervised in the sense that they know nothing about the classes.

\begin{description}
\item[Eigenshapes] Also known by a myriad of other names. It is the SVD of the covariance of the landmark coordinates, and gives back a transformation of the space with basis vectors ordered by variance, so you can do principled dimensionality reduction (a sketch follows this list). It assumes the landmarks are in correspondence, although that seems to matter less than I expected. Cheap to compute.
\item[SRVF] The square-root velocity function transforms curves to lie in a nicer space. Diffeomorphic registration happens there, so we get pairwise distances between shapes. The registration is computationally expensive.
\item[Diffeomorphic Curve Matching (current metric)] This corresponds to an $H^1$ metric on the curve shape, but gives an actual embedding using PCA. Reasonably cheap to compute.
\item[Diffeomorphic Curve Matching ($H^2$ curve metric)] Some old code I had in Matlab. Gives pairwise distances. Computationally expensive.
\item[LDDMM registration] The standard workhorse for these things. I'm using the implementation in deformetrica, although I don't like it so much. But it's fairly fast. It produces distances to a template, but not all pairwise distances without some annoying scripting.
\end{description}
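As a concrete illustration of the eigenshape computation (the other methods are too involved to sketch here), assuming \texttt{curves} is the $(325, 138, 2)$ array from the earlier sketch: the SVD of the mean-centred data matrix is equivalent to an eigendecomposition of the covariance of the landmark coordinates.

\begin{verbatim}
import numpy as np

def eigenshapes(curves, n_modes=10):
    """Eigenshape analysis: PCA of the flattened landmark coordinates,
    computed via the SVD of the mean-centred data matrix."""
    X = curves.reshape(len(curves), -1)     # (n_pots, 276)
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    variances = S**2 / (len(curves) - 1)    # variance per mode, descending
    scores = (X - mean) @ Vt[:n_modes].T    # coordinates in the new basis
    return mean, Vt[:n_modes], variances[:n_modes], scores
\end{verbatim}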

The diffeomorphic methods need the Karcher mean, and then use Principal Geodesic Analysis (or tangent PCA) to construct an equivalent of an eigenshape model.

\section{Embeddings}

For the methods that give us distances, we need to embed them into Euclidean space. There are three methods of doing this that can easily be played with. They all share the same issue: you need to choose the number of dimensions, just as with PCA. A sketch of all three follows the list.
\begin{description}
\item[Multi-dimensional scaling (MDS)] The standard linear approach. The stress is effectively the mismatch that remains, which is partly a function of not having enough dimensions.
\item[Isomap] A nonlinear form of MDS. The distance matrix is reduced to nearest neighbours, shortest paths through the resulting graph are used to fill in the remaining distances, and the completed matrix is embedded using MDS. So it effectively fits local patches to the data and strings them together.
\item[Spectral embedding] Uses Laplacian eigenmaps to perform the embedding. In other words, it finds similar points, connects them in a graph weighted by a heat kernel, and embeds that graph using the eigenvectors of the graph Laplacian.
\end{description}
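Here is a sketch of all three, starting from a precomputed pairwise distance matrix \texttt{D} and using scikit-learn; the neighbourhood size and heat-kernel bandwidth below are illustrative guesses, not tuned values.

\begin{verbatim}
import numpy as np
from sklearn.manifold import MDS, Isomap, SpectralEmbedding

def embed_all(D, n_dims=2, n_neighbours=10):
    """Embed an (n, n) distance matrix D with MDS, Isomap, and
    spectral embedding."""
    mds = MDS(n_components=n_dims, dissimilarity="precomputed",
              random_state=0).fit_transform(D)
    iso = Isomap(n_components=n_dims, n_neighbors=n_neighbours,
                 metric="precomputed").fit_transform(D)
    # Spectral embedding wants similarities, not distances, so turn D
    # into affinities with a heat kernel (bandwidth: median distance).
    W = np.exp(-D**2 / (2 * np.median(D)**2))
    spec = SpectralEmbedding(n_components=n_dims,
                             affinity="precomputed").fit_transform(W)
    return mds, iso, spec
\end{verbatim}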

All of these methods are unsupervised: we are just finding an embedding that preserves the distance matrix as far as possible.

There is the option to combine the classification and embedding, which might be worth thinking about.

We should think about whether BIC would be an appropriate measure to compare things at this point.

\section{Classification}

There are two real axes along which to distinguish things here: first, whether the methods are supervised or unsupervised; and second, what they take as input.

For the supervised methods, we are looking for some form of multi-class classification. Norman has used CVA, which can be seen as combining eigenshapes with the class labels to push classes apart. An alternative would be to use LDA (or QDA) on the embedding. Other options are very standard: use the embedding data points as they are as input to a (small) neural network, or to a support vector machine.
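A minimal sketch of the LDA-on-embedding option, assuming \texttt{emb} is an $(n, d)$ embedding and \texttt{labels} holds Armand's 34 classes (both hypothetical variable names):

\begin{verbatim}
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

lda = LinearDiscriminantAnalysis()
accuracy = cross_val_score(lda, emb, labels, cv=5)  # 5-fold accuracy
print(accuracy.mean(), accuracy.std())
\end{verbatim}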

For unsupervised learning, i.e., clustering, either $k$-means or a hierarchical clustering would be as good as anything else I presume.
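The corresponding unsupervised sketch, with both clusterings compared to Armand's classes via the adjusted Rand index (one reasonable choice of score among several):

\begin{verbatim}
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

km = KMeans(n_clusters=34, n_init=10,
            random_state=0).fit_predict(emb)
hc = AgglomerativeClustering(n_clusters=34,
                             linkage="ward").fit_predict(emb)
print(adjusted_rand_score(labels, km),
      adjusted_rand_score(labels, hc))
\end{verbatim}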

\section{Random Walks}

There is also random-walk data. Armand made this as a random walk in linear PCA space, as I understand it. The idea is that the methods should reflect the distance along the random walk. Running this at every point along the random walks is going to take a lot of computer time.

\section{Phylogenetics}

Ideally the embedded data can be used as input to a Bayesian phylogenetics package (RevBayes seems to be the one of choice) to see whether we can identify a phylogeny of pots.

\section{Future Data}

We can view the data that we have now (the 325 pots) as training data. Armand is going to collect another set, ideally of about 500 pots, from the same set of classes, which we will resample and Procrustes align. Then he will send them to some pot experts and ask them to classify them. This will give target labels, which will (a) be fed back into the training data so that we have consistent classes there, and (b) then be used to test the methods we decide to use. There are a few issues to sort out here, primarily how to deal with a lack of consensus among the experts.

\section{Tasks}

\begin{enumerate}
\item Get all code into a state where it is debugged and relatively easy to run. (SM, NM, AS)
\item Choose the methods that are successful enough for further use. (SM, NM)
\item Prepare a new dataset of curves ready for expert classification. (AL, AS)
\item Work with RevBayes, etc., so that phylogenetic trees can be constructed. (BT)
\end{enumerate}

\end{document}

Armand1 commented 4 years ago

Thanks Stephen --- for the RW vases, why don't you sample every 25 time steps out of the 1000? You just need enough to get the shape of the curve.