smarsland / pots


Athenian Black-Figure test dataset #22

Open Armand1 opened 4 years ago

Armand1 commented 4 years ago

We are proliferating results on this dataset. Please report them here so that we can keep track of all this work.

Armand1 commented 4 years ago

Description of the dataset.

325 Athenian Black Figure vases divided into 34 "species", designated by numbers, and 17 "genera", designated by a traditional shape name; for example, amphora_1 is species 1, which belongs to the genus amphora. There are 3-20 individual vases per species.

In principle, the vases of each species are more similar in shape to each other than to the vases of other species. In practice, this is not true.

cup_14 and cup_15 are little master band and little master lip cups respectively. It's unclear whether, in fact, they have different shapes.

lekythos_24 and lekythos_25 are a mixture of slender and dumpy lekythoi due to a mistake on my part.

This is what they look like

Athens_BF_vases

Armand1 commented 4 years ago

Data pre-processing

Half-vase (open) contours were derived automatically, mostly using handle-chopping algorithms. For some vases, particularly pelikes and loutrophorai, handles were chopped manually since the algorithms failed to remove them cleanly.

The initial contours were (smoothed?), then fitted with a B-spline, and 70 points were sampled from each fit. These processed open contours were reflected to give full (closed) contours. These 70x2 point closed contours were used in most subsequent analyses.
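
For anyone wanting to reproduce the resample-and-reflect step, here is a minimal sketch. It is not the code that was actually used: it swaps the B-spline fit for plain linear interpolation along arc length (NumPy only), and it assumes the half-contour is stored as (x, y) rows with the axis of symmetry at x = 0. The function names and the toy profile are hypothetical.

```python
import numpy as np

def resample_open_contour(points, n_points=70):
    """Resample an open contour to n_points spaced evenly along its arc length."""
    points = np.asarray(points, dtype=float)
    # Cumulative arc length along the polyline, starting at 0.
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    targets = np.linspace(0.0, s[-1], n_points)
    # Linear interpolation against arc length (a stand-in for the B-spline fit).
    x = np.interp(targets, s, points[:, 0])
    y = np.interp(targets, s, points[:, 1])
    return np.column_stack([x, y])

def reflect_to_closed(half, axis_x=0.0):
    """Mirror a half-contour across the line x = axis_x and append the mirror image."""
    mirrored = half[::-1].copy()
    mirrored[:, 0] = 2.0 * axis_x - mirrored[:, 0]
    return np.vstack([half, mirrored])

# Hypothetical half-profile: x is the distance from the axis, y is the height.
y = np.linspace(0.0, 1.0, 200)
x = 0.3 + 0.2 * np.sin(np.pi * y)
half = resample_open_contour(np.column_stack([x, y]), n_points=70)
closed = reflect_to_closed(half)
print(half.shape, closed.shape)  # (70, 2) (140, 2)
```

The mirrored half is appended in reverse order so that the 140 points trace the outline continuously rather than jumping back to the start.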

This is what they look like: 70pointdataset

Armand1 commented 4 years ago

SRVFs: elastic square-root velocity curves

Contours are converted to SRVFs, and translation, rotation and size differences are removed. [stuff is done? global registration?] A distance matrix is obtained among the SRVFs. Hierarchical cluster analysis (HCA, WardD2) on this distance matrix produces the following. There is a clear distinction between an "amphora" family and a "cup" family, with alabastrons and pyxis being off by themselves. But within these families the grouping by species is poor.

clustering
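
To make the SRVF step concrete, here is a rough sketch of how a distance matrix of this kind can be built. It is a simplification and not Arianna's pipeline: it computes q = f'/sqrt(|f'|), rescales to unit norm, and aligns by rotation only. It does not do the elastic reparameterisation or the start-point search that a full elastic SRVF distance involves. All function names and the toy curves are hypothetical.

```python
import numpy as np

def srvf(curve):
    """Square-root velocity function of a sampled planar curve, scaled to unit norm."""
    curve = np.asarray(curve, dtype=float)
    velocity = np.gradient(curve, axis=0)      # finite differences; translation drops out
    speed = np.linalg.norm(velocity, axis=1)
    q = velocity / np.sqrt(np.maximum(speed, 1e-12))[:, None]
    return q / np.linalg.norm(q)               # unit norm removes size

def rotation_aligned_distance(q1, q2):
    """Arc length between two unit-norm SRVFs after the best rotation of q2 onto q1."""
    # Kabsch-style solution: rotation (det = +1) that maximises the inner product.
    u, _, vt = np.linalg.svd(q2.T @ q1)
    d = np.sign(np.linalg.det(u @ vt))
    rot = u @ np.diag([1.0, d]) @ vt
    inner = np.sum(q1 * (q2 @ rot))
    return float(np.arccos(np.clip(inner, -1.0, 1.0)))

def distance_matrix(curves):
    """Symmetric matrix of pairwise rotation-aligned SRVF distances."""
    qs = [srvf(c) for c in curves]
    n = len(qs)
    dm = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dm[i, j] = dm[j, i] = rotation_aligned_distance(qs[i], qs[j])
    return dm

# Hypothetical check: a moved, rotated, rescaled ellipse should be at distance ~0
# from the original ellipse, while a circle should not be.
t = np.linspace(0.0, 2.0 * np.pi, 140, endpoint=False)
circle = np.column_stack([np.cos(t), np.sin(t)])
ellipse = np.column_stack([2.0 * np.cos(t), np.sin(t)])
theta = 0.7
turn = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
moved_ellipse = 3.0 * ellipse @ turn + np.array([5.0, -2.0])
dm = distance_matrix([circle, ellipse, moved_ellipse])
print(np.round(dm, 3))
```

A matrix like `dm` is what would then be handed to a hierarchical clustering routine.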

Another, less sensitive, way of assaying grouping integrity is to ask whether the closest match of each vase is a congeneric. For the 70x2 point closed dataset the rates range between 60% and 100%. (Pelikes and Kalathos are particularly poorly classified.)

Screen Shot 2020-05-08 at 16 32 49

For genera with multiple species we can do the same but for conspecifics. Here the closest match is a conspecific between 25% and 100% of the time. The greatest confusion is between cup_11 and cup_12 (cup A and Droop cup); oddly, cup_14 and cup_15 (little master lip and band) are well differentiated. lekythos_25 and lekythos_26 also seem reasonably well differentiated (oddly).

Screen Shot 2020-05-08 at 16 24 42
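
The closest-match test is simple enough to sketch. The version below is my reading of it, not the script that produced the tables: for each item, find its nearest other item in the distance matrix and check whether the two share a label, then report the hit rate per label. Passing genus labels gives the congeneric rate; passing species labels gives the conspecific rate. The function name and the toy data are hypothetical.

```python
import numpy as np

def nearest_match_rate(dm, labels):
    """Per-label fraction of items whose nearest other item carries the same label."""
    dm = np.array(dm, dtype=float)            # copy, so the caller's matrix is untouched
    labels = np.asarray(labels)
    np.fill_diagonal(dm, np.inf)              # an item may not match itself
    nearest = np.argmin(dm, axis=1)
    hits = labels[nearest] == labels
    return {str(lab): float(hits[labels == lab].mean()) for lab in np.unique(labels)}

# Hypothetical toy data: five items placed on a line, distances are |a - b|.
positions = np.array([0.0, 1.0, 2.5, 10.0, 11.0])
dm = np.abs(positions[:, None] - positions[None, :])
genus = ["cup", "cup", "amphora", "amphora", "amphora"]
rates = nearest_match_rate(dm, genus)
print(rates)
```

In the toy data the amphora at 2.5 sits closer to a cup than to the other amphorae, so it counts as a miss for its genus.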

Arianna did the same analysis for "original open curves", "70 point open curves", "original closed curves" and "70x2 point closed curves". Of these, the "70x2 point closed curves" dataset worked best.

Armand1 commented 4 years ago

Norman did an eigenshape analysis on the 70x2 point closed curves dataset.

PC-1 vs PC-2

Stephen did the same, for comparison:

Image-H8CEK0

They look pretty similar.
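
For reference, a bare-bones version of this kind of analysis. This is only a sketch under an assumption: it treats the analysis as an ordinary principal component analysis of the flattened (x, y) coordinates, which may differ from the shape function Norman or Stephen actually decomposed. The function name and the synthetic shapes are hypothetical.

```python
import numpy as np

def pca_scores(shapes, n_components=2):
    """Project flattened shapes onto their leading principal components via SVD."""
    X = np.asarray(shapes, dtype=float)
    X = X.reshape(X.shape[0], -1)              # one row of coordinates per shape
    Xc = X - X.mean(axis=0)                    # centre on the mean shape
    u, s, vt = np.linalg.svd(Xc, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)        # share of variance per component
    return scores, explained[:n_components]

# Hypothetical dataset: 30 noisy ellipses of 140 points that differ mainly in width.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 140, endpoint=False)
widths = rng.uniform(0.5, 2.0, size=30)
shapes = np.stack([np.column_stack([w * np.cos(t), np.sin(t)]) for w in widths])
shapes += rng.normal(scale=0.01, size=shapes.shape)
scores, explained = pca_scores(shapes)
print(scores.shape, np.round(explained, 3))
```

Because the sign of each component is arbitrary, two runs on the same data can produce mirror-image plots, which is worth remembering when comparing PC-1 vs PC-2 plots from different people.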

Armand1 commented 4 years ago

Norman did a Canonical Variates Analysis on the 70x2 point closed curves dataset. This is an old kind of supervised ML that attempts to find derived variables that maximize the difference among a priori defined groups.

CV-1 vs CV-2 vs CV-3

He then jackknifed in order to test the robustness of species assignments. This is a plot of his confusion matrix. It's clear that he can assign nearly all species with high accuracy (most are 100%). Some are confused: cup_11, cup_12, cup_13 and cup_14 are all more or less confused with each other, as are cup_17 and cup_18. Pelikes are confused, to some degree, with amphora_9. There is also some confusion between some amphora classes.

confustionmatrix
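
A sketch of how a jackknifed confusion matrix of this sort can be computed. It is not Norman's code: it assumes leave-one-out resampling and uses the classification rule that underlies CVA / linear discriminant analysis with equal priors, namely assigning the held-out item to the class whose mean is nearest in Mahalanobis distance under the pooled within-class covariance. It also assumes low-dimensional features (for example, a few PC scores) and at least two members per class. Names and data are hypothetical.

```python
import numpy as np

def loo_confusion(X, labels):
    """Leave-one-out confusion matrix; rows are true classes, columns are assigned."""
    X = np.asarray(X, dtype=float)
    classes, codes = np.unique(labels, return_inverse=True)
    k = len(classes)
    confusion = np.zeros((k, k), dtype=int)
    for i in range(len(X)):
        keep = np.arange(len(X)) != i          # hold out item i
        Xt, yt = X[keep], codes[keep]
        means = np.array([Xt[yt == c].mean(axis=0) for c in range(k)])
        resid = Xt - means[yt]
        pooled = resid.T @ resid / (len(Xt) - k)   # pooled within-class covariance
        precision = np.linalg.pinv(pooled)
        diff = X[i] - means
        d2 = np.einsum("kd,de,ke->k", diff, precision, diff)  # squared Mahalanobis
        confusion[codes[i], np.argmin(d2)] += 1
    return classes, confusion

# Hypothetical data: three well-separated groups of ten items in two dimensions.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centres])
labels = np.repeat(["amphora_1", "cup_11", "lekythos_24"], 10)
classes, confusion = loo_confusion(X, labels)
print(classes)
print(confusion)
```

Dividing each row by its sum gives the per-species assignment accuracy on the diagonal.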

Armand1 commented 4 years ago

Embedding SRVF distances into linear space

Stephen "embedded" Arianna's SRVF distances into linear space in several ways. This was to test their use in phylogeny. The methods are: 2d embedding; 3d embedding (I think these are Multidimensional Scaling); "currents"; and "monomial".

Here are some plots of those:

2d embedding MDS Closed_embedding_2D

3d embedding MDS

Closed_embedding_3dxy Closed_embedding_3dyz

currents currents

monomial monomial

I do not like currents and monomial, since they place very different shapes close to each other. We will instead work with the 2d and 3d MDS embeddings of the SRVF distances.
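
For completeness, a minimal sketch of what an MDS embedding of a distance matrix involves. This assumes classical (Torgerson) MDS, which is my guess at what Stephen used rather than something confirmed here; the function name and test points are hypothetical.

```python
import numpy as np

def classical_mds(dm, n_components=2):
    """Embed a distance matrix into n_components dimensions by classical MDS."""
    dm = np.asarray(dm, dtype=float)
    n = dm.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dm ** 2) @ centering   # double-centred squared distances
    eigval, eigvec = np.linalg.eigh(gram)
    order = np.argsort(eigval)[::-1][:n_components]   # largest eigenvalues first
    lam = np.clip(eigval[order], 0.0, None)           # negative eigenvalues are dropped
    return eigvec[:, order] * np.sqrt(lam)

# Hypothetical check: distances between planar points should be recovered in 2d.
rng = np.random.default_rng(1)
points = rng.normal(size=(8, 2))
dm = np.linalg.norm(points[:, None] - points[None, :], axis=2)
embedding = classical_mds(dm, n_components=2)
recovered = np.linalg.norm(embedding[:, None] - embedding[None, :], axis=2)
print(np.allclose(recovered, dm))
```

SRVF distances are not Euclidean, so for the real data the Gram matrix will generally have some negative eigenvalues, and a 2d or 3d embedding only approximates the original distances.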