getting distances

Armand1 commented 4 years ago

I am computing Euclidean distances on a subset of the 821 vases to see if I can cluster them into smaller groups. I have done this before, but this time, because I am dealing with a real heterogeneity of shapes, I noticed something disturbing.

I standardize my vases by size and position: a kind of pseudo-registration. For each vase, I have a bunch of y values: the particular y values differ between the vases (since they were originally from very different image sizes). But to make a distance matrix you need to compare the same variables (y values) across your vases. To do that I bin the y values. But that doesn't quite work because there are big "gaps" in the y values. See below:

[image: vasepoints]

The points are the actual y values; the lines connect them. You can see that, for many vases, there are long stretches where there is a line but no point. That's where you chopped off a handle or something and just drew a line. That's fine, but when I bin the y values I get a string of NAs there, and the distance matrix function does not like that. I am not sure how to fix it: interpolate?
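To make the interpolation idea concrete, here is a minimal sketch, not the code actually used in this project. It assumes each outline is stored as a width `xs` at each height `ys` (hypothetical names), that `ys` is strictly increasing, and that all outlines have already been standardized to the same range. Each outline is linearly interpolated onto one shared grid, so there are no empty bins and a plain Euclidean distance can be computed.

```python
import math

def resample(ys, xs, grid):
    """Linearly interpolate the outline (ys, xs) onto the shared grid.

    Assumes ys is strictly increasing, grid is increasing, and every
    grid value lies inside [ys[0], ys[-1]].
    """
    if grid[0] < ys[0] or grid[-1] > ys[-1]:
        raise ValueError("grid must lie inside the range of ys")
    out = []
    j = 0  # index of the left end of the current segment
    for g in grid:
        # advance to the segment [ys[j], ys[j + 1]] that contains g
        while j < len(ys) - 2 and ys[j + 1] < g:
            j += 1
        t = (g - ys[j]) / (ys[j + 1] - ys[j])
        out.append(xs[j] + t * (xs[j + 1] - xs[j]))
    return out

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Made-up toy outlines: vase_b has a long gap with no points at all.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
vase_a = resample([0.0, 0.5, 1.0], [1.0, 2.0, 1.0], grid)
vase_b = resample([0.0, 1.0], [1.0, 1.0], grid)
d = euclidean(vase_a, vase_b)
```

Linear interpolation simply reproduces the straight line drawn across each gap, so it fills the NAs but adds no shape information there.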

How do you deal with this when estimating your distances? Or does the problem simply not arise since you're working with SRVs or whatever?

Armand1 commented 4 years ago

The answer is: use a smoother. Here is my crack at one (gam, bs="cr", k=20). If k (the basis dimension, which caps the wiggliness) is set too high you get unwanted curves in the gaps; too low and you lose the details around the lip of the vase. k=20 seems like a reasonable compromise. Here the coloured points are the original data and the red line is the smoother.

[image: athenianwithsmoothers]

Armand1 commented 4 years ago

So, I want to reduce the number of individual vases by averaging them. But which ones should we average? Clearly we need to cluster in some fashion. I focused on 93 Athenian BF and RF vases. I tried making an NJ tree, getting the bootstraps, and collapsing them into clades. That kinda works, but intrinsically leaves too many singletons. I tried Gaussian mixture modelling (on the raw data and after PCA) and that failed to give sensible clusters. I then searched for a k-means solution: the best is k=2, but there's also a good solution at k=30. That's more like it. All this is based on Euclidean distance and the smoothers (see above).
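For reference, the k-means search can be sketched as below. This is a bare-bones Lloyd's algorithm in plain Python, not the routine actually used here. The deterministic initialization (first k points) and the use of the within-cluster sum of squares to compare values of k are assumptions made for illustration; how the k=2 and k=30 solutions were scored is not stated above.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans(points, k, iters=100):
    """Lloyd's algorithm; returns (labels, within-cluster sum of squares)."""
    # Assumption: initialize the centres with the first k points.
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # assignment step: nearest centre for every point
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # update step: each centre becomes the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, wss

# Made-up toy data with two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels_2, wss_2 = kmeans(points, 2)
labels_1, wss_1 = kmeans(points, 1)
```

The within-cluster sum of squares always shrinks as k grows, so it has to be judged by how quickly it drops rather than by its minimum.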

Here, I have asked whether the clusters have a single shape as determined from the metadata. The blue ones do; the orange ones don't. They work pretty well: the orange clusters nearly always suck up closely related vases (e.g., pelikes and amphoras). The kraters mostly separate nicely.
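The single-shape check amounts to the following sketch. The function names and the toy cluster assignments are hypothetical; only the idea of comparing each cluster against the shape recorded in the metadata comes from the comment above.

```python
from collections import defaultdict

def shapes_by_cluster(cluster_ids, shapes):
    """Collect the set of metadata shapes found in each cluster."""
    groups = defaultdict(set)
    for cluster, shape in zip(cluster_ids, shapes):
        groups[cluster].add(shape)
    return dict(groups)

def single_shape_clusters(cluster_ids, shapes):
    """Map each cluster to True if all of its vases share one shape."""
    return {
        cluster: len(found) == 1
        for cluster, found in shapes_by_cluster(cluster_ids, shapes).items()
    }

# Made-up example: cluster 2 mixes pelikes and amphoras.
cluster_ids = [1, 1, 2, 2, 3]
shapes = ["krater", "krater", "pelike", "amphora", "krater"]
pure = single_shape_clusters(cluster_ids, shapes)
```

Clusters mapped to `True` would be the blue ones and those mapped to `False` the orange ones.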

There's clearly room for improvement, for example through a better distance metric or a better clustering method. I am not so sure that off-the-shelf clustering methods really are what we need.

Anyway, my idea is that ultimately we will cluster our vases by shape, then for each vase combine its cluster with its fabric and date, so that a given vase might belong to "athenian:-525:cluster7". Vases sharing a designation are averaged by taking their Karcher mean, and those means become the unit of analysis. This should reduce the number of taxa by perhaps a third, maybe more.
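As a rough sketch of that bookkeeping: build the composite label for each vase, group by it, and average within each group. The tuple layout and function names are hypothetical. The average used here is the ordinary arithmetic mean of resampled profiles, which is what the Karcher mean reduces to in a plain Euclidean space; a Karcher mean in a proper shape space would replace that one line.

```python
from collections import defaultdict

def unit_label(fabric, date, cluster):
    """Composite designation such as 'athenian:-525:cluster7'."""
    return f"{fabric}:{date}:cluster{cluster}"

def group_means(vases):
    """Average the profiles of all vases that share a designation.

    vases: iterable of (fabric, date, cluster, profile) tuples, where
    every profile has the same length.
    """
    groups = defaultdict(list)
    for fabric, date, cluster, profile in vases:
        groups[unit_label(fabric, date, cluster)].append(profile)
    # Stand-in for the Karcher mean: pointwise arithmetic mean.
    return {
        label: [sum(col) / len(col) for col in zip(*profiles)]
        for label, profiles in groups.items()
    }

# Made-up profiles: the first two vases share a designation.
vases = [
    ("athenian", -525, 7, [1.0, 2.0]),
    ("athenian", -525, 7, [3.0, 4.0]),
    ("athenian", -500, 7, [5.0, 6.0]),
]
units = group_means(vases)
```

In this toy case three vases collapse to two units of analysis.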

smarsland / pots

getting distances #15