smarsland / pots

clustering methods #20

Open Armand1 opened 4 years ago

Armand1 commented 4 years ago

I want to compare the various shape analysis methods. But to do so I want them to (1) have the same data going in and (2) the same evaluation method going out. Here I am concerned with the second step.

I have various shape analysis methods. One of them is Norman MacLeod's "eigenshape" method, which ultimately describes a set of vases as a bunch of principal components. I can cluster the Mantamados vases (25, in three different size-shape classes) on those PC eigenscores.

There are many clustering techniques, but one that works very well is a Gaussian mixture model implemented in Mclust. If I use that, specify k=3, and the best orientation of the data, I find that it divides the vases perfectly into the correct three classes:

GMM: [image: GMMclusters_Mantamados_eigenscores]

So that's great! But there is a problem: Mclust takes observational data (e.g., PC scores), not a distance matrix. That's because it's a GMM: it actually models the distribution of the data. So I cannot use it on the results of the Diffeomorphic Shape analysis, which yields a distance matrix, one special to it.

Now, I can use a clustering method that takes a distance matrix. k-means is an option in principle, but it basically assumes Euclidean distances (it works from cluster centroids computed on the observations), so I don't want to use that. Tree-based clustering methods such as hierarchical clustering (HCA) or neighbour joining (NJ) can take any distance matrix, but I think it's tricky to evaluate the congruence of a tree with the ground truth of three classes.

NJ tree: [image: Rplot02]

hclust(ward.D2): [image: Rplot]

How do I evaluate this? Using cutree with k=3? It would not show a very good result.
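One way to put a number on how well a three-way cut of a tree agrees with the known classes is the adjusted Rand index, which compares two partitions without needing the cluster labels to line up. The sketch below is in Python rather than R and uses only the standard library. The function name `adjusted_rand_index` and the toy labels are made up for illustration; they are not the Mantamados data, and this is not necessarily how the 71% figure elsewhere in this thread was computed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between two labelings of the same items.

    1.0 means the partitions are identical up to renaming of the labels;
    values near 0 mean agreement no better than chance.
    """
    if len(truth) != len(predicted):
        raise ValueError("label vectors must have the same length")
    n = len(truth)
    if n < 2:
        raise ValueError("need at least two items")

    # Contingency counts: how many items fall in each (truth, predicted) pair.
    cell_counts = Counter(zip(truth, predicted))
    truth_counts = Counter(truth)
    pred_counts = Counter(predicted)

    # Numbers of item pairs that are together in both, in truth, in predicted.
    pairs_both = sum(comb(c, 2) for c in cell_counts.values())
    pairs_truth = sum(comb(c, 2) for c in truth_counts.values())
    pairs_pred = sum(comb(c, 2) for c in pred_counts.values())
    total_pairs = comb(n, 2)

    expected = pairs_truth * pairs_pred / total_pairs
    maximum = (pairs_truth + pairs_pred) / 2
    if maximum == expected:
        # Both partitions are trivial (one cluster each, or all singletons).
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


# Hypothetical example: nine items in three known classes, and the labels
# obtained by cutting a tree into three groups (one item misplaced).
known_classes = ["small", "small", "small", "medium", "medium", "medium",
                 "large", "large", "large"]
tree_cut = [1, 1, 1, 2, 2, 3, 3, 3, 3]
print(adjusted_rand_index(known_classes, tree_cut))
```

Because the index only looks at which pairs of items are grouped together, the same function can score the output of any clustering method, which would give a common evaluation step downstream of every shape analysis method.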

Then there is k-medoids, implemented as PAM (pam() in the cluster package) in R. That also takes a distance matrix of any sort, so it would seem well suited to our task. But it's not nearly as good: on the Mantamados data (with Euclidean distance) it has an accuracy score of only 71%.

K-medoids: [image: PAM]
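To make concrete what clustering straight from a distance matrix involves, here is a minimal Python sketch of k-medoids followed by an accuracy score against known classes. It is the simple alternating variant (assign items to the nearest medoid, then re-pick each medoid), not the BUILD/SWAP procedure that PAM uses, so it can get stuck depending on the starting medoids. The names `k_medoids` and `best_match_accuracy`, the starting medoids, and the toy distances are all assumptions made for illustration; the accuracy here is the fraction correct under the best one-to-one relabeling of clusters, which may differ from how the 71% above was obtained.

```python
from itertools import permutations


def _assign(dist, medoids):
    """Label each item with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, initial_medoids, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix (list of lists)."""
    medoids = list(initial_medoids)
    if len(set(medoids)) != k:
        raise ValueError("need k distinct initial medoids")
    for _ in range(max_iter):
        labels = _assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the old medoid if a cluster ends up empty.
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the rest.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return _assign(dist, medoids), medoids


def best_match_accuracy(truth, predicted):
    """Fraction correct under the best one-to-one mapping of clusters to classes."""
    classes = list(dict.fromkeys(truth))
    clusters = list(dict.fromkeys(predicted))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes")
    best = 0
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[p] == t for t, p in zip(truth, predicted))
        best = max(best, hits)
    return best / len(truth)


# Toy example: the distances are absolute differences between made-up numbers,
# but any symmetric distance matrix could be used in their place.
values = [0, 1, 2, 10, 11, 12, 20, 21, 22]
dist = [[abs(a - b) for b in values] for a in values]
known = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

labels, medoids = k_medoids(dist, k=3, initial_medoids=[0, 3, 6])
print(medoids, best_match_accuracy(known, labels))
```

Nothing in the sketch assumes the distances are Euclidean: the medoids are always actual items, so only the entries of the matrix are ever used.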

So, here is the conflict. I want to use the best clustering method that any given technique allows; but I also want to use the same clustering method downstream of all shape analysis methods. But what if the best clustering method differs among shape analysis methods? Does anyone have any thoughts about other clustering methods we might use?

For the moment I think I shall just use the best clustering method for each shape analysis method. It then becomes part and parcel of any given shape-analysis method.