ms609 / TreeDist

Calculate distances between phylogenetic trees in R
https://ms609.github.io/TreeDist/

txc extremely low with MDS and still low with tsne #127

Open aguang opened 2 weeks ago

aguang commented 2 weeks ago

First, thanks for this extremely useful package and detailed explanations on tree metrics and visualizations in treespace as well as assessments of those visualizations.

I have a question about the relationship between assessment metrics like the silhouette coefficient and trustworthiness x continuity score and the properties of the trees.

I have a set of 220 trees that I know fall into 11 different clusters (the full set is closer to 11,000, but computing the distances was very slow, so I subsampled 220 of them). I wanted to visualize them in 2D tree space to understand how the clusters of trees differ, so I computed the Clustering Information Distance, following the recommendation in the tree space analysis vignette. I then plotted the trees with PCoA (with cmdscale), tSNE (with Rtsne), and additionally UMAP (with uwot), just to see. The PCoA looked quite good, the tSNE looked interesting, and the UMAP looked rather similar to the tSNE once I modified the spread parameter.
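The workflow described above might look roughly like the following sketch. The random trees from `ape::rmtree()` are stand-ins for the real trees, and the `perplexity`, `n_neighbors` and `spread` values are arbitrary assumptions rather than recommendations.

```r
library("TreeDist")

# Stand-in data (an assumption): 40 random 20-tip trees sharing one tip set.
set.seed(1)
trees <- ape::rmtree(40, 20)

# Pairwise Clustering Information Distance, returned as a `dist` object
distances <- ClusteringInfoDistance(trees)

# PCoA (classical MDS) into two dimensions
pcoa <- cmdscale(distances, k = 2)

# t-SNE on the precomputed distances; the perplexity is an arbitrary choice
tsne <- Rtsne::Rtsne(as.matrix(distances), is_distance = TRUE,
                     perplexity = 10)$Y

# UMAP accepts a `dist` object directly; the spread is an arbitrary choice
umap <- uwot::umap(distances, n_neighbors = 15, spread = 2)
```

Each of `pcoa`, `tsne` and `umap` is then a two-column matrix of coordinates, one row per tree, that can be passed to `plot()` and coloured by known cluster.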

However, the txc score for both the PCoA and the tSNE is very low, well below 0.9, although the score for the tSNE is somewhat higher. Additionally, I tried the silhouette coefficient calculations outlined in the vignette, and got silhouette coefficients below 0.15. I am not really trying to cluster my trees, since I know the clusters a priori, but I expected the coefficient to be higher, reflecting that there is meaningful structure.

Could you help me understand what could be possible reasons for this? I am fairly certain the clusters have distinct features separating them. Each tree has ~1200 tips, so I was wondering if larger trees could result in lower scores due to the exploding number of possible topologies. I am happy to send you either the phylogenies or the distance matrix as well if it would be helpful.

PCoA and txc (colors are different for different known clusters): [figures: pcoa, txc_pcoa]

tSNE and txc: [figures: tsne, txc_tsne]

aguang commented 2 weeks ago

Never mind, I think that for PCoA I just needed to check it for more dimensions.

[figure: txc_pcoa]

But I guess while we are here - if the txc score is higher for tSNE at lower dimensions than for PCoA, does that mean it is a better representation of the distance between the trees in lower dimensions?
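Checking the txc score of a PCoA mapping over an increasing number of dimensions can be sketched as below. This assumes `TreeDist::MappingQuality()` with 10 nearest neighbours, and uses random stand-in trees rather than the data from this issue.

```r
library("TreeDist")

# Stand-in data (an assumption): 40 random 20-tip trees.
set.seed(1)
trees <- ape::rmtree(40, 20)
distances <- ClusteringInfoDistance(trees)

# Request up to 6 PCoA axes; cmdscale() may return fewer (with a warning) if
# fewer eigenvalues are positive, so count the columns actually returned.
mapping <- cmdscale(distances, k = 6)
nDim <- ncol(mapping)

# Trustworthiness x continuity when only the first k axes are retained
txc <- vapply(seq_len(nDim), function(k) {
  mappedDist <- dist(mapping[, seq_len(k), drop = FALSE])
  MappingQuality(distances, mappedDist, neighbours = 10)[["TxC"]]
}, numeric(1))

plot(seq_len(nDim), txc, type = "b", xlab = "Dimensions", ylab = "TxC")
```

The same `MappingQuality()` call can be applied to distances computed from tSNE or UMAP coordinates, so that mappings with the same number of dimensions are compared on the same footing.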

ms609 commented 2 weeks ago

Hi, glad that you are finding the package useful!

Yes, I think it's fair to interpret the mapping with the highest txc score as being the most faithful representation of the true tree-to-tree distances. This doesn't necessarily mean that it will also do the best job of capturing all of the clustering structure, as it seems you have found. Clustering structure has many aspects (volume of clusters, their spatial relationships, distance between clusters, homogeneity of point density within clusters...) and different mappings may do a better or worse job of portraying different aspects.

If you are conducting the silhouette coefficient calculation on the original tree-to-tree distances, this should be a true reflection of the distinctness of the clusters. If you are performing the calculation on the mapped distances, then this will show how much of the clustering structure survived the mapping process – which is likely to be much less than the true structure.
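A minimal sketch of the two calculations, assuming known cluster labels and using `cluster::silhouette()`. The simulated points are a stand-in for real tree-to-tree distances, and `MeanSil()` is a hypothetical helper defined only for this example.

```r
library("cluster")

# Stand-in data (an assumption): 3 known groups of 20 points in 10 dimensions.
# With real trees, `original` would be the tree-to-tree distance matrix.
set.seed(1)
labels <- rep(1:3, each = 20)
points <- matrix(rnorm(60 * 10), nrow = 60) + labels
original <- dist(points)

# Distances after a two-dimensional PCoA mapping
mapped <- dist(cmdscale(original, k = 2))

# Mean silhouette width of the known labels under a given set of distances
MeanSil <- function(d) mean(silhouette(labels, d)[, "sil_width"])

silOriginal <- MeanSil(original)  # distinctness in the original distances
silMapped <- MeanSil(mapped)      # structure that survives the mapping
```

Comparing `silOriginal` with `silMapped` indicates how the mapping has changed the apparent distinctness of the clusters.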

aguang commented 2 weeks ago

It is on the original tree-to-tree distances (Clustering Information Distance). So that would suggest that my clusters are in fact not distinct? That doesn't make a lot of sense to me.

ms609 commented 2 weeks ago

The next thing I'd look at would be the silhouette score of individual points, using cluster::silhouette. plot(silhouette(clusters, distances)), where clusters is an integer vector giving the cluster to which each tree belongs, is probably the best way to get an overview. This should show which clusters are contributing more or less strongly to the overall silhouette score (which is the mean of each tree's individual score).
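As a rough sketch of that per-point check: `cluster::silhouette()` takes an integer vector of cluster memberships followed by the distances. The simulated data below are an assumption, with the third cluster deliberately made more diffuse so that it scores lower.

```r
library("cluster")

# Stand-in data (an assumption): 3 clusters of 20 points in 5 dimensions;
# the third cluster has a larger spread than the other two.
set.seed(1)
clusters <- rep(1:3, each = 20)
points <- matrix(rnorm(60 * 5, sd = rep(c(0.5, 0.5, 2), each = 20)),
                 nrow = 60) + 3 * clusters
distances <- dist(points)

# One row per point: assigned cluster, nearest other cluster, silhouette width
sil <- silhouette(clusters, distances)

# Mean silhouette width within each cluster
perCluster <- tapply(sil[, "sil_width"], sil[, "cluster"], mean)

# Overview plot: one bar per point, grouped by cluster
plot(sil)
```

Clusters with a low value in `perCluster` are the ones pulling the overall mean down.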