ms609 / TreeDist

Calculate distances between phylogenetic trees in R
https://ms609.github.io/TreeDist/

Comparing trees with non-identical tips #92

Closed roblanf closed 1 year ago

roblanf commented 2 years ago

Thanks so much for the amazing package, and particularly the incredible documentation (could be a book??).

The docs suggest that we drop an issue if we have a use-case for comparing trees with non-identical tips, so here I am.

Use case

In phylogenomics we often sample thousands of genes from our taxa of interest, and typically we are missing one or more taxa from most genes. For reference, here's a real-world example of the number of taxa in each gene tree from a published dataset of 8295 genes:

[image: number of taxa in each gene tree]

Since taxa are missing ~randomly, most of the cases with <100% of taxa will have non-identical taxon sets. This dataset is fairly representative: 16% of genes are sampled in all taxa, the rest are not. As a first approximation, roughly 80% of the gene trees (about 6500) are likely to have different taxon sets.

I'd say that this is now very common (near universal) in modern phylogenomic studies. And most empiricists would love to be able to explore these tree sets in detail.

Useful things

The most general would be to get a matrix of normalised pairwise distances. With any suitably normalised distance metric, this should produce meaningful comparisons across all trees. This would also (I assume, maybe wrong?) allow for the visualisation of such tree sets. This seems to fit well within the remit of the package, while the next two perhaps don't.

Another useful thing would be the number of unique trees, perhaps with options for what is meant by unique, e.g.: (i) strictly unique, such that different taxon sets means unique; (ii) unique in the sense of non-conflicting (e.g. RF == 0 after reducing both trees to the common taxon set). Combined with this, grouping the trees into their unique sets would be useful.
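To make the first two ideas concrete, here is a minimal sketch in plain Python (no dependencies). It is not TreeDist or DendroPy code. It assumes trees are written as nested tuples of tip labels, reduces each pair of trees to their shared tips, and normalises the Robinson-Foulds distance by the total number of splits in the two reduced trees, which is only one of several possible normalisations. All function names are hypothetical.

```python
from itertools import combinations

def tips(tree):
    """Return the set of tip labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tips(child) for child in tree))

def splits(tree, keep):
    """Non-trivial bipartitions of the tips in `keep` induced by `tree`.

    Restricting every clade to `keep` gives the same splits as pruning
    the tree to `keep` first.
    """
    keep = frozenset(keep)
    found = set()

    def visit(node):
        if isinstance(node, str):
            below = frozenset([node]) & keep
        else:
            below = frozenset().union(*(visit(child) for child in node))
        other = keep - below
        if len(below) > 1 and len(other) > 1:
            found.add(frozenset([below, other]))
        return below

    visit(tree)
    return found

def rf_on_shared_tips(tree1, tree2):
    """Return (RF distance, normalised RF distance) on the shared tips."""
    shared = tips(tree1) & tips(tree2)
    s1, s2 = splits(tree1, shared), splits(tree2, shared)
    distance = len(s1 ^ s2)   # splits found in exactly one tree
    total = len(s1) + len(s2)  # assumed normalising constant
    return distance, (distance / total if total else 0.0)

def distance_matrix(trees):
    """Symmetric matrix of normalised RF distances between all pairs."""
    n = len(trees)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = rf_on_shared_tips(trees[i], trees[j])[1]
    return matrix

# Three toy gene trees with different taxon sets.
trees = [
    ((("A", "B"), "C"), ("D", "E")),   # tips A-E
    (("A", "B"), ("C", "D")),          # E missing
    ((("A", "C"), "B"), ("D", "F")),   # E missing, F extra
]
matrix = distance_matrix(trees)
for row in matrix:
    print(row)
```

In this toy set the first two trees have RF == 0 on their common tips, so they would count as "non-conflicting" under option (ii) even though their taxon sets differ. Note that a pair sharing fewer than four tips has no informative splits, so a distance of 0 there says nothing about agreement.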

Another thing (again I think beyond the purview of TreeDist, but I mention it in case this is something that may exist as an internal data structure of e.g. an RF calculation) is information on the observed splits in the data. I don't really know how one handles ambiguous splits in this case (e.g. a split on a tree with 42 taxa may be congruent with a large number of possible splits on the full tree of 52 taxa). One option would be to simply distribute the weight of these splits (i.e. a total weight of 1) over all possible splits with which they are congruent. Though perhaps this is too silly. The general point here is that users likely want to know which splits are common in their gene trees, and whether the common splits are all represented on their tree of interest (e.g. a species tree). Related work is on gene concordance factors, which are a summary statistic for this, but can still miss a lot of useful information about gene trees that are discordant with the species tree.
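One simple way to ask "which splits of my tree of interest are seen in the gene trees" is sketched below in plain Python, again with trees as nested tuples and hypothetical function names. It does not distribute weights over all congruent splits as suggested above. Instead it takes the simpler route, close in spirit to a concordance factor: each reference split is restricted to the tips of each gene tree, gene trees where the restricted split becomes trivial are skipped, and the rest are counted as congruent or not.

```python
def tips(tree):
    """Return the set of tip labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tips(child) for child in tree))

def splits(tree, keep):
    """Non-trivial bipartitions of the tips in `keep` induced by `tree`."""
    keep = frozenset(keep)
    found = set()

    def visit(node):
        if isinstance(node, str):
            below = frozenset([node]) & keep
        else:
            below = frozenset().union(*(visit(child) for child in node))
        other = keep - below
        if len(below) > 1 and len(other) > 1:
            found.add(frozenset([below, other]))
        return below

    visit(tree)
    return found

def restrict(split, keep):
    """Restrict a bipartition to `keep`; return None if it becomes trivial."""
    sides = [side & keep for side in split]
    if min(len(side) for side in sides) < 2:
        return None
    return frozenset(sides)

def support(reference, gene_trees):
    """Map each reference split to (congruent, informative) gene-tree counts."""
    ref_tips = tips(reference)
    ref_splits = splits(reference, ref_tips)
    counts = {split: [0, 0] for split in ref_splits}
    for gene in gene_trees:
        shared = tips(gene) & ref_tips
        gene_splits = splits(gene, shared)
        for split in ref_splits:
            reduced = restrict(split, shared)
            if reduced is None:
                continue  # this gene tree cannot speak to the split
            counts[split][1] += 1
            if reduced in gene_splits:
                counts[split][0] += 1
    return {split: tuple(pair) for split, pair in counts.items()}

# Toy species tree on six taxa and three gene trees with missing taxa.
species = ((("A", "B"), "C"), (("D", "E"), "F"))
genes = [
    (("A", "B"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "F")),
    (("A", "B"), ("C", "D")),
]
result = support(species, genes)
for split, (congruent, informative) in result.items():
    sides = " | ".join("".join(sorted(side)) for side in split)
    print(sides, congruent, "of", informative)
```

The first gene tree illustrates the ambiguity described above: its single split AB|DE is congruent with all three splits of the species tree, so it is counted once for each of them rather than sharing a total weight of 1.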

ms609 commented 2 years ago

Hi Rob,

Sorry for the slow reply – it's been a busy start to term.

Thanks for this detailed description of the use case. These certainly sound like tractable problems, and hopefully shouldn't take a great deal of work.

Initial areas on which I'll have to reflect:

Do let me know if you have any thoughts on these, or if there's anything I should consider.

Cheers,

Martin

roblanf commented 2 years ago

Hi Martin,

I want to bring in @jeetsukumaran here - apparently he has already thought through some of this and has some code in Dendropy (https://twitter.com/jeetsukumaran/status/1582243650569785344).

Honestly I don't have good solutions for your second issue. I wonder if it's solvable at all. Though I'd still wager that a visualisation with this limitation in mind may be more useful than no visualisation at all.

Rob

ms609 commented 1 year ago

Hi @roblanf and @jeetsukumaran: As far as I can tell, comparisons with trees containing non-identical tips should now be working – but I've not tried this out "in the real world", so I'd be very grateful for any feedback as you try this out! I'm still working on a way to produce tree space maps based on the resulting distances; I have the mathematics sorted out, but the implementation will take a little time to iron out.

roblanf commented 1 year ago

Oh wow! Thanks so much. I'll take a look with the dataset above.