Experimental Methodology

rmcar17 commented 1 year ago

For discussion surrounding the best ways to make comparisons between this and other supertree methods.

The current method I have set up for birth-death trees is to generate a model birth-death tree with a specified number of taxa, then randomly sample sets of taxa and extract the induced subtrees until each taxon appears at least some pre-specified number of times (more or less). A number of random NNI operations are then performed on each of these subtrees.
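As a rough illustration of the coverage-based sampling step only (not the actual experiment code), the sketch below draws random taxon subsets until every taxon has appeared a minimum number of times. The function name `sample_taxon_subsets`, the fixed `subset_size`, and the taxon labels are all assumptions made for the example; extracting the induced subtrees and applying the NNI moves are left out.

```python
import random
from collections import Counter


def sample_taxon_subsets(taxa, subset_size, min_coverage, rng):
    """Draw random subsets until each taxon appears in >= min_coverage subsets.

    Hypothetical helper: a fixed subset size is an assumption of this sketch.
    """
    counts = Counter({taxon: 0 for taxon in taxa})
    subsets = []
    while min(counts.values()) < min_coverage:
        subset = frozenset(rng.sample(taxa, subset_size))
        subsets.append(subset)
        counts.update(subset)  # each sampled taxon gains one appearance
    return subsets


rng = random.Random(0)
taxa = [f"t{i}" for i in range(20)]  # placeholder taxon labels
subsets = sample_taxon_subsets(taxa, subset_size=8, min_coverage=3, rng=rng)
```

Because sampling stops only once the least-covered taxon reaches the threshold, most taxa end up appearing more often than `min_coverage`, which matches the "more or less" caveat above.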

What would be more realistic, as discussed during the meeting today, would be to sample within clades: choose an internal node and sample its descendants. (I think I said something about choosing a taxon and finding the n closest; given that the tree is ultrametric, I believe this is equivalent anyway.) Under this method, we would likely still want to sample across the root at least once to ensure there is overlap (though if we don't, the proper cluster graph would be disconnected at the first iteration, meaning that all of the shallowest internal nodes that were sampled would appear adjacent to the root).
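A minimal sketch of the clade-based idea, assuming the model tree is represented as nested tuples of leaf names (an assumption for illustration; branch lengths are ignored). The helpers `sample_within_clade` and `sample_across_root` are hypothetical names, and forcing one taxon from each child of the root is just one possible way to "sample across the root".

```python
import random


def leaves(node):
    """Return the leaf names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def internal_nodes(node):
    """Return every internal node (tuple) in the tree, root included."""
    if isinstance(node, str):
        return []
    found = [node]
    for child in node:
        found.extend(internal_nodes(child))
    return found


def sample_within_clade(tree, subset_size, rng):
    """Pick a random internal node with enough descendants, then sample them."""
    candidates = [n for n in internal_nodes(tree) if len(leaves(n)) >= subset_size]
    clade = rng.choice(candidates)
    return frozenset(rng.sample(leaves(clade), subset_size))


def sample_across_root(tree, subset_size, rng):
    """Take one taxon from each child of the root, then fill up at random."""
    picked = [rng.choice(leaves(child)) for child in tree]
    rest = [t for t in leaves(tree) if t not in picked]
    picked += rng.sample(rest, subset_size - len(picked))
    return frozenset(picked)


tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
rng = random.Random(1)
clade_subsets = [sample_within_clade(tree, 3, rng) for _ in range(5)]
root_subset = sample_across_root(tree, 3, rng)
```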

We also need to change how the NNI operations are handled, making the probability that they are performed proportional to the inverse of the branch lengths.
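One way this weighting could look, as a sketch under stated assumptions: internal edges are identified by arbitrary labels, all branch lengths are strictly positive, and edges are drawn with replacement. The function `choose_nni_edges` is a hypothetical name; performing the NNI rearrangement itself is not shown.

```python
import random


def choose_nni_edges(branch_lengths, num_moves, rng):
    """Choose internal edges for NNI with probability proportional to 1 / length.

    branch_lengths maps an edge label to its (strictly positive) length.
    """
    edges = list(branch_lengths)
    weights = [1.0 / branch_lengths[edge] for edge in edges]
    # random.choices normalises the weights and samples with replacement
    return rng.choices(edges, weights=weights, k=num_moves)


rng = random.Random(0)
lengths = {"short_edge": 0.01, "long_edge": 1.0}  # placeholder values
chosen = choose_nni_edges(lengths, num_moves=1000, rng=rng)
```

Short internal branches, which are the ones most likely to be resolved incorrectly in practice, are therefore rearranged far more often than long ones.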

GavinHuttley commented 1 year ago

Here's the response I received from Ziheng:

Evolver in paml has an option of generating trees under the birth-death-sampling model.

I think for examining the statistical properties of tree inference methods, you should use fixed trees, rather than generating random trees by simulation. This is called frequentist simulation, in which the parameters (like the tree) are fixed when replicate datasets are generated. Some features of the tree may affect results, so you can use a few different trees (symmetrical versus asymmetrical trees, deep versus shallow trees etc), but for each fixed tree or for each fixed set of parameter values, you generate 100 or 1000 replicate datasets.

Some people in phylogenetics use random trees, but I think that is wrong. In such Bayesian simulation, every replicate data set is generated using a different set of parameter values sampled from a (prior) distribution. Bayesian simulation is useful for validating mcmc programs, but should not be used to evaluate the frequentist properties of inference methods.

We have examples of both in this paper. This also describes simulation of sequence data under the multispecies coalescent model with relaxed clocks.

Flouri T, Huang J, Jiao X, Kapli P, Rannala B, Yang Z: Bayesian phylogenetic inference using relaxed-clocks and the multispecies coalescent. Mol Biol Evol 2022, 39:msac161.

PAML link.

@rmcar17 I'll chat with you about this to interpret what he's saying.

rmcar17 commented 1 year ago

Balancing the Zhu tree wouldn't work, as discussed in today's meeting, because the subtrees would of course not be balanced. A possible alternative is to force the birth-death trees to generate in a perfectly balanced way. We can do this by preventing births from happening at nodes at depth $\log_2(n)$, where $n$ is the number of taxa. This essentially just simulates edge lengths for a perfectly balanced tree.
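A rough sketch of the "simulate edge lengths on a perfectly balanced topology" idea, assuming $n$ is a power of two and that each edge length is an independent exponential draw. Both are assumptions of this example: it is not a birth-death simulator, and the resulting tree is not ultrametric. The function name `balanced_newick` and the `t0, t1, ...` labels are hypothetical.

```python
import itertools
import random


def balanced_newick(num_taxa, rate, rng):
    """Return a Newick string for a perfectly balanced binary tree.

    Every tip sits at depth log2(num_taxa); edge lengths are i.i.d.
    exponential draws with the given rate (an assumption of this sketch).
    """
    depth = num_taxa.bit_length() - 1
    if num_taxa < 2 or 2**depth != num_taxa:
        raise ValueError("num_taxa must be a power of two and at least 2")
    names = itertools.count()

    def build(level):
        length = rng.expovariate(rate)
        if level == depth:  # no further births allowed at this depth
            return f"t{next(names)}:{length:.4f}"
        left = build(level + 1)
        right = build(level + 1)
        return f"({left},{right}):{length:.4f}"

    return f"({build(1)},{build(1)});"


newick = balanced_newick(8, rate=1.0, rng=random.Random(0))
```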

rmcar17 commented 1 year ago

Another alternative would be to replicate experiments along the lines of https://academic.oup.com/mbe/article/34/9/2408/3925278?login=false, including the matching distance as a metric.

rmcar17 commented 1 year ago

Decided upon. Moved the relevant work to the analysis repo.

rmcar17 / SpectralClusterSupertree

Experimental Methodology #16