Closed melop closed 3 years ago
I agree that rooting the tree with an arbitrary in-group sample is presumptuous and misleading. A clear outgroup should be used to root the tree.
But RaTG13 is not the ancestor, although, being in the same species, it is probably closer to the common ancestor than SARS-CoV-2 is. I think the trees https://nextstrain.org/groups/blab/sars-like-cov and https://nextstrain.org/groups/blab/beta-cov are fine; maybe just make the font of the link "Phylogenetic context of nCoV in SARS-related betacoronaviruses can be seen here" on the ncov page larger?
An explanation of how to translate the number of mutations into a SARS-related divergence would also help (add up the number of mutations on each branch from the common ancestor, then divide by 30000, the approximate genome length).
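The conversion suggested in this comment can be sketched in a few lines. This is a minimal sketch, assuming the round genome length of 30000 nt used in the comment; the function name and the per-branch mutation counts are hypothetical and are not taken from Nextstrain.

```python
GENOME_LENGTH = 30_000  # assumption: round figure from the comment above


def divergence_from_ancestor(branch_mutations, genome_length=GENOME_LENGTH):
    """Sum the mutations on each branch between the common ancestor and a tip,
    then divide by the genome length to get substitutions per site."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return sum(branch_mutations) / genome_length


# Hypothetical mutation counts on three consecutive branches leading to one tip.
path = [3, 1, 2]
print(f"{divergence_from_ancestor(path):.4%}")  # 6 / 30000 = 0.0200%
```

The result is in substitutions per site, the same unit as the divergence axis on the SARS-like trees linked above.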
bat/Yunnan/RaTG13/2013 is the closest known outgroup of human SARS-CoV-2, so it can be used to root the human SARS-CoV-2 tree. It does not need to be the direct ancestor; it only needs to share a common ancestor with human SARS-CoV-2. This is analogous to rooting the human population tree with a chimpanzee outgroup. In fact, an outgroup is a sample that you know for sure does NOT belong to the group of interest that you are studying. See https://en.wikipedia.org/wiki/Outgroup_(cladistics) .
This is also clear from https://nextstrain.org/groups/blab/sars-like-cov .
Using a wrong root will screw up not only the topology but also the divergence time estimates.
Mathematically, choosing an outgroup as the root is wrong when you already have a very good approximation of the common ancestor. Imagine RaTG13 is at the origin in 2D, and you choose a point far away (Wuhan patient 0), draw a small circle around it, and sample 2000 random points in that disk. The sampled point closest to the origin will never be close to the center of the circle. This is exactly what you'll get if you choose RaTG13 as the root.
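The geometric picture in this comment can be checked numerically. This is a minimal sketch, assuming points drawn uniformly from a disk; the center, radius, and seed are illustrative assumptions. It only illustrates the 2D analogy and says nothing about whether the analogy carries over to outgroup rooting.

```python
import math
import random


def sample_disk(center, radius, n, rng):
    """Draw n points uniformly from the disk of given center and radius."""
    cx, cy = center
    points = []
    for _ in range(n):
        r = radius * math.sqrt(rng.random())  # sqrt gives uniform density by area
        theta = 2 * math.pi * rng.random()
        points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return points


def closest_to_origin(points):
    """Return the sampled point with the smallest distance to (0, 0)."""
    return min(points, key=lambda p: math.hypot(p[0], p[1]))


rng = random.Random(0)                # fixed seed, for reproducibility only
center, radius = (100.0, 0.0), 1.0    # assumed: disk far from the origin
points = sample_disk(center, radius, 2000, rng)
nearest = closest_to_origin(points)
offset = math.hypot(nearest[0] - center[0], nearest[1] - center[1])
print(f"closest point lies {offset:.3f} radii from the disk center")
```

With 2000 samples, the point nearest the origin sits close to the rim of the disk rather than near its center.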
Rooting with an outgroup is standard practice in phylogenetics. The root need not be "at the center of the circle". The average divergence between RaTG13 and human SARS-CoV-2 is less than 4%, that is, fewer than 4 differences in every 100 bp. This level of divergence will not cause problems with sequence alignment or with saturation/homoplasy.
I sent this thread to an evolutionary biologist and to a genomic anthropologist who studies pathogen evolution https://clas.uiowa.edu/anthropology/people/drew-kitchen . The unnamed evolutionary biologist thought it was important to include one or more outgroups to establish the earliest branches in a tree, but deferred to Kitchen's greater expertise on this particular subject. Kitchen said, "There are some complications with using outgroups in an analysis of a population that arise, especially with dating. The issue is that the outgroup is definitionally not part of the population, which means the coalescent assumptions are invalidated - this inflates population size estimates and date estimates. Additionally, if you are using dates to estimate rates, you can only use in-group (i.e., population) samples to estimate the rate as there is rate decay over time - intra-population substitution rates are higher than between species/population substitution rates. That means the branch from the MRCA of the population sample to the outgroup is the product of substitutions that are occurring at different rates than the substitutions that are segregating in a population."
As a two-step way to first determine the most recent common ancestor and then estimate dates, he says, "One work around would be to use an ML/Bayesian analysis with an outgroup to determine the topology near the MRCA of the in-group. This could then be fixed in the dating analysis, using the sampling dates of the in-group sequences."
I agree. This is how it should be done. Only exclude the outgroup at the dating stage, but not prior to that.
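The "exclude the outgroup only at the dating stage" step can be sketched as follows. This is a minimal sketch, assuming a tree that has already been rooted with the outgroup and is stored as nested tuples; the topology and the sample names other than RaTG13 are hypothetical, and a real analysis would use ML/Bayesian software rather than this helper.

```python
def prune_leaf(tree, leaf):
    """Return a copy of `tree` without `leaf`, collapsing any node that is
    left with a single child. A tree is a leaf name (str) or a tuple of subtrees."""
    if isinstance(tree, str):
        return None if tree == leaf else tree
    kept = [sub for sub in (prune_leaf(child, leaf) for child in tree)
            if sub is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # a unary node carries no information; drop it
    return tuple(kept)


# Hypothetical topology: the outgroup attaches at the root of the in-group.
rooted_with_outgroup = ("RaTG13", (("sampleA", "sampleB"), ("sampleC", "sampleD")))

# Drop the outgroup; the in-group keeps the root position the outgroup defined.
ingroup_tree = prune_leaf(rooted_with_outgroup, "RaTG13")
print(ingroup_tree)  # (('sampleA', 'sampleB'), ('sampleC', 'sampleD'))
```

The pruned tree would then be the fixed topology passed to the dating analysis together with the in-group sampling dates.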
It is also important to collapse nodes with no statistical support to avoid misleading interpretations.
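Collapsing unsupported nodes can be sketched as below. This is a minimal sketch, assuming each node is a dict with a `name`, an optional `support` value between 0 and 1, and an optional `children` list; the threshold, the tree, and the support values are hypothetical, and branch lengths are ignored.

```python
def collapse_low_support(node, threshold):
    """Return a copy of the tree in which every internal node (other than the
    root) whose support is below `threshold` is replaced by its children,
    producing a polytomy at its parent."""
    new_children = []
    for child in node.get("children", []):
        child = collapse_low_support(child, threshold)
        is_internal = bool(child.get("children"))
        # Assumption: an internal node with no support value counts as unsupported.
        if is_internal and child.get("support", 0.0) < threshold:
            new_children.extend(child["children"])  # promote the grandchildren
        else:
            new_children.append(child)
    result = {k: v for k, v in node.items() if k != "children"}
    if new_children:
        result["children"] = new_children
    return result


# Hypothetical tree: clade X is well supported, clade Y is not.
tree = {"name": "root", "children": [
    {"name": "X", "support": 0.95, "children": [{"name": "A"}, {"name": "B"}]},
    {"name": "Y", "support": 0.30, "children": [{"name": "C"}, {"name": "D"}]},
]}

collapsed = collapse_low_support(tree, threshold=0.70)
print([c["name"] for c in collapsed["children"]])  # ['X', 'C', 'D']
```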
There are differing opinions about whether representations should collapse poorly supported branches (because we believe there is little support for their existence) or whether they should include them (because, theoretically, the unknowable true genealogy is a bifurcating tree and we should just hold in our head that parts of the tree are uncertain). This is an ongoing point of minor contention in phylogenetics/phylodynamics, as seen in some people choosing consensus tree representations while others use maximum clade credibility representations (which take the single topology that maximizes support across all branches as the representation that is annotated).
However you do it, everyone should be cautious when interpreting parts of the tree/genealogy that are uncertain. Having said that, uncertainty in parts of the genealogy does not invalidate inferences made from the overall shape of the genealogy (such as population sizes/number of infections or migration matrix estimates).
Just my $0.02
Closing this issue. I will note that rooting with RaTG13 led to spurious conclusions in the manuscript by Foster et al. (https://www.pnas.org/content/117/17/9241), which was widely criticized (https://www.pnas.org/content/117/23/12522, https://www.pnas.org/content/117/23/12518, https://www.pnas.org/content/117/23/12520).
I recommend rooting the phylogeny with the RaTG13 genome from Rhinolophus affinis, based on the latest finding that this is the most closely related genome to human SARS-CoV-2 currently known (Andersen et al., 2020, Nat. Med., https://www.nature.com/articles/s41591-020-0820-9). The current arbitrary choice of root could cause problems in the dating of internal nodes.