liamfriar closed this issue 1 year ago
Dear @liamfriar,
If you have significant within-dataset compositional biases (some sequences with high and some sequences with low GC contents, for example), then there is a risk of confounding with existing methods.
Almost all methods in HyPhy (except FADE or DEPS, which look for directional selection on amino-acid data) assume that character frequencies are at equilibrium, i.e., that there is no large change in GC (or other compositional) biases in the data. It is not immediately clear what the cost of violating this assumption might be. Intuitively, the model might compensate for something it cannot directly handle (changing composition) by adjusting something it can (ω and other rates). I have not conducted systematic studies here, so I can't provide specific guidance.
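One way to see how strong the within-dataset compositional differences are is to tabulate GC content per sequence before running any selection analysis. The following is only a minimal sketch under stated assumptions: the alignment is in FASTA format and in frame, and the sequence names and the toy alignment are made up for illustration. It is not part of HyPhy.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dict."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def gc_content(seq):
    """Fraction of G/C among unambiguous A/C/G/T characters (gaps, N ignored)."""
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else float("nan")


def gc3_content(seq):
    """GC fraction at third codon positions, assuming the sequence is in frame."""
    return gc_content(seq[2::3])


# Hypothetical toy alignment; replace with the contents of a real codon alignment.
example = """>free_living_1
ATGGCCGGCCTGGCG
>endosymbiont_1
ATGAATAAATTAGCA
"""

seqs = read_fasta(example)
gc = {name: gc_content(seq) for name, seq in seqs.items()}
for name, value in gc.items():
    print(f"{name}\tGC={value:.2f}\tGC3={gc3_content(seqs[name]):.2f}")
print(f"GC range across sequences: {max(gc.values()) - min(gc.values()):.2f}")
```

A wide GC range between the labeled groups suggests that the equilibrium-frequency assumption is worth checking for that gene.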
A statistically principled solution would be to design models which allow frequency shifts over the tree. The simplest such models would be non-reversible codon models, which are not widely (if at all) used in practice. I've been asked to implement something along those lines by a number of collaborators now, so perhaps it's time. There are some technical complications, however.
You could do a couple of things to explore what's happening in your data.
Run BUSTED on
(a) Joint alignment (high and low AT in the same tree), labeled as 'Test', 'Reference'.
(b) Each clade separately.
Then you can compare the ω distributions that are returned for Test and Reference branches in the joint analysis with those where you analyze only Test sequences and only Reference sequences.
Assuming the ω values don't change too much, you could have some confidence that inference is not too biased. If it is biased, report back here, and we can think of next steps.
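As a rough illustration of that comparison, a discrete ω distribution can be reduced to a couple of summary numbers and compared between the joint and the separate fits. This sketch assumes each distribution has already been extracted by hand as (ω, proportion) pairs. The function names and the numbers are hypothetical and do not come from any real analysis.

```python
def mean_omega(dist):
    """Weighted mean of a discrete ω distribution given as (omega, proportion) pairs."""
    total = sum(p for _, p in dist)
    return sum(w * p for w, p in dist) / total


def selected_fraction(dist, threshold=1.0):
    """Total proportion assigned to rate classes with ω above the threshold."""
    total = sum(p for _, p in dist)
    return sum(p for w, p in dist if w > threshold) / total


def compare(joint, separate):
    """Summarize how much the ω distribution shifts between two fits."""
    return {
        "mean_omega_joint": mean_omega(joint),
        "mean_omega_separate": mean_omega(separate),
        "delta_mean_omega": mean_omega(joint) - mean_omega(separate),
        "delta_selected_fraction": selected_fraction(joint) - selected_fraction(separate),
    }


# Hypothetical Test-branch distributions for one gene (made-up values).
test_joint = [(0.05, 0.80), (0.60, 0.18), (3.00, 0.02)]
test_alone = [(0.04, 0.82), (0.55, 0.17), (2.50, 0.01)]

summary = compare(test_joint, test_alone)
for key, value in summary.items():
    print(f"{key}: {value:.4f}")
```

What counts as "too much" change is a judgment call. Looking at the spread of these deltas across all genes is probably more informative than any single gene.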
Best, Sergei
Hi @spond, I have managed to carry out both of your suggested approaches from above, applying each to ~200 single-copy genes from 62 genomes. I am having some trouble with some of the alignments, as I mentioned in #1615. But I now realize that I don't really know how to analyze the results...
Do you have any suggestions for how to collect the results for the ~200 genes from each approach into a useful answer to the original question (is nucleotide bias going to be a problem for BUSTED and RELAX)?
Dear @liamfriar,
For (1), do you see evidence that NREV (non-reversible) models are preferred?
For (2), compare ω distributions and p-values for both approaches.
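One possible way to turn per-gene p-values from the two approaches into a single summary is to correct each set for multiple testing and cross-tabulate which genes are called significant by each. The sketch below assumes one p-value per gene from the joint analysis and one from the Test-only analysis, already collected into two lists in the same gene order. The Benjamini-Hochberg correction and the 0.05 cutoff are choices made here for illustration, not something prescribed in this thread, and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def concordance(p_joint, p_separate, alpha=0.05):
    """Cross-tabulate per-gene calls from the two analyses after correction."""
    q_joint = benjamini_hochberg(p_joint)
    q_separate = benjamini_hochberg(p_separate)
    table = {"both": 0, "joint_only": 0, "separate_only": 0, "neither": 0}
    for qj, qs in zip(q_joint, q_separate):
        if qj <= alpha and qs <= alpha:
            table["both"] += 1
        elif qj <= alpha:
            table["joint_only"] += 1
        elif qs <= alpha:
            table["separate_only"] += 1
        else:
            table["neither"] += 1
    return table


# Hypothetical p-values for four genes (made-up values, same gene order).
p_joint = [0.001, 0.20, 0.03, 0.60]
p_separate = [0.002, 0.01, 0.40, 0.70]

table = concordance(p_joint, p_separate)
print(table)
```

Many genes in the "joint_only" or "separate_only" cells would be a hint that inference depends on whether the compositionally different clades are analyzed together.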
You can attach some example stdout output if you'd like, and I'll take a look.
Best, Sergei
Hi,
I am using RELAX with my test group being a monophyletic group of endosymbiotic bacteria and the background group being all the other bacteria in the larger phylogenetic tree of the bacterial order. (I left the higher nodes that contain both the endosymbiotic clade and some of the free-living lineages unlabeled.) I know that these endosymbionts have an AT bias, and am curious whether I need to account for this in the gene trees that I input into HyPhy RELAX, or whether RELAX accounts for this bias itself so it doesn't matter.
In issue #1579, @spond mentions that the input branch lengths can help speed up the algorithm, but don't really matter. And both Kosakovsky Pond and Frost, 2005, Mol. Biol. Evol. and the tutorial from Spielman et al, 2017 seem to suggest that codon biases are corrected for.
But in Wertheim et al, 2014, Mol. Biol. Evol., it says:
"The endosymbiotic gamma-proteobacteria tend to artificially cluster in phylogenic inference due to their exceptionally low GC-content. This bias can be corrected by using nonhomogeneous nucleotide substitution models in phylogenetic inference (Galtier and Gouy 1995, 1998). Therefore, we used a phylogeny inferred by Husník et al. (2011), who used this type of nucleotide model in their analysis."
(It looks like Husník et al (2011) used nhphyml: http://pbil.univ-lyon1.fr/software/nhphyml/)
So, I just want to clarify to make sure the analyses are as good as possible for my data. Thank you so much for making and maintaining this tool and for responding to multiple questions by me!