veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org

How to handle likely codon bias in test clade (endosymbionts) #1612

Closed liamfriar closed 11 months ago

liamfriar commented 1 year ago

Hi,

I am using RELAX with a monophyletic group of endosymbiotic bacteria as the test group and all the other bacteria in the larger phylogenetic tree of the bacterial order as the background group. (I left the higher nodes that contain both the endosymbiotic clade and some of the free-living lineages unlabeled.) I know that these endosymbionts have an AT bias, and I am curious whether I need to account for this in the gene trees that I input into HyPhy RELAX, or whether RELAX accounts for this bias itself so it doesn't matter.

In issue #1579, @spond mentions that the input branch lengths can help speed up the algorithm but don't really matter. And both Kosakovsky Pond and Frost, 2005, Mol. Biol. Evol. and the tutorial from Spielman et al, 2017 seem to suggest that codon biases are corrected for.

But in Wertheim et al, 2014, Mol. Biol. Evol., it says:

"The endosymbiotic gamma-proteobacteria tend to artificially cluster in phylogenic inference due to their exceptionally low GC-content. This bias can be corrected by using nonhomogeneous nucleotide substitution models in phylogenetic inference (Galtier and Gouy 1995, 1998). Therefore, we used a phylogeny inferred by Husník et al. (2011), who used this type of nucleotide model in their analysis."

(It looks like Husník et al (2011) used nhphyml http://pbil.univ-lyon1.fr/software/nhphyml/)

So, I just want to clarify to make sure the analyses are as good as possible for my data. Thank you so much for making and maintaining this tool and for responding to multiple questions by me!

spond commented 1 year ago

Dear @liamfriar,

If you have significant within-dataset compositional biases (some sequences with high and some sequences with low GC contents, for example), then there is a risk of confounding with existing methods.
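One way to see how strong the within-dataset compositional difference is, before running any selection analysis, is to tabulate GC content per sequence. The following is only a minimal sketch, not part of HyPhy: it assumes an in-frame codon alignment in FASTA format, and the function names (`parse_fasta`, `gc_fraction`, `gc3_fraction`, `group_means`) are hypothetical helpers introduced here for illustration.

```python
from statistics import mean


def parse_fasta(text):
    """Parse FASTA text into {name: sequence}; the name is the first header token."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases; gaps and ambiguity codes are ignored."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return float("nan")
    return sum(b in "GC" for b in bases) / len(bases)


def gc3_fraction(seq):
    """GC fraction at third codon positions (assumes the sequence is in frame)."""
    return gc_fraction(seq[2::3])


def group_means(records, test_names):
    """Mean GC3 for sequences in test_names versus all remaining sequences."""
    test = [gc3_fraction(s) for n, s in records.items() if n in test_names]
    ref = [gc3_fraction(s) for n, s in records.items() if n not in test_names]
    return {
        "test": mean(test) if test else float("nan"),
        "reference": mean(ref) if ref else float("nan"),
    }
```

A large gap between the two group means (for example, endosymbiont sequences with much lower GC3 than the free-living ones) would indicate the kind of compositional heterogeneity described above. What counts as "large" is a judgment call that this sketch does not settle.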

Almost all methods in HyPhy (except FADE and DEPS, which look for directional selection on amino-acid data) assume that character frequencies are at equilibrium, i.e., that there is no large change in GC (or other compositional) bias over the tree. It is not immediately clear what the cost of violating this assumption might be. Intuitively, the model might compensate for something it cannot directly handle (changing composition) by changing something it can adjust (ω and other rates). I have not conducted systematic studies here, so I can't provide specific guidance.

A statistically principled solution would be to design models which allow frequency shifts over the tree. The simplest such models would be non-reversible codon models, which are not widely (if at all) used in practice. I've been asked to implement something along those lines by a number of collaborators now, so perhaps it's time. There are some technical complications, however.

You could do a couple of things to explore what's happening in your data.

  1. If you can root the tree reliably, check out https://github.com/veg/hyphy-analyses/tree/master/NucleotideNonREV for a simple test of substitution process non-reversibility.
  2. If you have two clades with very different AT/GC biases, one way to check if this is a confounder is to run BUSTED on (a) the joint alignment (high and low AT in the same tree), with branches labeled as 'Test' and 'Reference', and (b) each clade separately.

Then you can compare the ω distributions that are returned for Test and Reference branches in the joint analysis with those where you analyze only Test sequences and only Reference sequences.

Assuming the ω values don't change too much, you could have some confidence that inference is not too biased. If it is biased, report back here, and we can think of next steps.
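The comparison described above can be sketched in a few lines. This is an illustration only: it assumes each ω distribution has already been copied out of the analysis results by hand into a list of `(omega, proportion)` pairs, and the names `summarize` and `compare` are hypothetical. No threshold for "too much" change is implied.

```python
def summarize(dist):
    """Summarize a discrete omega distribution given as [(omega, proportion), ...]."""
    total = sum(weight for _, weight in dist)
    if total <= 0:
        raise ValueError("proportions must sum to a positive value")
    mean_omega = sum(omega * weight for omega, weight in dist) / total
    prop_gt_1 = sum(weight for omega, weight in dist if omega > 1) / total
    return {"mean_omega": mean_omega, "prop_omega_gt_1": prop_gt_1}


def compare(joint, separate):
    """Difference (joint minus separate) in each summary statistic."""
    a = summarize(joint)
    b = summarize(separate)
    return {key: a[key] - b[key] for key in a}


# Made-up numbers, purely to show the call pattern:
test_joint = [(0.1, 0.80), (1.0, 0.15), (5.0, 0.05)]
test_alone = [(0.1, 0.90), (2.0, 0.10)]
shift = compare(test_joint, test_alone)
```

The same call would be repeated for the Reference branches. Summaries like the mean ω hide the shape of the distribution, so it is still worth looking at the individual rate classes for genes where the shift is large.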

Best, Sergei

liamfriar commented 1 year ago

Hi @spond , I have managed to do both of your suggested approaches from above, applying each to ~200 single copy genes from 62 genomes. I am having some trouble with some of the alignments as I mentioned in #1615 . But I now realize that I don't really know how to analyze the results...

Do you have any suggestions for how to collect the results for the ~200 genes for each approach into a useful answer to the original question? (Is nucleotide bias going to be a problem for BUSTED and RELAX?)

spond commented 1 year ago

Dear @liamfriar,

For (1), do you see evidence that NREV (non-reversible) models are preferred?

For (2), compare ω distributions and p-values for both approaches.
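With ~200 genes, one simple way to collect the p-value comparison for (2) is a concordance tally: for each gene, did the joint analysis and the separate analysis make the same significance call? The sketch below assumes the per-gene p-values have already been gathered into a dictionary; the function name `concordance`, the input layout, and the 0.05 cutoff are all assumptions made for illustration, not something HyPhy produces directly.

```python
from collections import Counter


def concordance(pvalues, alpha=0.05):
    """Tally agreement between two analyses.

    pvalues maps gene name -> (p_joint, p_separate).
    Returns (counts, discordant_genes).
    """
    labels = {
        (True, True): "both",
        (True, False): "joint_only",
        (False, True): "separate_only",
        (False, False): "neither",
    }
    counts = Counter()
    discordant = []
    for gene, (p_joint, p_separate) in sorted(pvalues.items()):
        call = (p_joint <= alpha, p_separate <= alpha)
        counts[labels[call]] += 1
        if call[0] != call[1]:
            discordant.append(gene)
    return counts, discordant
```

Genes that land in `joint_only` or `separate_only` are the ones where pooling the two clades changes the conclusion, so they are the natural place to look at the ω distributions in detail. The same kind of tally (fraction of genes below a cutoff) could be applied to the per-gene results from (1).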

You can attach some example stdout output if you'd like, I'll take a look.

Best, Sergei

github-actions[bot] commented 11 months ago

Stale issue message