Intraspecific Data Analysis in one Population

DonBCilly commented 8 years ago

Dear All,

I am interested in identifying specific codons under selection from a dataset of 5 different genes, sampled from within the same species across a wide geographical area. The dataset consists of around 60 sequences per gene, with two sequences per individual site from 30 locations across Europe.

When I run the analysis through the DataMonkey server, several of the methods identify codons under positive selection. However, I am worried that my phylogenies are not very robust, as they contain only intraspecific data and often have very low bootstrap values for splits on the tree.

From reading around in the manuals, it seems like the IFEL approach would fit my purposes the best. However, as far as I understand, this approach was initially implemented to analyse two distinct subpopulations of viruses within separate hosts, making the phylogenetic inference of the two populations very robust, since they very obviously inhabit different selective environments.

Would it be correct to apply the same logic using outgroup sequences from recently diverged species in order to ask the question "is there evidence of selection following the split of these lineages within my species of interest?"

Apologies for my very rudimentary understanding of phylogenetics and thanks a lot in advance!

Don.

spond commented 8 years ago

Dear @DonBCilly,

Here are my thoughts (in no particular order):

If you are interested in finding sites under selection, then don't worry too much about the uncertainty in the phylogeny. We and others (Ziheng Yang and his group, for example) have done a number of computational experiments to arrive at the conclusion that site-detection methods are robust to errors in phylogenies. Your topology generally needs to be really wrong to affect the results of inference.
IFEL (and other methods which can focus only on internal branches, like BUSTED) is not going to do much to rectify the topological errors, but what it will do is remove some of the bias in dN/dS inference on intra-specific data. Essentially (see 1 and 2, for example) evolution within individuals is confined to terminal branches, and will include "unresolved" substitutions, i.e. non-synonymous changes that are either neutral (but have not been filtered by selection) or adaptive only within the context of the specific individual (e.g. HIV fixing immune escape conditioned on the genotype of the host). By looking only at internal branches, you remove much of the evolutionary noise.
Because all of the models used by us are time-reversible, adding an outgroup is not going to help "polarize" the substitutions. If you don't care about sites (but want to ask a gene-level question), then you should use BUSTED (restricted to internal branches), and for site-level inference, you should use IFEL. Use test.datamonkey.org to run BUSTED.
There is a version of FEL that compares two distinct populations, but it's not IFEL, and it's not implemented in Datamonkey.
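
To make the internal vs. terminal branch distinction above concrete, here is a minimal Python sketch that walks a Newick string and marks every internal branch with a brace tag. The function names (`tag_internal_branches`, `leaf_names`), the tag text `Internal`, and the toy tree are all made up for illustration. Placing a `{tag}` after the node label is an assumption about how a branch set might be passed to the analysis, so check it against the documentation of the tool you run. The sketch also assumes unquoted labels and an unrooted-style tree (three or more children at the root).

```python
import re


def tag_internal_branches(newick: str, tag: str = "Internal") -> str:
    """Return the Newick string with {tag} added to every internal branch.

    An internal branch is the one leading to a node that closes a
    parenthesised group. The outermost group is the root, which has no
    branch above it, so it is left untagged.
    """
    out = []
    depth = 0
    i = 0
    n = len(newick)
    while i < n:
        ch = newick[i]
        if ch == "(":
            depth += 1
            out.append(ch)
            i += 1
        elif ch == ")":
            depth -= 1
            out.append(ch)
            i += 1
            # Copy an optional internal node label (a name or support value).
            j = i
            while j < n and newick[j] not in "():,;{":
                j += 1
            out.append(newick[i:j])
            i = j
            if depth > 0:  # skip the root
                out.append("{" + tag + "}")
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def leaf_names(newick: str) -> list:
    """Return leaf labels: names that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Toy tree, hypothetical data: five sequences, two internal branches.
tree = "((A:0.1,B:0.2):0.05,(C:0.1,D:0.1):0.05,E:0.3);"
tagged = tag_internal_branches(tree)
print(tagged)
print(len(leaf_names(tree)), "terminal branches,",
      tagged.count("{Internal}"), "internal branches")
```

For a fully resolved unrooted tree with n sequences there are n terminal branches and n - 3 internal ones, so a 60-sequence alignment still leaves 57 internal branches to test once the terminal branches are set aside.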

Best, Sergei

DonBCilly commented 8 years ago

Dear Prof. Pond,

Many thanks for your prompt reply. I will give BUSTED a go and see what I can interpret :-)

Cheers,

Don.

veg / hyphy

Intraspecific Data Analysis in one Population #418