veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
216 stars 69 forks source link

Pattern of mean ω distribution #758

Closed francicco closed 6 years ago

francicco commented 6 years ago

Hi everyone,

I have a question concerning a study I'm doing on an alpine ant species. I'm comparing it with 21 other species looking for selection in more than 3000 single-copy. I used the aBSREL method for all branches, because I wanted to see the behaviour of the Mean_dNdS between my species and the others. Once Mean_dNdS was plotted:

mean_dnds_ants

I noticed significant shifted distribution in my target species (Talp). I'm interpreting this as many genes are moving away from a general purifying selection, maybe moving, at least, towards the neutrality. I would like to validate this hypothesis by running the RELAX test on my target species, and I found this distribution of K:

k_density

It looks something like a bimodal distribution, in which a group of genes (the ones with K<0.5) are gradually shifting towards the neutrality. Now I would like to run this test also on its sister species (~5Mya apart), to confront their distribution. I should expect to find a distribution with more centered around 1. Do you guys think my interpretation make sense?

Thank a lot in advance. Regards Francesco

spond commented 6 years ago

Dear @francicco,

I would not necessarily expect anything specific for the sister species; did you use the rest of the tree as the reference set of branches? Remember that K is always defined relative to something, and the choice of reference is very important.

Also, keep in mind that all K are inferred with (sometimes significant errors) whose magnitude is not independent of the value of K. When doing kernel density estimation, consider using something that accounts for such error (e.g. https://arxiv.org/pdf/0805.2216.pdf, which I am not sure is directly applicable here).

Best, Sergei

francicco commented 6 years ago

Hi @spond, Yes, as reference I used the rest of the tree.

And this is the K behaviour is a specific set of genes

k_dist_hsps

k_hsp_density

A Wilcoxon signed-rank test shows a significance distribution between Talp and the other species.

My interpretation would be a stronger relaxation in the protein coding genes of this species. Does it make sense?

Best Francesco

spond commented 6 years ago

Dear @francicco,

Based on your plot one could expect differences between Talp and the outgrip species. So, yes, the outgroup should support a distribution with more density near K = 1.

Although, of course you are know doing "data-driven" hypothesis testing (multiple testing on the same data), which will create bias to discover positive results.

Best, Sergei

francicco commented 6 years ago

Although, of course you are know doing "data-driven" hypothesis testing (multiple testing on the same data), which will create bias to discover positive results.

Yes, I'm aware of that, but I will not discuss which gene is in relaxation in the other species, but just that in 'Talp' the purifying selection is smaller, and relaxation of a higher magnitude. Correct? F

spond commented 6 years ago

Dear @francicco,

Yeah, so long as you report that you did aBSREL (which prompted you to), then did RELAX, it's OK. Looks like the signal is quite strong and could survive multiple hypothesis testing corrections.

Best, Sergei

francicco commented 6 years ago

Ok! Thank you @spond

Best, Francesco

francicco commented 6 years ago

Dear @spond, one last clarification.

This are still a bit previsional, yet with a quite good signal. This is the 2D density plot of 3 species

2d_density_3sp

Clearly visible the different behaviour of Talp, which also shows a (I don't know how to call it) deviation along K=1. Those, as you can see here,

2d_density_talp

are Talp branches with a significant sign of selection (aBSREL, corrected for multiple test using Benjamini, Hochberg, and Yekutieli method).

How those branches are along K=1 and not in K>1 as expected? Because of the K error you were mentioning?

spond commented 6 years ago

Dear @francicco,

What is the X axis in this plot? Is it the mean dN/dS inferred under aBSREL?

Best, Sergei

francicco commented 6 years ago

yes

spond commented 6 years ago

Dear @francicco,

This is interesting.

  1. In our RELAX paper we talk about how intensification of positive selection might look like relaxation of selection, as measured by dN/dS. So it is possible to have K < 1 while still increasing dN/dS > 1. It really depends on how much positive selection signal there is in the rest of the tree.

  2. On the Y-axis plot the mean branch dN/dS for talp inferred under the RELAX model. You can compute it from the ω distribution and the K value. I would be curious to see if there is good correspondence between the two.

  3. What proportion of sites is assigned to ω > 1 by aBSREL?

  4. For talp, does aBSREL allocate a lot of weight to low ω values?

  5. What does the ω distribution look like under RELAX?

Best, Sergei

francicco commented 6 years ago

Ok, here we got:

  1. On the Y-axis plot the mean branch dN/dS for Talp inferred under the RELAX model. You can compute it from the ω distribution and the K value.

How do I find it? It looks like it's not in the output log

  1. What proportion of sites is assigned to ω > 1 by aBSREL?

This is how WtOmegaOver1 looks along terminal branches:

wtomegaover1_dist

And this is correlated with branch length:

wtomegaover1vsbranchlength

  1. For Talp, does aBSREL allocate a lot of weight to low ω values?

A lots of point has WtOmegaOver1 = 0. wtomegaover1vsmomega

  1. What does the ω distribution look like under RELAX?

I don't know how to answer this.

best, F

francicco commented 6 years ago

Hi @spond,

  1. In our RELAX paper we talk about how intensification of positive selection might look like relaxation of selection, as measured by dN/dS. So it is possible to have K < 1 while still increasing dN/dS > 1. It really depends on how much positive selection signal there is in the rest of the tree.

I think I understand this point. I'll try to reframe it to see if I understand it. If a locus has strong positive selection in the phylogeny, but less in the testing branch, this branch will show positive selection, but its proportion of ω > 1 will look smaller compared with the rest of the phylogeny. If that is true, it makes totally sense, and should means that if I take allo loci under selection in Talp with K < 1, they should have a smaller WtOmegaOver1 compared to other branches. And on the other side if I take loci under selection in Talp with K > 1, these should be under positive selection in a more species specific manner. Is that right?

Best, F

spond commented 6 years ago

Dear @francicco,

  1. (and 4) To compute the RELAX distribution, read the JSON file output by the analysis and extract the distribution from the model fits object (http://hyphy.org/resources/json-fields.pdf). I am curious to see what proportion of weight is attributed to very low ω values.

  2. Interesting; it seems that talp experience very strong selection over a small subset of sites. This means that the majority of the lineage is still conserved (which would tie back to point 0 above)

I think I understand this point. I'll try to reframe it to see if I understand it. ...

Imagine that on most branches in the tree you have strong conservation, say

ω weight
0.01 90%
0.25 9 %
1.5 1 %

Now say, you have positive selection along a small subset of sites along specific test lineage where the distribution looks like

ω weight
0.1 70%
0.8 25 %
3 5 %

Under the RELAX model, the test branch obtains its &omegas; from the background by raising them to power K, so for K = 2, you will have the test branch looking like (assuming the rest of the tree dominates the likelihood function, so changes to baseline ω are negligible).

ω weight
0.0001 90%
0.0625 9 %
2.25 1 %

and for K = 0.75:

ω weight
0.1 90%
0.5 9 %
1.22 1 %

As you can see, for K = 0.75 the transformed distribution more closely matches the true (but disallowed by the model) desired distribution than for K = 2.

Best, Sergei

francicco commented 6 years ago

Dear @spond

I think I understood the example, but I'm confused about what do you want me to plot. Would you explain it to me again, sorry.

Best, Francesco

francicco commented 6 years ago

Ok @spond,

let me give another try. This are the distribution of the proportion of "purifying selection" (as the lowest rate of omega, among the 3 rates - low, medium and high omega rates) between loci under significant positive selection with K < 1 (right skewed density plot) Vs K > 1 (left density plot).

proportpurselection

According to the Wilcoxon Rank-Sum test the two distributions are significantly different (P = 0.01746). That should prove that K is less then one because the stronger purifying selection?

Am I misinterpreting things?

francicco commented 4 years ago

Dear @spond,

Sorry to keep commenting on this topic. I'm trying to publish the results of this study (here), and I found a referee quite hostile in the way I interpret our results. To summarise. I assembled and analysed an alpine (T alpestre) and and 4 other closely related species. What I found was a significant skew in the distribution of both ω (Figure a) and K (Figure b), with the most related species (T. immigrans) with an intermediate and bimodal distribution of K. Without going too much into details, I interpret these results in this way:

The two evolutionary trajectories were predicted in our hypothesis, and while diversifying selection is somewhat expected in species adapting to a new environmental niche, the magnitude of relaxed selection found is more surprising and never recorded. Although both ω and k distributions in T. alpestre are highly skewed, the absolute number of genes under diversifying and relaxed selection are not markedly greater than other species in similar studies (Cicconardi, Marcatili, et al., 2017; Harpur et al., 2014; Roux et al., 2014). This cannot be directly ascribed to the type I error rate because the same rate should be equally randomly present in all the other short or long branches of the phylogeny. Nevertheless, T. alpestre, as part of a species complex (Steiner et al., 2010; H. C. Wagner et al., 2017), may have more recently diverged from more closely related species not yet adapted to the alpine habitat. Magnitude of this relaxed selection is evidenced by the 70 enriched biological processes, and can be seen as the consequence of a shift and/or decreased magnitude in the purifying selection.

Screenshot 2019-11-14 at 12 32 35 PM

The referee commented my consideration in this way:

by studying a single species thriving in the extreme habitat, it is impossible to infer that the pattern of genome-wide relaxed selection is due to adaptation to this habitat, and not, for instance, to a bottleneck that occurred for any other reason; there is actually no a priori reason to expect genome-wide related selection, outside of a few genes not needed anymore in the new habitat

How can I reply? He/She actually right? I don't really think a bottleneck can really give this king of signal.

Thanks a lot, Best F