veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
211 stars 69 forks source link

aBSREL: 100% positively selected sites with high omega values #1628

Closed Emilyaoc closed 11 months ago

Emilyaoc commented 1 year ago

Hello,

I have a few questions about how to interpret results from aBSREL that indicate 100% of sites on a branch are under positive selection (significant according to the uncorrected P-value from the LRT). I note that in my data these cases often also have very high omega values (e.g. 10000000000). I understand that very high omega values can come about when the synonymous substitution rate is negligible. So I can see how this may occur over a few codons. But it seems less likely to be the case over several hundred codons (most of the genes in my dataset are at least a 300 codons long). Biologically, I also question whether every codon across a gene is ever likely to be under strong positive selection. Would you be able to help me understand how aBSREL results of 100% of sites being under positive selection with omega values of 10000000000 is likely to come about? I'm not sure if these results indicate a problem with my data or are evidence of strong positive selection?

Many thanks for any thoughts or help you can offer.

Emily

spond commented 1 year ago

Dear @Emilyaoc,

This usually indicates a very short branch with only non-synonymous changes ascribed to it. Check the Branch Length associated with this branch.

Best, Sergei

Emilyaoc commented 1 year ago

Yes, it does seem to often be associated with short branch lengths. Am I right in thinking that, generally, if you have sequences from some closely related species included in your dataset, then these are often likely to result in short terminal branches? In which case the chance of getting results with 100% positively selected sites & very high omegas may be linked to sampling?

I can see why short branch lengths with only non-synonymous changes could generate a very high omega estimate, but I'm not sure I follow why this is likely to also lead to all sites having the same rate category? I could more see that sequences with just a few substitutions (leading to short branches?) all of which are non-synonymous would lead to very high omegas but across a small proportion of sites. Or am I getting confused with the differences between counts and the estimated rates of substitutions here?

Thank you for your patience with helping me understand and apologies if I am missing something obvious.

spond commented 1 year ago

Dear @Emilyaoc,

The counterintuitive part is that perfectly conserved sites (no variation), can provide no useful information on rate ratios. That's because when divergence is 0 (as it would be for invariable sites), we can't extract any information about dN/dS or any other ratio: it's like asking what fraction of "0" came from non-synonymous changes. So you effectively have exclusively non-synonymous sites driving the estimate to infinity.

Best, Sergei

Emilyaoc commented 1 year ago

Dear Sergei,

Okay, thank you for explaining. Does that mean then that invariant sites are effectively ignored? So, if most sites are invariant and the others have only non-synonymous substitutions you could end up with all sites being estimated to be under positive selection with a very high omega category?

Something I notice with the aBSREL results from my dataset (400 genes with 48 sequences per gene) is that the proportion of sites under positive selection is never between 0.45 to 0.99 (see attached figure: note this is just the data for cases of significant positive selection according to the uncorrected P-value from the LRT and that I capped omega at 50 so any values from 50 onwards have been set to 50 just for plotting). Is the fact that the proportion of positive sites is never between 0.45 to 0.99 just an odd feature of my data? Or something inherent to how aBSREL estmiates the proportion of positively selected sites?

Thank you for any thoughts you have

Emily Rplot fig

spond commented 1 year ago

Dear @Emilyaoc,

Invariant sites contribute (importantly!) to divergence level estimation, i.e. branch lengths, and some rate estimation. For example, if you allow synonymous rate variation, invariant sites will drive "low" dS (and dN!) rates.

But yes, for short branches, where there are a lot of invariable sites, and only non-synonymous substitutions, you could find positive selection results (assuming it clears p-value testing). In some sense, this is expected; and is plausibly correct. One thing to look out for would be a single multi-nulcotide hit (e.g. AAAACC), which could inflate false-positive rates. You can always add --multiple-hits Double or --multiple-hits Double+Triple to see what effect such MN changes might have.

There's no statistical reason why the proportion of sites under positive selection would not be between 0.45 and 1. For example (70% with dN/dS = 10), given enough data...

hyphy SimulateMG94.bf --site-variation gdd --tree trees/2.nwk --gdd-distribution 0.0,0.3,10,0.7 --output sims/2 --replicates 10 --sites 5000  
hyphy absrel --alignment sims/2.replicate.1.nex 
...

### Testing selected branches for selection

|              Branch               |  Rates   |     Max. dN/dS     |      Test LRT      |Uncorrected p-value |
|-----------------------------------|----------|--------------------|--------------------|--------------------|
|                 B                 |     2    |    9.98 (66.34%)   |       572.71       |       0.00000      |

Your plot is quite curious though. Can you do a heatmap, or hexplot binning by ω and branch length? My guess is that most of the 1.0 points are from short branches.

Best, Sergei

Emilyaoc commented 1 year ago

Dear Sergei,

Here is the same plot with the datapoints coloured by branch length. Just to check, for branch length I used the "Full adaptive model" number from within the branch attributes. Is that correct? Note that all branch lengths are fairly short as these results are just from the 'leaves' (terminal branches) and my dataset contains many pairs of closely related species.

From the plot, perhaps not all cases of 100% positively selected sites are explained by short branches? But when it is the case that there's a very short branch and 100% positively selected sites (with a significant p-value), does that mean that I should interpret this to reflect many invariant sites and evidence of positive selection where there is variation?

I see that sites make an important contribution to the divergence estimation. But am I right in thinking that they do not necessarily contribute to the proportion of sites under positive selection then? It seems many variable sites with a higher rate of non-synonymous versus synonymous substitutions could lead to 100% positively selected sites, but so could the scenario of many invariable sites, and only non-synonymous substitutions. Have I understood this correctly?

Thank you for your help

Emily

PS could I check what units the branch lengths are in?

Branch lengths plot

spond commented 1 year ago

Dear Emily,

I suppose one way to think of this would like trying to estimate the slope of your linear regression where a large number of points are at (0,0), i.e. "invariable". All the information about the slope really comes from points that are not (0,0), i.e. "variable" sites. It does not really matter what fraction of the points is at (0,0); they are all perfectly explained by any slope, as long as the line goes through (0,0). For short branches, you effectively only have a couple of "not on zero" points, and if you have no synonymous changes, all those points are straight up the y-axis (x or dS = 0). Not a super-precise analogy, but it may help.

We used to have a "bivariate" model, i.e. a model where synonymous and non-synonymous rates were correlated, so you could estimate (0,0) and (x,y) rates for (dS, dN), but it's not included in the current set of selection models in HyPhy.

If your goal is to more precisely estimate the fraction of sites with dN/dS > 1, especially for shorter branches, aBSREL may not do too well. The "red-flag" would be branches where there are only non-synonymous substitutions inferred. aBSREL does not give you the counts directly, but 100% weight for dN/dS is pretty much a give-away.

Best, Sergei

github-actions[bot] commented 12 months ago

Stale issue message

Buckaroo1 commented 4 months ago

Dear Sergei, 亲爱的谢尔盖,

Okay, thank you for explaining. Does that mean then that invariant sites are effectively ignored? So, if most sites are invariant and the others have only non-synonymous substitutions you could end up with all sites being estimated to be under positive selection with a very high omega category?好的,谢谢你的解释。这是否意味着不变的位点实际上被忽略了?那么,如果大多数站点都是不变的,而其他站点只有非同义替换,那么您最终可能会估计所有站点都处于具有非常高的 omega 类别的正选择之下?

Something I notice with the aBSREL results from my dataset (400 genes with 48 sequences per gene) is that the proportion of sites under positive selection is never between 0.45 to 0.99 (see attached figure: note this is just the data for cases of significant positive selection according to the uncorrected P-value from the LRT and that I capped omega at 50 so any values from 50 onwards have been set to 50 just for plotting). Is the fact that the proportion of positive sites is never between 0.45 to 0.99 just an odd feature of my data? Or something inherent to how aBSREL estmiates the proportion of positively selected sites?我从我的数据集(400 个基因,每个基因 48 个序列)的 aBSREL 结果中注意到,正选择下的位点比例永远不会在 0.45 到 0.99 之间(参见附图:注意这只是显着正选择情况下的数据)根据 LRT 中未校正的 P 值进行选择,并且我将 omega 上限设置为 50,因此从 50 开始的任何值都已设置为 50(仅用于绘图)。阳性站点的比例从来不在 0.45 到 0.99 之间,这只是我的数据的一个奇怪特征吗?或者 aBSREL 如何估计正选站点的比例所固有的东西?

Thank you for any thoughts you have感谢您的任何想法

Emily 艾米丽 Rplot fig

Hi Emily,

I am a new user of aBSREL. Could you tell me how you got the omega values for each site?

Thanks!

Emilyaoc commented 4 months ago

Hi @Buckaroo1, aBSREL does not produce omega values per site but rather estimates how many categories of omega (rate categories) there are at each branch and the proportion of sites in each of these categories. So you get an omega value and a proportion for each of these rate categories for each branch. These values can be found in the .json file within the object for each branch under the "Rate Distributions" key-value pairs. There is a proportion and an omega value for each category. I wasn't familiar with the .json file format before using HyPhy and I found the HyPhy Vision website really helpful as you can upload one off the .json files from aBSREL and get a summary of the results. This helped me to then dig into the .json files and understand where certain key results are stored. Hope that's helpful! Emily