veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
211 stars 69 forks source link

Expected behaviour of multihit and synonymous rate variation in aBSREL #1647

Closed casparbein closed 9 months ago

casparbein commented 1 year ago

Dear @spond,

we read your new article about the implementation of multihits and synonymous rate variation (https://academic.oup.com/mbe/article/40/7/msad150/7217158) with great interest, and did some test runs on a dataset of 2500 transcripts of up to 70 species with BUSTED and aBSREL (HyPhy version 2.5.51). For both programs, we did a run with default parameters and one with MH and SRV enabled. For each alignment, the commands we used were as follows (here depicted is the +MH +SRV case):

## aBSREL
hyphy absrel --alignment ${transcript_id}.fa  \
--tree ${transcript_id}.nh ENV='TOLERATE_NUMERICAL_ERRORS=1;' \
--multiple-hits Double+Triple --srv Yes \ 
--output ${transcript_id}.json &> ${transcript_id}.log

## BUSTED
hyphy busted  --alignment ${transcript_id}.fa \
--tree ${transcript_id}.nh ENV='TOLERATE_NUMERICAL_ERRORS=1;' \
--srv Yes --multiple-hits Double+Triple \
--starting-points 10  \
--output ${transcript_id}.json &> ${transcript_id}.log

In case of BUSTED, we basically get results that are in line with the results discussed in your article: More transcripts with significant p-Values in the default run than the +MH+SRV run and lower p-Values in default mode, but generally better model fit (as measured through delta AICc) for the +MH+SRV run. Here are two histograms showing these patterns.

First, we have the absolute p-Value difference between raw(default) and mh_srv, and the negative part of the histogram is more populated than the positive, indicating that in many cases the p-Value of the default run is lower than that of the +MH+SRV run: grafik

Second, this histogram shows model fit as the difference between raw(default) and mh_srv AICc values, where negative values indicate that the AICc value for the +MH+SRV run was higher than that of the default run. The majority of the values fall into the positive range of the histogram, illustrating that in most cases, +MH+SRV AICc is smaller than the default. grafik

For aBSREL, however, the picture is very different. While for the majority of the branches, p-Values are the same, for the branches where p-Values differ, they often are either significant in the default run or in the +MH+SRV run. So there is not only no clear reduction in significant p-Value branches, putatively removing false positives, but also the addition of new significant branches under the +MH+SRV regime. Moreover, for almost all transcripts, the default model has a lower AICc value than the +MH+SRV model. I'll attach the histograms visualizing this here.

p-Value absolute difference (only for p-Values that differ): grafik

delta AICc values: grafik

The better model fit for the default run could be explained by aBSREL having to fit many more parameters when invoking SRV and MH than BUSTED, something that always gets penalized by statistics like AIC. In your opinion, is this explanation valid? All in all, we are now very much in need of some suggestions on how to interpret these results and recommendations of how to run aBSREL. Also, running +MH+SRV takes up to 10x or more than running in default mode, making runs across thousands of genes very time-consuming. In your experience, is parallelizing an option for speed-up, given that it might introduce some stochasticity into the results (https://github.com/veg/hyphy/issues/1601)?

Thanks a lot in advance, looking forward to hearing your insights.

Cheers, Bernhard

spond commented 1 year ago

Dear @casparbein,

The biggest difference between how MH is handled by BUSTED and aBSREL is that for the former, the multi-hit parameters δ (and ψ) are alignment-wide, whereas for the latter they are branch-specific. So by turning on MH, you get +1 or +2 parameters for BUSTED but +B or +2B (B = number of branches) in aBSREL. That explains the massive AIC penalty you are seeing in the final histogram.

The MH implementation in aBSREL is somewhat experimental and I have not systematically benchmarked it. Since you are doing it (thanks, by the way), I'll add the option for the next release to infer MH rates globally as well for aBSREL so the comparison is more apples to apples.

I would be curious to know what kind of estimates for δ ψ and ω you are getting on branches where aBSREL p-value is NOT significant, but aBSREL+MH IS significant. This information is stored in per-branch dictionaries in the output JSON

image

Best, Sergei

casparbein commented 1 year ago

Dear @spond ,

Thanks for your quick reply!

I checked the estimates for δ, ψ and ω on branches with significant p-Values for only the +mh +srv run. In the test dataset, there are ~1,700 such branches, and in virtually all of them, the omega value as given by Baseline MG94xREV omega ratio is absolutely through the roof, equalling 10,000,000,000. This is the case, however, for both the default and the +mh+srv run, the values very rarely differ. The only difference is the corrected p-Value, which is around 0 in case of +mh+srv run, and 1 for the default run. The odd thing is that rate at which 2/3 nucleotides are changed instantly within a single codon is almost always 0 (there are 42 branches where the double rate is over 0 and 8 branches where the triple rate is over 0).

This looks like the mutlithit functionality interferes with how the p-Value is computed, perhaps in cases where omega estimates are extreme. I'd say that all these cases are artifacts and not branches really under EDS, though no idea why this happens.

Do you have experience with running aBSREL with just SRV turned on? Is that something you would generally recommend? Seeing these 'artifacts' in our benchmark, we will for now proceed and run our screen without enabling mutlithits, but we have not yet tried only SRV.

Cheers, Bernhard

spond commented 1 year ago

Dear @casparbein,

That's definitely odd. Would you mind sharing ONE such example for me so that I could better understand what was going on: input file and JSON outputs for +/- MH runs?

The SRV option should be much more predictable. It simply adds an alignment-wide over-site distribution of α rates. I would expect that its behavior will closely mirror BUSTED+S.

Best, Sergei

casparbein commented 1 year ago

Dear @spond,

I sent you an email containing JSON output, log files and inputs for a transcript where we see 7 branches with significant p-Values in +mh+srv and 0 branches in the default mode.

Looking forward to hearing your insights. Cheers, Bernhard

spond commented 12 months ago

Dear @casparbein,

I was able to track down the source of this bug. It has to do with parameter-heavy models (+SRV +MH) getting "stuck" on parameter boundaries (0 branch lengths). I'm going to update HyPhy to v2.5.55 early next week and the fix will be included.

Thanks again for reporting this issue.

Best, Sergei

pkfsantos commented 11 months ago

Dear @spond, I just found out about this discussion. I was about to start running aBSREL and RELAX. Should I wait for the release of v2.5.55 or would you recommend using the current version (v.2.5.53) without accounting for multiple hits? I also wonder if this same issue applies to RELAX. Thank you, Priscila

spond commented 11 months ago

Dear @pkfsantos,

I am going to release 2.5.55 this weekend, so I would recommend you wait. The same issue is unlikely to apply to RELAX, but if you are encoutering odd behavior, please open up a separate issue, and I will be happy to take a look.

Best, Sergei

github-actions[bot] commented 9 months ago

Stale issue message