veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
205 stars 69 forks source link

Anyway to calculate omega for each branch of the tree? #1535

Closed jzhan6 closed 1 year ago

jzhan6 commented 1 year ago

Dear Dr. Pond,

Thank you for providing such a great and easy-to-use software for evolutionary analysis. Hyphy is amazing! I would like to know if there is any method in hyphy that can calculate omega for each branch in the tree? Thanks

In fact, I found both absrel and fitmg94 output the information. But I found results a bit confusing to me and need your help for understanding:

  1. For example In branch attributes output by FITMG94, mle is 0.03798 but in the absrel output Baseline MG94xREV omega ratio is 0.042491888. They looks the approximate but not the same. So are they the same parameter?

  2. I also found when -srv of absrel turned off, "Full adaptive model (non-synonymous subs/site)" in absrel is the same as non-synonymous in FITMG94 output but when it -srv turned on, two values become different. Which scenario do you recommend to use -srv for absrel and busted?

  3. And I also find correlation between Full adaptive model (non-synonymous subs/site)/Full adaptive model (synonymous subs/site) and MLE from FITMG94 is 1 but values are different. Meanwhile, in FITMG94, there is also dN and dS, the ratio is also highly correlated with MLE and ratio between Full adaptive model (non-synonymous subs/site)/Full adaptive model (synonymous subs/site). I would like to know how MLE is calculated and why it is different from dN/dS.

Sincerely, Jing Zhang

spond commented 1 year ago

Dear @jzhan6,

I would suggest aBSREL or FitMG94 (with the --local option) but you already found these options.

Your question 1

hyphy /path/to/FitMG94.bf --alignment /path/to/file --type local

and

hyphy absrel  alignment /path/to/file

should give you the same results (within numerical tolerance) for dN/dS (because they fit the same baseline model: MG94xREV with branch-specific dN/dS. They are parameterized a little differently: α and β for FitMG94, so ω := β / α and α and ω for aBSREL. So aBSREL directly estimates the ratio and the branch length, while FitMG94 estimates the numerator and the denominator of the ratio separately.

For example

FitMG94 (Standard MG94)

"branch attributes":{
   "0":{
     "BABOON":{
       "Confidence Intervals":{
         "LB":0,
         "MLE":0,
         "UB":0.685870085661975
        },
       "Nucleotide GTR":0.001678913155839398,
       "Standard MG94":0.001816111834328604,
       "dN":1.267551422387835e-10,
       "dS":0.008604010093736934,
       "nonsynonymous":1e-10,
       "original name":"BABOON",
       "synonymous":0.001816111834328604
      },
     "CAT":{
       "Confidence Intervals":{
         "LB":1.240464554575748,
         "MLE":1.603522853504082,
         "UB":2.038209196049299
        },
       "Nucleotide GTR":0.2660900269656866,
       "Standard MG94":0.276585235737379,
       "dN":0.2880841250666078,
       "dS":0.2336069954178026,
       "nonsynonymous":0.2272760851973249,
       "original name":"CAT",
       "synonymous":0.04930915054005409
      },

aBSREL (Baseline MG94xREV)

 "branch attributes":{
   "0":{
     "BABOON":{
       "Baseline MG94xREV":0.001816744801140633,
       "Baseline MG94xREV omega ratio":0,
       "Corrected P-value":1,
       "Full adaptive model":0.001803952324837679,
       "Full adaptive model (non-synonymous subs/site)":1e-10,
       "Full adaptive model (synonymous subs/site)":0.001803952324837679,
       "LRT":0,
       "Nucleotide GTR":0.001678913155839398,
       "Rate Distributions":        [
[0, 1] 
        ],
       "Rate classes":1,
       "Uncorrected P-value":1,
       "original name":"BABOON"
      },
     "CAT":{
       "Baseline MG94xREV":0.2775404548964485,
       "Baseline MG94xREV omega ratio":1.623404261248429,
       "Corrected P-value":0.005332855196122643,
       "Full adaptive model":0.3581510988637404,
       "Full adaptive model (non-synonymous subs/site)":0.3151425034905455,
       "Full adaptive model (synonymous subs/site)":0.04300859537319501,
       "LRT":13.7288401178057,
       "Nucleotide GTR":0.2660900269656866,
       "Rate Distributions":        [
[0.2090139012270241, 0.5296909802887705],
        [5.432830557916017, 0.4703090197112295] 
        ],
       "Rate classes":2,
       "Uncorrected P-value":0.0003555236797415096,
       "original name":"CAT"
      },

Your question 2

You should generally use --srv Yes (which is the default for BUSTED anyway), because as we showed several times, most recently here, synonymous rate variation is widespread. If it doubt, fit with and without --srv enabled, and compare goodness-of-fit (AIC-c) and also whether or not you get the same inference on selection

Your question 3

For a long-ish discussion of the differences between ω and traditional dN / dS, take a look at section 1.4.3 of http://hyphy.org/resources/hyphybook2007.pdf

dN and dS will be scaled version of synonymous and non-synonymous rates, and this scaling will depend on the codon frequencies in the dataset.

Best, Sergei

jzhan6 commented 1 year ago

Thank you for your answer. You are really helpful!!! I still have several questions:

  1. So I would like to make a final confirmation that MLE is estimate of omega, right?

  2. I also would like to ask is why in busted -srv is turned on by default but in absrel -srv is turned off by default ( in 2.5.42). I assume it is always better to turn on -srv on both methods. (Edit: yes, I find absrel with -srv on is extremely slow :D)

  3. Which method is better to identify genes that is positively selected on a selected branch? For me busted and absrel sound very similar to each other. absrel detect whether there are a proportional of sites undergoing positive selection at selected branches and busted test if any sites in the gene undergo positive selection at selected branches. So it sounds like if busted show significance for a gene, absrel should also show significant and vice versa.

  4. Do you have any suggestions which method is good to identify genes with elevated omega at selected branch compared to other selected branches? We find if running by default, we find lots of immune genes undergo positive selection. But we would like to find genes that are with elevated omega at the selected branch but do not have elevated omega in the child branches and ancestor branches which means we confine the background branches to the ancestral branch and child branches. ( I guess I can use FITMG94 to calculate omega for all branches for each gene and find genes satisfying my requirement but I have no clue about statistical test and get pvalue with it)

spond commented 1 year ago

Dear @jzhan6,

  1. Not sure what you are asking here.
  2. In busted, it's turned on by default because we think it should always be on for these types of analyses. For absrel, I haven't done large-scale systematic explorations of the SRV effect, so we kept it off for now for consistency with published papers, and because of performance hits. The --srv option is sort of "experimental" for absrel.
  3. If you have want to have a branch-level resolution (is branch 'X' subject to selection), aBSREL will give you that. If you want to jointly test for selection on >1 branches (is there selection anywhere on my selected branched), busted can be better because it pools information from multiple branches. When you test one branch only, the methods are very similar, but there are differences in how they model "background" branches.. absrel is better for testing a single branch, because it allows a more flexible background distribution, and will not "overdo" the foreground model (busted by default will use 3 rate classes)
  4. Yes, use BUSTED-PH

Best, Sergei

jzhan6 commented 1 year ago

Dear Dr. Pond,

Not sure what you are asking here.

Sorry for not stating the question clearly. MLE is in .json output from FitMG94 which I think is the maximum likelihood estimate of the omega. I want to make sure this guess is correct.

And I have another question, for tree used in FitMG94 should I use rooted tree or unrooted tree and should I include outgroup or no?

spond commented 1 year ago

Dear @jzhan6,

That is correct, the MLE field is indeed the maximum likelihood estimate of ω. If you specify the --ci Yes flag, you will also get approximate (profile likelihood) confidence intervals, like

"CAT":{
       "Confidence Intervals":{
         "LB":1.240464554575748,
         "MLE":1.603522853504082,
         "UB":2.038209196049299
        }

FitMG94 uses time reversible models, so even if you supply a rooted tree, the analysis will be run on its unrooted version. The choice of outgroup is up to you, but generally not needed, unless you are interested in local estimates for the ancestral branch of a clade (then you need an outgroup for the clade).

Best, Sergei

github-actions[bot] commented 1 year ago

Stale issue message