veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
200 stars 68 forks source link

dN/dS scores from RELAX #1659

Closed liamfriar closed 7 months ago

liamfriar commented 8 months ago

Hi,

I am wondering if there is a way to get the dN/dS values calculated along each branch from the RELAX outputs? Or do I have to run a different program to get those raw numbers?

Thank you!

stevenweaver commented 8 months ago

Dear @liamfriar,

Thank you for your question. While RELAX does calculate ω values specific to each branch, they aren't readily available in standard output files. Only the k parameters are. To obtain these raw ω values would require modifications to the RELAX.bf script.

An alternative approach would be to use aBSREL, which does output inferred rate distributions. However, this might not align with what is inferred in RELAX due to how the branch sets are structured and the additional k parameter in RELAX.

Best, Steven

liamfriar commented 8 months ago

Hi Steven,

Thank you for your response. Do you think aBSREL would be helpful for the following?

I have ~10 ingroup genomes and ~40 outgroup genomes. I expect the ingroup genomes to be experiencing relaxed selection on many genes. To that end, I have run RELAX on about 2000 single copy gene trees across the 50 genomes. That worked great. I am also curious about whether certain metabolic pathways or cellular functions are under more or less relaxed selective pressures. One way to approach this would be to just look at the distribution of "relaxation" vs. "intensification" vs. "not_significant" results from RELAX in the different pathways, but I think that could be confusing given confidence p-values for each individual result. I think it would be interesting to instead get the dN/dS values for each branch from each gene in each pathway and plot these as distributions for each pathway.

So, for instance: if pathway X has geneA, geneB, geneC, then get all the dN/dS values for each of geneA,B,C and plot these all together for ingroup and outgroup and see if there is a significant (probably T-test) difference between the ingroup and outgroup distributions. I believe there will be very few cases where dN/dS > 1 , so if those are rare, I will throw them out to make interpretation of the results more straightforward.

It looks like aBSREL gives a distribution of omega values for each (#1198), so I am not sure how I would parse those results? I could limit the model to 1 omega value per branch?

Any and all thoughts/suggestions would be hugely helpful. hyphy RELAX was great to run and super useful.

liamfriar commented 8 months ago

It looks as though FitMG94 might be the easiest way to get just an omega value for each branch in a gene tree?

liamfriar commented 8 months ago

Alternatively, from the RELAX results, is this the mean omega values for each group of genes?

### Improving branch lengths, nucleotide substitution biases, and global dN/dS ratios under a full codon model
* Log(L) = -8145.95, AIC-c = 16505.96 (106 estimated parameters)
* non-synonymous/synonymous rate ratio for *Reference* =   0.1185
* non-synonymous/synonymous rate ratio for *Test* =   0.1669
* non-synonymous/synonymous rate ratio for *Unclassified* =   0.2065
stevenweaver commented 8 months ago

Dear @liamfriar,

They are omega values for each branch set defined, but they are not the mean, they are computed by maximum likelihood.

For full details of RELAX values in the JSON output, we have a manual here if that helps at all.

Best, Steven

spond commented 8 months ago

Dear @liamfriar,

Base on your description, I would do the following

  1. Run FitMG94 with the local options, like so (updaing the paths as needed)
hyphy FitMG94.bf --alignment CD2.nex --type local

...

### Estimating confidence intervals for dN/dS along each branch

|            Branch            |     Length     |     dN/dS      |Approximate dN/dS CI|
|:----------------------------:|:--------------:|:--------------:|:------------------:|
|             PIG              |     0.192      |     1.345      |   0.966 - 1.809    |
|             COW              |     0.253      |     1.918      |   1.456 - 2.479    |
|            Node5             |     0.103      |     1.482      |   0.836 - 2.304    |
|            HORSE             |     0.209      |     1.245      |   0.926 - 1.635    |
|             CAT              |     0.277      |     1.604      |   1.241 - 2.038    |
|            Node4             |     0.066      |     0.664      |   0.275 - 1.202    |
|           RHMONKEY           |     0.004      |10000000000.0...|0.000 - 10000.000...|
|            BABOON            |     0.002      |     0.000      |   0.000 - 0.686    |
|            Node11            |     0.026      |     0.401      |   0.139 - 0.817    |
|            HUMAN             |     0.000      |     1.000      |0.000 - 10000.000...|
|            CHIMP             |     0.002      |10000000000.0...|0.000 - 10000.000...|
|            Node14            |     0.018      |     0.368      |   0.052 - 0.924    |
|            Node10            |     0.110      |     1.915      |   1.270 - 2.728    |
|            Node3             |     0.290      |     0.432      |   0.317 - 0.573    |
|             RAT              |     0.066      |     1.089      |   0.610 - 1.717    |
|            MOUSE             |     0.122      |     0.525      |   0.343 - 0.755    |
  1. In the resulting JSON file, per branch dN/dS values will be recorded in the Confidence Intervals object, from which you want the MLE. dS and dN are computed as explained in section 1.4.3 of http://hyphy.org/resources/hyphybook2007.pdf, and their ratio will be close to but not identical to the MLE estimate of ω
image

With jq you can extract all the values like so:


jq -r '["Branch", "dN/dS"], (.["branch attributes"]["0"] | to_entries | map ([.key,.value["Confidence Intervals"].MLE])[]) | @csv ' CD2.nex.FITTER.json

...

Branch","dN/dS"
"BABOON",0
"CAT",1.60381523178166
"CHIMP",5910.638940970648
"COW",1.918003612281158
"HORSE",1.24537295438726
"HUMAN",0
"MOUSE",0.5249419169958035
"Node10",1.914890634109143
"Node11",0.4008445971927094
"Node14",0.368278547471397
"Node3",0.4318011219571806
"Node4",0.6637552081371882
"Node5",1.482128238110369
"PIG",1.344721109212511
"RAT",1.08887083731336
"RHMONKEY",10000

Best, Sergei

liamfriar commented 7 months ago

@spond 's last suggestion appears to have worked great. And thank you for the jq one-liner! This issue #898 was also helpful for me to run long sequences of concatenated genes.

spond commented 7 months ago

Dear @liamfriar,

My pleasure. Glad this was helpful.

Best, Sergei