veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
201 stars 68 forks source link

Getting gene wide dN and dS values #1573

Closed ElizabethRobbins closed 1 year ago

ElizabethRobbins commented 1 year ago

Hello,

Can HYPHY calculate gene wide dN and dS values separately?

I have used FUBAR previously, which gives site specific dN and dS values in the output. Would it be appropriate to average all site specific dN values to get a gene wide dN value? And similar for dS?

Thank you in advance, Lizzie

spond commented 1 year ago

Dear @ElizabethRobbins,

This question comes up periodically. A while ago we even wrote a little passage on it in a book chapter (see the box on pg 14 of http://hyphy.org/resources/hyphybook2007.pdf)

HyPhy does not estimate dS and dN separately but rather estimates one of two things

  1. The ω ratio directly
  2. The rates of synonymous and non-synonymous substitutions (α and β)

The two are related to dS and dN in the way described in the linked book chapter. Now, if you want a gene-wide dS and dN estimate under a simple codon model (ω does not vary from site to site or branch-to-branch), then your best bet is FitMG94.bf, which can be found here.

hyphy FitMG94.bf --alignment CD2.nex

....

### **Non-synonymous tree** 
((((PIG:0.1428267548771765,COW:0.1837435716721197)Node5:0.07540645230241928,HORSE:0.1565335253856734,CAT:0.2029527757027807)Node4:0.04766707340051019,((RHMONKEY:0.002756088669461498,BABOON:0.001313839023437547)Node11:0.01921779052523892,(HUMAN:0,CHIMP:0.00135520410116047)Node14:0.01326705738724444)Node10:0.08128279147131144)Node3:0.2104717704029049,RAT:0.0496776467928151,MOUSE:0.08906244413250149)
**Combined tree** 
((((PIG:0.1925706847447909,COW:0.2477380757182482)Node5:0.1016691317151305,HORSE:0.211051271135841,CAT:0.2736374919499469)Node4:0.06426863771014162,((RHMONKEY:0.003715983624720521,BABOON:0.001771424972900691)Node11:0.02591099324429585,(HUMAN:0,CHIMP:0.00182719674583269)Node14:0.01788772928298436)Node10:0.1095920916571899)Node3:0.2837752141101997,RAT:0.06697945680855745,MOUSE:0.1200812541487013)

If you look inside the JSON file that this analysis generates, each branch will have a pair of dS and dN values reported

image

If you wanted to get a dS and dN for the entire gene, you would simply sum up the corresponding attributes over the entire tree. You can do this many ways (e.g. read the json in using Python or R, etc), but here's a simple one liner using the excellent jq utility.

...compute dS....

$jq '.["branch attributes"]["0"] | map(.dS) | add ' CD2.nex.FITTER.json 
2.1082808416221726

...compute dN....

$jq '.["branch attributes"]["0"] | map(.dN) | add ' CD2.nex.FITTER.json 
1.619274321250762

Note that dN/dS will not be the same as the estimated ω ratio. For this specific case, the analysis reports

non-synonymous/synonymous rate ratio =   0.9959

but if you take the dN/dS ratio as computed above, you will get 1.619/2.108 ~ 0.768

HTH, Sergei

ElizabethRobbins commented 1 year ago

Hi @spond,

Thank you for your reply, it was very helpful. I do have a couple of follow up question.

I have run the FitMG94.bf script on an alignment and provided a phylogenetic tree (attached below). Then, as you pointed out, to get the dN and dS of the whole gene I summed the corresponding attributes. However, I noticed that the number of branches it was summing was not what I expected. To my understanding there should be 2n-2 branches in a fully bifurcating tree, where n is the number of leaves. My phylogenetic tree has 773 species and thus there should be 1544 branches. However, only 1258 are being summed. Do you know why this is the case?

Files phylogenetic_tree.txt alignment.txt output.txt

Thank you for your help. All the best, Lizzie

spond commented 1 year ago

Dear @ElizabethRobbins,

Because HyPhy deals (mostly) with time-reversible models, the trees it considers are UNROOTED, which means that there should be 2N-3 branches in such trees.

Further, to increase computational efficiency, HyPhy will first fit a nucleotide model to estimate initial branch lengths. At this stage, the default behavior is to remove all zero-length branches (i.e. create multi-furcations), because this does not change the likelihood or model estimates (there is nothing to be learned from zero-length branches since nothing happens on them).

You can determine whether or not this happened by looking for the following type of output ...

### Obtaining branch lengths and nucleotide substitution biases under the nucleotide GTR model

>kill-zero-lengths –> Yes

### Deleted 276 zero-length internal branches: `Node1001, Node1006, Node1020, Node1023, Node1024, Node1025, Node1026, Node1027, Node1028, Node1029, Node1030, Node1031, Node1033, Node1036, Node1037, Node1038, Node1041, Node1042, Node1044, Node1049, Node1052, Node1053, Node1054, Node1055, Node1058, Node1059, Node1060, Node1063, Node1064, Node1071, Node1072, Node1075, Node1079, Node1080, Node1086, Node1089, Node1090, Node1092, Node1096, Node1101, Node1103, Node1106, Node1111, Node1117, Node1141, Node1144, Node1146, Node1149, Node1152, Node116, Node1169, Node1170, Node1177, Node1178, Node1179, Node1180, Node1181, Node1183, Node1184, Node1185, Node1186, Node1192, Node1197, Node1201, Node1203, Node1209, Node1213, Node1215, Node1222, Node1223, Node1224, Node1225, Node1226, Node1233, Node1236, Node1242, Node1245, Node1247, Node1248, Node1249, Node1257, Node1259, Node1270, Node1281, Node1286, Node1291, Node1297, Node1309, Node1318, Node1327, Node1329, Node1336, Node1346, Node1348, Node1355, Node1357, Node1361, Node1366, Node1369, Node1383, Node1387, Node1388, Node1399, Node140, Node1419, Node1422, Node1433, Node1436, Node1437, Node1438, Node1439, Node1440, Node1441, Node1442, Node1443, Node1448, Node1454, Node1459, Node1460, Node1471, Node1481, Node1489, Node15, Node1511, Node1514, Node152, Node1521, Node1527, Node1529, Node153, Node1531, Node1534, Node1536, Node155, Node158, Node160, Node170, Node171, Node18, Node182, Node2, Node206, Node208, Node212, Node213, Node219, Node235, Node251, Node260, Node280, Node284, Node286, Node299, Node3, Node301, Node31, Node312, Node313, Node320, Node326, Node327, Node335, Node339, Node34, Node345, Node346, Node347, Node348, Node349, Node355, Node36, Node361, Node364, Node365, Node384, Node385, Node387, Node392, Node398, Node4, Node417, Node418, Node419, Node424, Node439, Node443, Node45, Node451, Node454, Node455, Node456, Node457, Node46, Node470, Node471, Node492, Node502, Node503, Node505, Node538, Node550, Node555, Node580, Node595, Node605, Node630, Node635, Node637, Node656, Node657, Node66, Node663, Node665, Node666, Node682, Node683, Node684, Node686, Node687, Node688, Node689, Node695, Node696, Node697, Node701, Node703, Node707, Node710, Node713, Node716, Node722, Node723, Node725, Node726, Node730, Node737, Node738, Node739, Node742, Node743, Node751, Node752, Node757, Node759, Node767, Node768, Node771, Node772, Node786, Node787, Node79, Node8, Node838, Node842, Node843, Node844, Node845, Node846, Node847, Node849, Node861, Node870, Node88, Node89, Node892, Node90, Node922, Node942, Node946, Node948, Node95, Node956, Node975, Node98, Node981, Node983`

Therefore, the total number of branches should be 2*773-3-276 = 1267

I am not sure where 1258 comes from because I 1267 branch records in the output file:

jq '.["branch attributes"]["0"] | length' /Users/sergei/Downloads/alignment.txt.FITTER.json
1267

You can override the default behavior for zero branch lengths using the command line argument

(keep zero-length branches and re-estimate them under the codon model)

--kill-zero-lengths No

(keep zero-length branches and fix them at 0)

--kill-zero-lengths Constrain

When you do this, the analyses will be slower (somewhat) but the result should be the same, and the number of branches -- higher - with the --kill-zero-lengths No setting:

jq '.["branch attributes"]["0"] | length' /Users/sergei/Downloads/alignment.txt.FITTER.json
1543

Best, Sergei

github-actions[bot] commented 1 year ago

Stale issue message