Closed hildebra closed 4 years ago
Dear @hildebra,
You could parse the stdout
, but it's much easier to read the JSON file that is generated by the analysis (see http://hyphy.org/resources/json-fields.pdf for a format guide). The output includes information about sites under selection. For negative selection look in json['MLE']['content']
for the site by site report. The description of individual columns see json['MLE']['headers']
. For example, you could read off sites under selection using high enough values of column 4 (index 3), which is 0:"Posterior probability of negative selection at a site"
.
FUBAR outputs the estimate of dN/dS distribution, in the grid object of the JSON. Each row reports [dS,dN,posterior gene-wide probability]
. You can compute means or other estimates of dN/dS using this distribution
2). You can use the trick of estimating dN/dS (using a different method, like BUSTED) on internal branches only (see issue #418 for example). So long as you are comparing similar estimates (within species for all cases), you should be OK to use these values. You can also take a look at the selection analysis we did in https://www.pnas.org/content/116/50/25057.short (specifically for low divergence gene/species level estimates).
3). You can obtain average estimates in a variety of ways (average individual gene estimates or concat the alignment), although I would suggest not reducing everything to a single value, but rather compare distributions.
Best, Sergei
Hey Sergei, thanks a lot for your fast answer, yes I was looking for a description on how to best get the information I need, thanks a lot for the link to the pdf, I will use this! I will also take a look at your PNAS paper, looks like a really good starting point for me. About the Watterson estimator, I'm sure you know it under a different name (Watterson theta?): https://en.wikipedia.org/wiki/Watterson_estimator It's quite simple to calculate in the end, thought hyphy might also give this out.
thanks a lot, Falk
Dear @hildebra,
Ah, right. You can definitely compute such statistics with HyPhy, since all you'd need is to compute the number of segregating sites, which in most cases is simply the number of variable sites, but there's no standard tool for computing these estimates. One somewhat non-trivial issue is what to do about ambiguous bases, if your alignment has them.
The attached file will compute the estimator for you. Download it and run it as
hyphy /path/to/WattetrsonTheta.txt --alignment /path/to/alignment
Best, Sergei
Dear Sergei, many thanks, I have tested the script now and it works good for my data! All the best, Falk
Dear Sergei, I noticed that the mean dn/ds that is reported in the normal output of fubar is often for ds == 2.815, while dn is variable. In fact this occurs in virtually every case with very few exceptions. Based on this I was going back to your recommendation. I looked into your recommendation to use the grid object, but the dn/ds values in this seem (towards the beginning / end) often very unrealistic. I noticed that in the json['MLE']['content'] something similar might be reported: ["alpha", "Mean posterior synonymous substitution rate at a site"], ["beta", "Mean posterior non-synonymous substitution rate at a site"],
I realize that this is per site, but would the average of alpha/beta in your opinion also serve as approximation to dn/ds? Thanks, Falk
Dear @hildebra,
I am not sure I understand what you mean by ds == 2.815
. Grid values, both for α and β are variable, and have variable weights. Because the grid typically includes values where α = 0, computing dN/dS (β/α) may return infinite values.
Can you give me some examples where unrealistic values are returned (preferably with the entire .json
file)?
Best, Sergei
Dear Sergei,
I attach the log (displayed on stdout) and the json. in the log, mean ds is given as 2.815, and this is a very typical situations (I have been running hyphy fubar over > 100,000 MSA in the meantime). However, not always, but usually 2.815. Therefore I'm now looking in other ways to get ds for my nt MSA. I was thinking to either use the mean value obtained from grid (as you explained initially, see top), or to use the mean value over all sites, alpha & beta. alpha/beta seems intuitively like the better choice to me, but I wanted to double check this with you first. many thanks, Falk
COG0256.hyphy.fubar.unID.log.json.gz COG0256.hyphy.fubar.unID.log
Dear @hildebra,
The dS
reported to the console is simply a value used for calibration, i.e. converting nucleotide branch lengths to codon branch lengths. It is not something you should be using for estimation.
Here's a simple Python script which consumes a JSON file from FUBAR and outputs (posterior) mean dS and dN values
import sys, json
with open(sys.argv[1],"r") as fh:
fubar = json.load (fh)["grid"]
ds = sum (r[0] * r[2] for r in fubar)
dn = sum (r[1] * r[2] for r in fubar)
print ("ds = %g\ndn = %g" % (ds, dn))
print ("Weights")
print ("\tdS > dN : %g" % sum (r[2] for r in fubar if r[0] > r[1]))
print ("\tdS = dN : %g" % sum (r[2] for r in fubar if r[0] == r[1]))
print ("\tdS < dN : %g" % sum (r[2] for r in fubar if r[0] < r[1]))
Because the FUBAR grid contains non-zero weights placed on dS = 0
points, you can't directly estimate the dN/dS
ratio from the grid (it will be infinite), but you can either use dN - dS
, or (although it's not a great way to go), E[dN]/E[dS]
.
If the primary goal of your analysis is to estimate gene-wide dN/dS, then the standard FUBAR is probably not a great way to go because of the way the grid is defined (to look for positive/negative selection).
Best, Sergei
PS On your example JSON, the output for the script is
ds = 4.29592
dn = 3.74012
Weights
dS > dN : 0.501926
dS = dN : 0.0536029
dS < dN : 0.444472
Dear Sergei, thanks a lot for the script, I have been using this now to calculate dnds on my dataset. Since you mentioned that this might be suboptimal, do you suggest another method within hyphy (or elsewhere), if my primary interest would be gene-wide dnds? The site wise average ds and dn value from the json ["MLE"]["content"]["0"] table (col1,col2), would this be completely unuseable as estimates? thanks & sorry for the long follow up, Falk
Dear Falk,
You can compete site-wise average, but some of those can also be infinite. Considering that you already have FUBAR estimates, I would suggest using scaled dN-dS
and comparing it to 0.
import sys, json
with open(sys.argv[1],"r") as fh:
fubar = json.load (fh)
fubar_grid = fubar["grid"]
fubar_post = fubar["MLE"]["content"]["0"]
fubar_lengths = sum (k["Nucleotide GTR"] for (i,k) in fubar["branch attributes"]["0"].items())
global_diff = sum ((r[1]-r[0]) * r[2] for r in fubar_grid)
site_diff = sum (r[2] for r in fubar_post) / len (fubar_post)
print ("Global scaled dN-dS = %g" % (global_diff/fubar_lengths))
print ("Site-averaged scaled dN-dS = %g" % (site_diff/fubar_lengths))
Best, Sergei
Thanks a lot Serei, I will try to run this on my data now! best, Falk
Hey, I'm pretty new to hyphy and thus my questions might sound a bit naiive. I wanted to test a large set of MSA's for positive and negative selected sites, but also get estimates of dn/ds. This is done in automated scripts going over all my alignments. For this I chose fubar as implemented in hyphy. From a few test runs it seems to me that the on screen report (the STDOUT of hyphy) is the only information needed and can be parsed later for the relevant information. However, I was wondering if 1) sites under negative selection are reported somewhere / could maybe reported with a command line flag? 2) if I'm actually doing the right thing, since my MSAs are within the same bacterial species, hence they might represent population variants, not fixed at species level. 3) can I (should I?) combine several MSA's from different genes but the same individuals, if I'm more interested in species wide dnds ratio? 4) is there a way with hyphy (or other command line linux tools) to get a Watterson estimator for my MSA? Any help for my two question would be highly appreciated. All the best, Falk Hildebrand