fubar - report neg. sites

hildebra commented 4 years ago

Hey, I'm pretty new to hyphy and thus my questions might sound a bit naiive. I wanted to test a large set of MSA's for positive and negative selected sites, but also get estimates of dn/ds. This is done in automated scripts going over all my alignments. For this I chose fubar as implemented in hyphy. From a few test runs it seems to me that the on screen report (the STDOUT of hyphy) is the only information needed and can be parsed later for the relevant information. However, I was wondering if 1) sites under negative selection are reported somewhere / could maybe reported with a command line flag? 2) if I'm actually doing the right thing, since my MSAs are within the same bacterial species, hence they might represent population variants, not fixed at species level. 3) can I (should I?) combine several MSA's from different genes but the same individuals, if I'm more interested in species wide dnds ratio? 4) is there a way with hyphy (or other command line linux tools) to get a Watterson estimator for my MSA? Any help for my two question would be highly appreciated. All the best, Falk Hildebrand

spond commented 4 years ago

Dear @hildebra,

You could parse the stdout, but it's much easier to read the JSON file that is generated by the analysis (see http://hyphy.org/resources/json-fields.pdf for a format guide). The output includes information about sites under selection. For negative selection look in json['MLE']['content'] for the site by site report. The description of individual columns see json['MLE']['headers']. For example, you could read off sites under selection using high enough values of column 4 (index 3), which is 0:"Posterior probability of negative selection at a site".

FUBAR outputs the estimate of dN/dS distribution, in the grid object of the JSON. Each row reports [dS,dN,posterior gene-wide probability]. You can compute means or other estimates of dN/dS using this distribution

2). You can use the trick of estimating dN/dS (using a different method, like BUSTED) on internal branches only (see issue #418 for example). So long as you are comparing similar estimates (within species for all cases), you should be OK to use these values. You can also take a look at the selection analysis we did in https://www.pnas.org/content/116/50/25057.short (specifically for low divergence gene/species level estimates).

3). You can obtain average estimates in a variety of ways (average individual gene estimates or concat the alignment), although I would suggest not reducing everything to a single value, but rather compare distributions.

Not sure which estimator you are referring to here. Can you provide further details?

Best, Sergei

hildebra commented 4 years ago

Hey Sergei, thanks a lot for your fast answer, yes I was looking for a description on how to best get the information I need, thanks a lot for the link to the pdf, I will use this! I will also take a look at your PNAS paper, looks like a really good starting point for me. About the Watterson estimator, I'm sure you know it under a different name (Watterson theta?): https://en.wikipedia.org/wiki/Watterson_estimator It's quite simple to calculate in the end, thought hyphy might also give this out.

thanks a lot, Falk

spond commented 4 years ago

Dear @hildebra,

Ah, right. You can definitely compute such statistics with HyPhy, since all you'd need is to compute the number of segregating sites, which in most cases is simply the number of variable sites, but there's no standard tool for computing these estimates. One somewhat non-trivial issue is what to do about ambiguous bases, if your alignment has them.

The attached file will compute the estimator for you. Download it and run it as

hyphy /path/to/WattetrsonTheta.txt --alignment /path/to/alignment

Best, Sergei

WattetrsonTheta.txt

hildebra commented 4 years ago

Dear Sergei, many thanks, I have tested the script now and it works good for my data! All the best, Falk

hildebra commented 4 years ago

Dear Sergei, I noticed that the mean dn/ds that is reported in the normal output of fubar is often for ds == 2.815, while dn is variable. In fact this occurs in virtually every case with very few exceptions. Based on this I was going back to your recommendation. I looked into your recommendation to use the grid object, but the dn/ds values in this seem (towards the beginning / end) often very unrealistic. I noticed that in the json['MLE']['content'] something similar might be reported: ["alpha", "Mean posterior synonymous substitution rate at a site"], ["beta", "Mean posterior non-synonymous substitution rate at a site"],

I realize that this is per site, but would the average of alpha/beta in your opinion also serve as approximation to dn/ds? Thanks, Falk

spond commented 4 years ago

Dear @hildebra,

I am not sure I understand what you mean by ds == 2.815. Grid values, both for α and β are variable, and have variable weights. Because the grid typically includes values where α = 0, computing dN/dS (β/α) may return infinite values.

Can you give me some examples where unrealistic values are returned (preferably with the entire .json file)?

Best, Sergei

hildebra commented 4 years ago

Dear Sergei,

I attach the log (displayed on stdout) and the json. in the log, mean ds is given as 2.815, and this is a very typical situations (I have been running hyphy fubar over > 100,000 MSA in the meantime). However, not always, but usually 2.815. Therefore I'm now looking in other ways to get ds for my nt MSA. I was thinking to either use the mean value obtained from grid (as you explained initially, see top), or to use the mean value over all sites, alpha & beta. alpha/beta seems intuitively like the better choice to me, but I wanted to double check this with you first. many thanks, Falk

COG0256.hyphy.fubar.unID.log.json.gz COG0256.hyphy.fubar.unID.log

spond commented 4 years ago

Dear @hildebra,

The dS reported to the console is simply a value used for calibration, i.e. converting nucleotide branch lengths to codon branch lengths. It is not something you should be using for estimation.

Here's a simple Python script which consumes a JSON file from FUBAR and outputs (posterior) mean dS and dN values

import sys, json

with open(sys.argv[1],"r") as fh:
    fubar = json.load (fh)["grid"]

ds = sum (r[0] * r[2] for r in fubar)
dn = sum (r[1] * r[2] for r in fubar)
print ("ds = %g\ndn = %g" % (ds, dn))
print ("Weights")
print ("\tdS > dN : %g" % sum (r[2] for r in fubar if r[0] > r[1]))
print ("\tdS = dN : %g" % sum (r[2] for r in fubar if r[0] == r[1]))
print ("\tdS < dN : %g" % sum (r[2] for r in fubar if r[0] < r[1]))

Because the FUBAR grid contains non-zero weights placed on dS = 0 points, you can't directly estimate the dN/dS ratio from the grid (it will be infinite), but you can either use dN - dS, or (although it's not a great way to go), E[dN]/E[dS].

If the primary goal of your analysis is to estimate gene-wide dN/dS, then the standard FUBAR is probably not a great way to go because of the way the grid is defined (to look for positive/negative selection).

Best, Sergei

PS On your example JSON, the output for the script is

ds = 4.29592
dn = 3.74012
Weights
    dS > dN : 0.501926
    dS = dN : 0.0536029
    dS < dN : 0.444472

hildebra commented 4 years ago

Dear Sergei, thanks a lot for the script, I have been using this now to calculate dnds on my dataset. Since you mentioned that this might be suboptimal, do you suggest another method within hyphy (or elsewhere), if my primary interest would be gene-wide dnds? The site wise average ds and dn value from the json ["MLE"]["content"]["0"] table (col1,col2), would this be completely unuseable as estimates? thanks & sorry for the long follow up, Falk

spond commented 4 years ago

Dear Falk,

You can compete site-wise average, but some of those can also be infinite. Considering that you already have FUBAR estimates, I would suggest using scaled dN-dS and comparing it to 0.

import sys, json

with open(sys.argv[1],"r") as fh:
    fubar = json.load (fh)
    fubar_grid = fubar["grid"]
    fubar_post = fubar["MLE"]["content"]["0"]
    fubar_lengths = sum (k["Nucleotide GTR"] for (i,k) in fubar["branch attributes"]["0"].items())

global_diff = sum ((r[1]-r[0]) * r[2] for r in fubar_grid)

site_diff = sum (r[2] for r in fubar_post) / len (fubar_post)

print ("Global scaled dN-dS = %g" % (global_diff/fubar_lengths))
print ("Site-averaged scaled dN-dS = %g" % (site_diff/fubar_lengths))

Best, Sergei

hildebra commented 4 years ago

Thanks a lot Serei, I will try to run this on my data now! best, Falk

veg / hyphy

fubar - report neg. sites #1063