veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
218 stars 69 forks source link

Using HBL to report dN and dS values in batch absREL runs #509

Closed schraderL closed 7 years ago

schraderL commented 7 years ago

Hi and thanks for the great support! I am using Hyphy to test for signatures of positive selection across ~10 eukaryots in several thousand protein-coding genes. My pipeline is pretty straight-forward. I infer orthologs by a RBH-approach, align the sequences, clean the alignments and run absREL using the species tree. In a second run, I would like to run BUSTED and select a single branch as foreground. I now have 2 questions, one specific and one general:

1) I would like to look at dN and dS rates independently, but absREL only reports dN/dS . I am sure reporting dN and dS values is possible by editing the BranchSiteREL.bf using HBL. Unfortunately, I don't know how to do this and am hoping for some advice. I would eventually want to do the same in BUSTED. 2) Should I still run a nucleotide model selection step for each gene and test for recombination before running absREL? In particular the recombination is giving me some headache, because I am using the species tree and not the gene tree for the absREL tests.

spond commented 7 years ago

Dear @schraderL,

  1. Yes, it is possible to output dN or dS: do you have a specific definition you had in mind. The one I strongly recommend is on page 14 of http://www.hyphy.org/pubs/hyphybook2007.pdf

The easiest way to obtain these values is by running a post-processing script on .fit files output by aBSREL (or BUSTED); I can supply those once you give me the definition of what dS and dN mean:)

  1. Nucleotide model: no (just use GTR). Personally, I think that nucleotide model selection is a relic of the past -- the main concern is not to use a model that is too simple, and there is almost no computational benefit to fitting models simpler than GTR. Recombination could be a problem, yes. You can always run an SH test (ML tree vs a fixed topology -- the species tree) to look for strong outliers.

Best, Sergei

schraderL commented 7 years ago

Hi Sergei,

thanks for the reply! The definition from the handbook sounds good, though I wasn't really aware that there are different definitions for calculating dN and dS. I have written a post-processing script to filter the values from the .mglocal.fit files, but if I should rather use the .fit file, I'd be happy to know. In general, I would like to use the dN and dS values which are used to calculate the mean dN/dS for each branch in absREL.

Re: question 2: Thanks for that as well! It's good to know that I don't have to run model selections.

Best, Lukas

spond commented 7 years ago

Dear @schraderL,

The mean dN/dS can be estimated in two different ways

  1. Directly as the ratio by maximum likelihood under the local MGxREV model
  2. As the mean of the (optimized) hyper parameter distribution.

In neither case does aBSREL actually estimate dS and dN separately. Those may be derived from the MLE values of corresponding rate matrices. If you indicate which values you are interested in, I'll post a simple HBL script to post-process the appropriate file. Also: do you want the output as Newick with embedded values (one for dS and one for dN), or as a CSV file with three columns: node, dS, dN.

Best, Sergei

schraderL commented 7 years ago

I would be curious to try both ways of estimating dN/dS, just so I can try to understand the differences. Apart from that, a CSV with node, dS and dN would be best. Could you try to explain, why using the dN and dS values from the .mcglocal.fit files isn't appropriate? After all, these values appear to yield the same dN/dS. For example, in a run of absREL, the dN and dS estimates in .mcglocal.fit are dS=0.09418547299874075 dN=0.02019853991223301 based on that, the dN/dS for that branch is dN/dSa=0.2144549 and the mean dN/dS reported in the final output file for the same branch is dN/dS=0.2144549394841714

So this looks pretty good. I haven't checked all the branches for all the genes, so I don't know if there are instances where the results aren't consitent.

Also, some few branches in some genes show dS values of up to 9999.999 (same for some dN rates). My guess is these values are a product of and necessity for the modelling, because based on my understanding of dS it doesn't make sense to have values larger than 1.

Thanks, Lukas

spond commented 7 years ago

Dear @schraderL,

Sorry, I was incorrect in my previous post. The MG local model does estimate the synonymous rate, α, and the non-synonymous rate, β separately for each branch, so their ratio will indeed give you dN/dS.

The HBL script for post-processing aBSREL.fit files is attached.

aBSREL-extract-dS-dN.txt

Best, Sergei

schraderL commented 7 years ago

Hi Sergei,

thanks for the script!

best, lukas

sushidaren commented 1 year ago

Dear @schraderL,

Sorry, I was incorrect in my previous post. The MG local model does estimate the synonymous rate, α, and the non-synonymous rate, β separately for each branch, so their ratio will indeed give you dN/dS.

The HBL script for post-processing aBSREL.fit files is attached.

aBSREL-extract-dS-dN.txt

Best, Sergei

Dear Dr. Pond, This is Xiaojuan Li,I have calculate dn and ds using FitMG94, which reported trees scaled on expected number of synonymous and non synonymous substitutions per codon. But I also need trees scaled on ds and dn, and dn/ds,thus could you please help me how to extract dS, dN and dN/dS. Thank you for your consideration. Sincerely Xiaojuan Li