veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
205 stars 69 forks source link

Interpreting the results from dNdSRateAnalysis.bf #288

Closed PakornAiewsakun closed 9 years ago

PakornAiewsakun commented 9 years ago

Hi there,

I am currently having a problem with interpreting the results from dNdSRateAnalysis.bf. When I tried to export the results (Analyses -> Results -> Save results), I faced with 2 likelihood-function options: (I) 'nuc_lf' and (ii) 'lf'. The first option (nuc_lf) gave me a 'nucTree', and the second option (lf) gave me a 'givenTree', and the two are substantially different.

Could anyone tell me how the two options and the two trees differ from one another, as well as what the units of the branch lengths are for these two trees (i.e. nucleotide substitutions per site or codon substitutions per codon) ?

I was also trying to estimate the confidence intervals for the parameter estimated values, using the Latin Hypercub +SIR method. In addition to the problem of the likelihood-function options, I found that the MLEs, and the values shown in the trees are not the same. Could any one clarify this to me as well?

Any helps will be very much appreciated.

Best regards, P. Aiewsakun, PhD student, University of Oxford

spond commented 9 years ago

Dear P. Aiewsakun,

Please note that these analyses (from our 2005) have been superseded, for most purposes, with newer tools, such as FUBAR (for site-level detection).

Regarding your specific questions.

  1. You probably want to save lf, because it is the result of fitting a codon-substitution model to the data. 2. All branch lengths in HyPhy are converted to expected substitutions per unit time per nucleotide site. 3. I am not sure what you mean by the "values shown in the tree" comment in your question about the Latin Hypercube + SIR resampler. The LHC sampler returns an approximate 'posterior' sample for each model parameter. You can condense those into a single values, by taking means, medians, etc.

Sergei

PakornAiewsakun commented 9 years ago

Dear Sergei,

Thank you so much for the answers. They are very helpful.

Briefly, what I want to do is to reconstruct a phylogeny that accounts for natural selection pressure, So, FUBAR seems like an inappropriate option to me.

I came across these two papers which seem to do what I have been looking for: (i) Wertheim & Kosakovsky Pond, MBE, 2011 and (II) Wertheim et al., JVI, 2013. So I tried to follow them. (I just realised that you are actually the author of these papers. If you don't mind, I would like to ask you some questions regarding these papers.)

In Wertheim et al., JVI, 2013, it was clear that BS-REL was used, so I could follow the paper quite easily. However, in Wertheim & Kosakovsky Pond, MBE, 2011, it was unclear to me how the branch lengths were re-estimated. Given the terminology used, I guessed dNdSRateAnalysis.bf was employed. However, since you mentioned that dNdSRateAnalysis.bf has already been superseded by other methods, I guess I was wrong. Could you please tell me which method you used to re-estimate the branch length in Wertheim & Kosakovsky Pond, MBE, 2011?

Regarding the likelihood function options, now I know that I should use the 'lf' option; however could you please tell me the differences between 'nuc_lf' and 'lf', and how the two trees return from these two differ from one another? Alternatively, could you please tell me where to look for the answers of these questions? I have been searching, but couldn't find any information about them.

Regarding the Latin Hypercube + SIR resampler, here is a snap shot of the tree returned from the dNdSRateAnalysis.bf analyses.

Tree nucTree=(CEAVCG:0.213...

However, the LHS resampler showed that the lower bound, the MLE, and the upper bound for nucTree.CEAVCG.t is 0.319, 0.447, and 0.648, for example. Clearly, these are very different from the estimated CEAVCG branch length, which is 0.213. Does this mean that the paramter 'nucTree.CEAVCG.t' is actually not the branch length but something else? If this is the case, could you please tell me what module I can use to the estimate the confidence intervals of the branch lengths?

Best regards, P. Aiewsakun

spond commented 9 years ago

Dear P. Aiewsakun,

  1. Please use aBSREL for all analyses of this sort. This is our best current method implementation, described in http://mbe.oxfordjournals.org/content/32/5/1342.abstract, and in HyPhy and on test.datamonkey.org. Note that aBSREL is not meant for topology inference, but rather for fitting a complex evolutionary model to a predefined topology, e.g. to estimate branch lengths.
  2. _nuclf is an auxiliary function used by the analysis to estimate initial branch length values. It is not meant to be used for anything else.
  3. nucTree is the tree with nucleotide-based branch lengths. You probably want to look at givenTree which is the one with codon-based branch lengths. Also nucTree.CEAVCG.t is not the branch length, but rather a time parameter. The branch length is a function of the time parameter, nucleotide substitution biases, and equilibrium frequencies.

Sergei

PakornAiewsakun commented 9 years ago

Dear Sergei,

Thank you very much for your quick replies. They really help to clarify many of my confusions.

Just to be sure that I have understood everything correctly. So, is aBSREL the adaptive version of BS-REL, which can be found in Hyphy 2.2.4 by selecting Positive Selection option -> BranchSiteREL.bf -> Universal genetic code -> Run the adaptive version of BS-REL -> Both alpha and beta vary along branch-site combinations (I guess this means allowing both synonymous and nonsynonymous substitution rates to vary both among branches and sites; is this correct)?

Also, could you please tell me how to sample posterior branch lengths from aBSREL outputs? The saved results from the LHS sampling seem to contain only 'baselineTree.[node-name].nonsyn', 'baselineTree.[node-name].syn', and 'substitution rate' parameters, but not branch lengths.

Best regards, P. Aiewsakun

spond commented 9 years ago

Hi P. Aiewsakun,

Your settings are correct except you need to choose [No] Alpha varies from branch to branch, while omega varies among branch-site combinations for the Allow branch-site variation in synonymous rates? dialog.

You process the output of aBSREL by

  1. Executing the .fit file in HyPhy (it is simply a batch file)
  2. Running LHC on the file
  3. Running a post-processor script on the output samples; it is not a part of the standard HyPhy distribution. I will need to locate it (I am out of the country at the moment) and post it. Stay tuned. You can do steps 1 and 2 (they are time consuming) in the meantime.

Sergei

PakornAiewsakun commented 9 years ago

Thank you very much.

P. Aiewsakun