veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
210 stars 69 forks source link

Allowing for site-to-site synonymous substitution rates variations #392

Closed NY3 closed 8 years ago

NY3 commented 8 years ago

Hi there, I was hoping to run the MG94 model and/or M3 (in PAML) but to account for variations in synonymous substitution rates and was wondering how I would do that in Hyphy. I read in Pond and Muse (2005) that this was possible however I am having trouble finding documentation on how to carry out this analysis in Hyphy. I have tried running the following series of options from the Hyphy command line:

and these were the results I got: Log Likelihood = -10552.8023650997; Shared Parameters: RS_3=56.14384324424616 RS_1=0.2175699701179334 PS_2=0.9134942727111495 R=0.0003527551234002609 PS_1=0.856129382947568 c_scale=RS_1_PS1+1(1-PS_1)_PS_2+1_RS3(1-PS1)(1-PS_2)=1.016438661452195

I am having trouble understanding these results and thats why I am starting to think that maybe I ran a completely different analysis than what I thought.

Thank you for your help, NY

spond commented 8 years ago

Dear @NY3,

If you want to find individual sites subject to selection when there is synonymous rate variation, please run FUBAR or MEME (see this (tutorial)[https://github.com/veg/hyphy-tutorials/tree/master/selection]).

The models from Pond & Muse (2005) can be accessed through the following list of options


       /HYPHY 2.220150819beta(MP) for Darwin on x86_64\       
***************** TYPES OF STANDARD ANALYSES *****************

    (1) Basic Analyses
    (2) Codon Selection Analyses
    (3) Compartmentalization
    (4) Data File Tools
    (5) Miscellaneous
    (6) Model Comparison
    (7) Kernel Analysis Tools
    (8) Molecular Clock
    (9) Phylogeny Reconstruction
    (10) Positive Selection
    (11) Recombination
    (12) Selection/Recombination
    (13) Relative Rate
    (14) Relative Ratio
    (15) Substitution Rates

 Please select type of analyses you want to list (or press ENTER to process custom batch file):2

 ***************** FILES IN 'Codon Selection Analyses' ***************** 

    (1) Run a selection analysis.
    (2) Run a selection analysis using a general discrete bivariate (dN AND dS) distribution; the appropriate number of rate classes is determined automatically.
    (3) Use a series of LR tests to decide if dN and dS rate distributions are the same or different between two codon alignments.
    (4) Interpret bivariate codon rate analysis results.
    (5) Interpret analysis results.

 Please select the file you want to use (or press ENTER to return to the list of analysis types):1

/usr/local/lib/hyphy/TemplateBatchFiles/Please specify a codon data file::/Volumes/sergei-raid/Dropbox/Work/UserData/absrel/issue391.nex 

______________READ THE FOLLOWING DATA______________
5 species:{ARABIANCAMEL,WHITESOUTHERNRHINO,DONKEY_TRIM25_1,HORSE1STTRIM25,HORSETRIM25_2};
Total Sites:2097;
Distinct Sites:203

            +-------------------+
            |Choose Genetic Code|
            +-------------------+

    (1):[Universal] Universal code. (Genebank transl_table=1).
    (2):[Vertebrate mtDNA] Vertebrate mitochondrial DNA code. (Genebank transl_table=2).
    (3):[Yeast mtDNA] Yeast mitochondrial DNA code. (Genebank transl_table=3).
    (4):[Mold/Protozoan mtDNA] Mold, Protozoan and Coelenterate mitochondrial DNA and the Mycloplasma/Spiroplasma code. (Genebank transl_table=4).
    (5):[Invertebrate mtDNA] Invertebrate mitochondrial DNA code. (Genebank transl_table=5).
    (6):[Ciliate Nuclear] Ciliate, Dasycladacean and Hexamita Nuclear code. (Genebank transl_table=6).
    (7):[Echinoderm mtDNA] Echinoderm mitochondrial DNA code. (Genebank transl_table=9).
    (8):[Euplotid Nuclear] Euplotid Nuclear code. (Genebank transl_table=10).
    (9):[Alt. Yeast Nuclear] Alternative Yeast Nuclear code. (Genebank transl_table=12).
    (10):[Ascidian mtDNA] Ascidian mitochondrial DNA code. (Genebank transl_table=13).
    (11):[Flatworm mtDNA] Flatworm mitochondrial DNA code. (Genebank transl_table=14).
    (12):[Blepharisma Nuclear] Blepharisma Nuclear code. (Genebank transl_table=15).
    (13):[Chlorophycean mtDNA] Chlorophycean Mitochondrial Code (transl_table=16).
    (14):[Trematode mtDNA] Trematode Mitochondrial Code (transl_table=21).
    (15):[Scenedesmus obliquus mtDNA] Scenedesmus obliquus mitochondrial Code (transl_table=22).
    (16):[Thraustochytrium mtDNA] Thraustochytrium Mitochondrial Code (transl_table=23).

 Please choose an option (or press q to cancel selection):1

A tree was found in the data file:
((ARABIANCAMEL:0.1041474616932968,WHITESOUTHERNRHINO:0.03999117955575802):0.02324689613601063,(DONKEY_TRIM25_1:0.03274904943794905,HORSE1STTRIM25:0.01105380397595058):0.02784823679205511,HORSETRIM25_2:0.02168842234915624)

Would you like to use it:(Y/N)?y

            +--------------+
            |Branch Lengths|
            +--------------+

    (1):[Codon Model] Jointly optimize rate parameters and branch lengths (slow and thorough)
    (2):[Nucleotide Model] Estimate branch lengths once, using an appropriate nucleotide model with a global tree scaling factor (quick and dirty).
    (3):[Fixed (nucleotide)] Branch lengths (measured in expected substitutions per NUCLEOTIDE) are fixed at the values read from the tree string (quicker and dirtier).
    (4):[Fixed (codon)] Branch lengths (measured in expected substitutions per CODON) are fixed at the values read from the tree string (quicker and dirtier).

 Please choose an option (or press q to cancel selection):1

            +-------------------+
            |Rate Matrix Options|
            +-------------------+

    (1):[MG94] Standard Muse-Gaut 94 model.
    (2):[MG94xHKY85] MG94 with the transition/transverion ratio parameter kappa.
    (3):[MG94xREV] MG94 with 5 additional parameters for each type of nucleotide substitution ratio. (Recommended Rate Matrix)
    (4):[MG94xCustom] MG94 crossed with an arbitrary nucelotide reversible model, except F81.
    (5):[GY94] Goldman-Yang 94 model (similar to MG94xHKY85)
    (6):[MG94Multi] MG94 with multiple classes of non-synonymous substitutions in addition to being crossed with an arbitrary nucelotide reversible model, except F81.
    (7):[MG94NMulti] MG94 with numerical bias corrections for various amino-acid substitution in addition to being crossed with an arbitrary nucelotide reversible model, except F81.

 Please choose an option (or press q to cancel selection):3

            +----------------------+
            |Rate Variation Options|
            +----------------------+

    (1):[Run All] Run all available models.
    (2):[Run Custom] Choose some of the available models.

 Please choose an option (or press q to cancel selection):2

            +---------------------+
            |Rate Variation Models|
            +---------------------+

    (1):[Constant] Constant Rate Model: no rate variation across sites.
    (2):[Proportional] Proportional Variable Rates Model: dS and dN vary along the sequence, but dN = R*dS for every site
    (3):[Nonsynonymous] Non-synonymous Variable Rates Model: dS = 1 for every site, while dN is drawn from a given distribution.
    (4):[Dual] Dual Variable Rates Model: dS and dN are drawn from a bivariate distribution (independent or correlated components). Recommened model.
    (5):[Lineage Dual] Lineage Dual Variable Rates Model:  dS and dN are drawn from a bivariate distribution (independent or correlated components), plus each lineage has an adjustment factor for the E[dN]/E[dS].

 Please choose option 1, enter d to complete selection, enter q to cancel:1

    (2):[Proportional] Proportional Variable Rates Model: dS and dN vary along the sequence, but dN = R*dS for every site
    (3):[Nonsynonymous] Non-synonymous Variable Rates Model: dS = 1 for every site, while dN is drawn from a given distribution.
    (4):[Dual] Dual Variable Rates Model: dS and dN are drawn from a bivariate distribution (independent or correlated components). Recommened model.
    (5):[Lineage Dual] Lineage Dual Variable Rates Model:  dS and dN are drawn from a bivariate distribution (independent or correlated components), plus each lineage has an adjustment factor for the E[dN]/E[dS].

 Please choose option 2, enter d to complete selection, enter q to cancel:3

    (2):[Proportional] Proportional Variable Rates Model: dS and dN vary along the sequence, but dN = R*dS for every site
    (4):[Dual] Dual Variable Rates Model: dS and dN are drawn from a bivariate distribution (independent or correlated components). Recommened model.
    (5):[Lineage Dual] Lineage Dual Variable Rates Model:  dS and dN are drawn from a bivariate distribution (independent or correlated components), plus each lineage has an adjustment factor for the E[dN]/E[dS].

 Please choose option 3, enter d to complete selection, enter q to cancel:4

    (2):[Proportional] Proportional Variable Rates Model: dS and dN vary along the sequence, but dN = R*dS for every site
    (5):[Lineage Dual] Lineage Dual Variable Rates Model:  dS and dN are drawn from a bivariate distribution (independent or correlated components), plus each lineage has an adjustment factor for the E[dN]/E[dS].

 Please choose option 4, enter d to complete selection, enter q to cancel:d 

            +-------------------+
            |Distribution Option|
            +-------------------+

    (1):[Syn:Gamma, Non-syn:Gamma] Both syn and non-syn rates are drawn from the gamma distributions for all models.
    (2):[Syn:Gamma, Non-syn:Inv+Gamma] Syn and non-syn rates are drawn from the gamma distributions for Proportional and Nonsynonymous. For Dual and Local Dual, syn rates are drawn from the gamma distribution, and non-syn rates - from Inv+Gamma.
    (3):[Independent Discrete] Independent General Discrete Distributions (Recommended setting)
    (4):[Correlated Discrete] Correlated General Discrete Distributions
    (5):[Non-positive Discrete] General Discrete Distribution for dS, and dN, but constrained so that dN<=dS. Useful to perform a LRT for presence of selection in an alignment

 Please choose an option (or press q to cancel selection):3

            +---------------------+
            |Initial Value Options|
            +---------------------+

    (1):[Default] Use default inital values for rate distribution parameters.
    (2):[Randomized] Select initial values for rate distribution parameters at random.

 Please choose an option (or press q to cancel selection):1
<<Number of synonymous (and single variable rate modles) rate classes (permissible range = [2,32], default value = 3, integer)>>3
<<Number of non-synonymous rate classes (permissible range = [1,32], default value = 3, integer)>>3

/usr/local/lib/hyphy/TemplateBatchFiles/Save summary result file to::/Volumes/sergei-raid/Dropbox/Work/UserData/absrel/issue391.out 

RUNNING MG94xREV MODEL COMPARISONS on /Volumes/sergei-raid/Dropbox/Work/UserData/absrel/issue391.nex

########### 3x3 CLASSES ###########

+---------------------+---------------------+---------------+-------------------+-------------------+---------------+-----+-----------+
|       Model         |   Log Likelihood    | Synonymous CV |  NS Exp and CV    |  N/S Exp and CV   |    P-Value    | Prm |    AIC    |
+---------------------+---------------------+---------------+-------------------+-------------------+---------------+-----+-----------+
| Constant Rates      |         -4937.01838 |      N/A      |  0.45892, 0.00000 |  0.45892, 0.00000 |      N/A      |   13|    9900.04|
+---------------------+---------------------+---------------+-------------------+-------------------+---------------+-----+-----------+
| Var. N.Syn. Rates   |         -4904.41367 |      N/A      |  0.53968, 1.26192 |  0.53968, 1.26192 |      N/A      |   17|    9842.83|
+---------------------+---------------------+---------------+-------------------+-------------------+---------------+-----+-----------+
| Dual Variable Rates |         -4902.10446 |    0.73986317 |  0.49464, 1.25728 |  0.91240, 1.64439 |    0.32873528 |   21|    9846.21|
+---------------------+---------------------+---------------+-------------------+-------------------+---------------+-----+-----------+

Sergei

NY3 commented 8 years ago

Hi Sergei, thanks for your help but I am still having problems understanding these results; I've specified for 3 synonymous and 3 non-symouymous classes so I was expecting to see some thing of the form:

class 1: synonymous rate, non-synonymous rate, N/S, and proportion of sites in this class and so on for class 2 and 3.

Even under just the variations in non-synonymous rates shouldn't there be three classes each with different non-synonymous rates, N/S (omega) and proportions of sites in each class?

Thanks again for your time, NY

NY3 commented 8 years ago

Oh sorry I just looked at the output file and found what I was looking for! Just to clarify, this is one of my output files:

| Constant Rates | | Var. N.Syn. Rates |


Sample mean = 0.00166996201572312 (sample variance = 0.0001018052554746849)

Rate[0]= 0.00000000 (weight=0.9711001) Rate[1]= 0.00000063 (weight=0.0022371) Rate[2]= 0.06263263 (weight=0.0266628)

| Dual Variable Rates |


Sample mean = 1 (sample variance = 0)

Rate[0]= 0.28118011 (weight=0.0000000) Rate[1]= 1.00000000 (weight=0.6968643) Rate[2]= 1.00000000 (weight=0.3031357)


Sample mean = 0.001668490802045871 (sample variance = 0.0001016588951516069)

Rate[0]= 0.00000000 (weight=0.0862732) Rate[1]= 0.00000628 (weight=0.8872502) Rate[2]= 0.06280689 (weight=0.0264767)


Sample mean = 0.001668490802045871 (sample variance = 0.000101658895151607)

Rate[0]= 0.00000000 (weight=0.0000000) Rate[1]= 0.00000000 (weight=0.0601207) Rate[2]= 0.00000000 (weight=0.0261525) Rate[3]= 0.00000628 (weight=0.6182930) Rate[4]= 0.00000628 (weight=0.2689572) Rate[5]= 0.00002234 (weight=0.0000000) Rate[6]= 0.06280689 (weight=0.0184507) Rate[7]= 0.06280689 (weight=0.0080260) Rate[8]= 0.22336889 (weight=0.0000000)

So the first 3 rates under the Dual variable model are the 3 synonymous rates followed by the 3 non synonymous rates. And the last 9 rates are the combinations of all possible synonymous and nonsymonymous rates, but how are the weights for these (the last 9) rates calculated?

Also, would I be able to automate this selection process? If so, which TemplateBatchfile do I need to initially run to figure out the inputredirect options for my script?

Thanks very much for your help, NY