veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
216 stars 69 forks source link

How to estimate dn and ds of the branches with bootstrap? #773

Closed Changsx closed 6 years ago

Changsx commented 6 years ago

Hi, I'm using hyphy to estimate the ds and dn of the branches based on several multialigned sequences constraining in a topological structure. I hope to know how to batch the bootstrap replicates for the sequences during this operation? Many thanks.

spond commented 6 years ago

Dear @Changsx,

Sorry, I do not understand the question. What kind of bootstrap are you referring to here? Is the objective to estimate dN and dS on the same data, but multiple trees are generated by phylogenetic bootstrap? Or to obtain bootstrap estimates of dN and dS on a fixed topology?

Best, Sergei

Changsx commented 6 years ago

Dear Prof. Sergei,

I'm sorry that I didn't describe the question clearly. In fact, I'm calculating the substitution rate of a series mtDNA sequences according to the method in Mower et al. (2007) (https://link.springer.com/article/10.1186/1471-2148-7-135). To estimate the standard errors for ds and dn, they deploied 100 bootstrap replicates, I hope to know to perform this operation in Hyphy.

With my best regards, Chang

spond commented 6 years ago

Dear @Changsx,

I am not entirely sure what the paper did based on their description. HyPhy allows you to estimate errors in model parameters in several different ways

1). Bootstrap, both parametric (simulate under fitted model and reestimate) and non-parametric (shuffle alignment columns with replacement and reestimate). This is quite slow.

2). Profile likelihood. This is fast.

3). Importance sampling. This is fast compared to 1, but slow compared to 2.

HyPhy does not directly report dN and dS, but rather rate estimates (see section 1.4.3 in http://www.hyphy.org/pubs/hyphybook2007.pdf)

You can get the estimates for rates and some other quantities, by running the following (assuming your HyPhy installation has everything in /usr/local/lib/hyphy), obviously using the appropriate paths for the input file and the output.

HYPHYMP -p /usr/local/lib/hyphy/TemplateBatchFiles/AnalyzeCodonData.bf 

            +-------------------+
            |Choose Genetic Code|
            +-------------------+

    (1):[Universal] Universal code. (Genebank transl_table=1).
    (2):[Vertebrate mtDNA] Vertebrate mitochondrial DNA code. (Genebank transl_table=2).
    (3):[Yeast mtDNA] Yeast mitochondrial DNA code. (Genebank transl_table=3).
    (4):[Mold/Protozoan mtDNA] Mold, Protozoan and Coelenterate mitochondrial DNA and the Mycloplasma/Spiroplasma code. (Genebank transl_table=4).
    (5):[Invertebrate mtDNA] Invertebrate mitochondrial DNA code. (Genebank transl_table=5).
    (6):[Ciliate Nuclear] Ciliate, Dasycladacean and Hexamita Nuclear code. (Genebank transl_table=6).
    (7):[Echinoderm mtDNA] Echinoderm mitochondrial DNA code. (Genebank transl_table=9).
    (8):[Euplotid Nuclear] Euplotid Nuclear code. (Genebank transl_table=10).
    (9):[Alt. Yeast Nuclear] Alternative Yeast Nuclear code. (Genebank transl_table=12).
    (10):[Ascidian mtDNA] Ascidian mitochondrial DNA code. (Genebank transl_table=13).
    (11):[Flatworm mtDNA] Flatworm mitochondrial DNA code. (Genebank transl_table=14).
    (12):[Blepharisma Nuclear] Blepharisma Nuclear code. (Genebank transl_table=15).
    (13):[Chlorophycean mtDNA] Chlorophycean Mitochondrial Code (transl_table=16).
    (14):[Trematode mtDNA] Trematode Mitochondrial Code (transl_table=21).
    (15):[Scenedesmus obliquus mtDNA] Scenedesmus obliquus mitochondrial Code (transl_table=22).
    (16):[Thraustochytrium mtDNA] Thraustochytrium Mitochondrial Code (transl_table=23).
    (17):[Pterobranchia mtDNA] Pterobranchia Mitochondrial Code (transl_table=24).
    (18):[SR1 and Gracilibacteria] Candidate Division SR1 and Gracilibacteria Code (transl_table=25).
    (19):[Pachysolen Nuclear] Pachysolen tannophilus Nuclear Code (transl_table=26).

Please choose an option (or press q to cancel selection):1

Please specify a codon data file: (`/usr/local/lib/hyphy/TemplateBatchFiles/`) /Users/sergei/Dropbox/Work/Collaborations/rates4sites/CD2.nex 

______________READ THE FOLLOWING DATA______________
10 species:{HUMAN, CHIMP, BABOON, RHMONKEY, COW, PIG, HORSE, CAT, MOUSE, RAT};
Total Sites:561;
Distinct Sites:308

               +--------------------------+
               | Select a standard model. |
               +--------------------------+

    (GY94):Goldman-Yang 94 codon model. Shared or independent dS/dN Possible Rate heterogeneity (and HM spatial correlation).
    (GY94W9):Goldman-Yang 94 codon model with nucleotide equilibrium frequencies varying based on their position in the codon (6 extra parameters). Shared or independent dS/dN Possible Rate heterogeneity (and HM spatial correlation).
    (GY94CUSTOMF3X4):Goldman-Yang 94 codon model crossed with an arbitrary nucleotide model and the F3x4 frequency estimator. Shared or independent dS/dN Possible Rate heterogeneity (and HM spatial correlation).
    (MG94):Muse-Gaut 94 codon model. Local or global parameters. Possible Rate heterogeneity (and HM spatial correlation).
    (MG94X2):Muse-Gaut 94+custom nucleotide biases codon model. Separate rate heterogeneity distributions for synonymous and non-synonymous changes.
    (MG94W9):Muse-Gaut 94 codon model with nucleotide equilibrium frequencies varying based on their position in the codon (6 extra parameters). Local or global parameters. Possible Rate heterogeneity (and HM spatial correlation).
    (MG94CUSTOM):Muse-Gaut 94 crossed with a nucleotide substituion model. Possible Rate heterogeneity (and HM spatial correlation).
    (MG94CUSTOMFREQS):Muse-Gaut 94 crossed with a nucleotide substituion model; 9 frequency parameters(positional nucleotide frequencies) are estimated from the data. Possible Rate heterogeneity (and HM spatial correlation).
    (MG94CUSTOMF1X4):Muse-Gaut 94 crossed with a nucleotide substituion model, using a simpler F1x4 codon frequency estimator. Possible Rate heterogeneity (and HM spatial correlation).
    (MG94CUSTOMCF3X4):Muse-Gaut 94 crossed with a nucleotide substituion model, using a corrected CF3x4 codon frequency estimator. Possible Rate heterogeneity (and HM spatial correlation).
    (MG94WAA):Muse-Gaut 94 crossed with a nucleotide substituion model AND several predefined non-synonymous substitution classes. Possible Rate heterogeneity (and HM spatial correlation).
    (MG94WAAUSERFREQS):Muse-Gaut 94 crossed with a nucleotide substituion model AND several predefined non-synonymous substitution classes. 9 frequency parameters(positional nucleotide frequencies) are defined by user (i.e. you). Possible Rate heterogeneity (and HM spatial correlation).
    (MG94WAAFREQS):Muse-Gaut 94 crossed with a nucleotide substituion model AND several predefined non-synonymous substitution classes. 9 frequency parameters(positional nucleotide frequencies) are estimated from the data. Possible Rate heterogeneity (and HM spatial correlation).
    (MG94WAAF61):Muse-Gaut 94 crossed with a nucleotide substituion model AND several predefined non-synonymous substitution classes. 61 codon frequency parameters are estimated in the model. Possible Rate heterogeneity (and HM spatial correlation).
    (MG94WAAF61MULTIPLE):Muse-Gaut 94 crossed with a nucleotide substituion model, several predefined non-synonymous substitution classes AND 4 separate rates for 2 and 3 step synonynymous and non-synonymous substitutions. 61 codon frequency parameters are estimated in the model. Possible Rate heterogeneity (and HM spatial correlation).
    (MG94WEX):Muse-Gaut 94 crossed with a nucleotide substituion model AND an acceptance bias based on the exhangeability metric by Stolzfus and Yampolsky.
    (ECM):Kosiol, Holmes and Goldman Empirical Codon model estimated from Pandit alignments (MBE 24:1464)
    (ECM+F):Kosiol, Holmes & Goldman Empirical Codon model estimated from Pandit alignments with alignment specific F61 frequency parameters (MBE 24:1464)
    (ECM+F+OMEGA):Kosiol, Holmes & Goldman Empirical Codon model estimated from Pandit alignments with alignment specific F61 frequency parameters and dNdS parameter (MBE 24:1464)
    (ECM+MLFREQS):Kosiol, Holmes & Goldman Empirical Codon model estimated from Pandit alignments with 9 frequency parameters(positional nucleotide frequencies) estimated from the data.
    (ECM+F3X4):Kosiol, Holmes & Goldman Empirical Codon model estimated from Pandit alignments with F3x4 frequency estimator
    (LCAP):Linear Combination of Amino Acid Properties from Conant, Wagner and Stadler (MPE 42:298)
    (MEC):Mixed empirical and mechanistic codon model of Doron-Faigenboim and Pupko (MBE 24:388-397)

 Please type in the abbreviation for the model you want to use:mg94w9

            +-------------+
            |Model Options|
            +-------------+

    (1):[Local] All model parameters are estimated independently for each branch.
    (2):[Global] Model parameters are shared by all branches, branch lengths are estimated independently.
    (3):[Global w/variation] Model parameters are shared by all branches.  Rate variation across sites modelled by user-chosen distribution, whose parameters are estimated.
    (4):[Global w/variation+HM] Model parameters are shared by all branches.  Rate variation across sites modelled by user-chosen distribution, whose parameters are estimated; rates at adjacent sites are correlated via a simple Hidden Markov model with an autocorrelation parameter lambda.

Please choose an option (or press q to cancel selection):1

A tree was found in the data file:
((((PIG:0.147969,COW:0.213430):0.085099,HORSE:0.165787,CAT:0.264806):0.058611,((RHMONKEY:0.002015,BABOON:0.003108):0.022733,(HUMAN:0.004349,CHIMP:0.000799):0.011873):0.101856):0.340802,RAT:0.050958,MOUSE:0.097950)

Would you like to use it:(Y/N)?y

Current Max: -3513.1838     (99 % done) LF Evals/Sec: 1480    CPU Load: 1.644   
______________RESULTS______________
Log Likelihood = -3513.18378890045;
Tree givenTree=((((RAT:0.0728828,MOUSE:0.115982)Node4:0.250548)Node3:0.0315201,((RHMONKEY:0.00360663,BABOON:0.00194818)Node8:0.0249648,(HUMAN:0,CHIMP:0.00180503)Node11:0.0195965)Node7:0.109615)Node2:0.0626215,HORSE:0.200688,CAT:0.263076,(PIG:0.187982,COW:0.234068)Node16:0.106635);

**********Continue with result processing (y/n)?y

     Available Result Processing Tools
     ---------------------------------

    (1):View Results
    (2):Save Results
    (3):Variance Estimates
    (4):Save Covariance Matrix
    (5):Parametric Bootstrap
    (6):Nonparametric Bootstrap
    (7):Parameter Sampler
    (8):dN-dS for Branches
    (9):Syn and non-syn trees
    (10):Ancestral Sequences
    (11):Suzuki-Gojobori Derived Adapative Selection Tool

 Please type in the abbreviation for the tool you want to use (or press q to exit):5

How many data replicates should be generated?:100

Save detailed bootstrap results to: (`/Users/sergei/Development/hyphy/res/TemplateBatchFiles/`) /Users/sergei/Desktop/bootstrap.csv

Iteration 1/100 
Iteration 2/100 
...
Iteration 100/100

The output (bootstrap.csv) in my example is going to contain a line for each replicate with columns like

Iteration,Ln-likelihood,Syn_sites,NS_sites,Total_sites,givenTree.RAT.nonSynRate,givenTree.RAT.synRate,...,TotalLength(RAT),SynLength(RAT),NonSynLength(RAT),...

You can compute dS and dN for each branch by using the expression from the linked bookchapter. For example in the example here

dS (RAT) = SynLength(RAT) * Total_sites / Syn_sites

Best, Sergei

Changsx commented 6 years ago

Dear Prof. Sergei,

Thank you very much for your detailed and patient reply, With my best wishes~~

Chang