iqtree / iqtree2

NEW location of IQ-TREE software for efficient phylogenomic software by maximum likelihood http://www.iqtree.org
GNU General Public License v2.0
231 stars 55 forks source link

Feature request: Generate MrBayes block from IQ-TREE #195

Open eosthusiast opened 3 months ago

eosthusiast commented 3 months ago

A very useful feature would be for IQ-TREE to automatically generate a full MrBayes command block with lset and prset parameters for all DNA and/or protein partitions specified in the partition file.

The feature would basically do something that has the functionality of PhyloSuite or partitNex2MrBayes, both of which are still running PartitionFinder2 and are very slow off the cluster.

Currently, IQ-Tree generates a charsets block for MrBayes in the best_model.nex file. The expanded feature would generate the relevant lset and prset commands for each partition to run the respective substitution model of each partition.

Specifically, this would apply for any runs with the -mset mrbayes command, but could also be generalised as suggested by @roblanf here:

"My suggestion (if we get enough interest to implement the feature) would be to do what I do in PartitionFinder, which is to always output the MrBayes block, and just substitute models if they are not available in MrBayes. E.g. in PF2 if there's a model which isn't in MrBayes, for DNA I just substitute GTR. This is because overparameterisation is less of a problem in Bayesian setting (since uncertainty gets integrated out anyway), so setting simpler models to GTR is generally fine.

Either way, this provides a template MrBayes block which is easy to modify." https://github.com/iqtree/iqtree2/discussions/191#discussioncomment-9412801

Below is an example using multiple DNA partitions

The current output best_model.nex file looks like this:

#nexus
begin sets;
charset Subset1 = 1-795 2076-2823;
charset Subset2 = 796-1753;
charset Subset3 = 2887-3503 1754-1946 2824-2886 4249-5247 3504-3596 3662-4171;
charset Subset4 = 5248-5451 5519-5655 5722-5865 1947-2075;
charset Subset5 = 5656-5721 5452-5518 4172-4248 3597-3661;
  charpartition mymodels =
    GTR{1.16415,2.55034,1.69295,0.622315,3.64637}+F{0.250129,0.214643,0.218113,0.317114}+I{0.359256}+G4{1.5706}: ITS{3.17254},
    SYM{0.865421,2.75866,0.464508,1.36401,7.50627}+FQ+I{0.337416}+R2{0.44208,0.465724,0.220504,3.60136}: RBP1-1{8.99614},
    HKY{3.05489}+F{0.264998,0.223072,0.178098,0.333832}+I{0.0657102}+G4{0.689527}: TIntron1{47.8366},
    GTR{0.0001,1.44403,1.41316,0.0001,17.7757}+F{0.263499,0.27783,0.249702,0.208969}+I{0.23235}+G4{0.134421}: Tef-2{3.03846},
    HKY{5.12509}+F{0.251637,0.219448,0.168646,0.360268}+I{0.0674745}+G4{1.15428}: TIntron2{35.6943};
end;

The desired output best_model.nex file from this new feature would be something like this (note: the substutition models I specified in the examples may not be accurate, they just serve to illustrate the structure of the MrBayes block):

#nexus
begin mrbayes;
charset Subset1 = 1-795 2076-2823;
charset Subset2 = 796-1753;
charset Subset3 = 2887-3503 1754-1946 2824-2886 4249-5247 3504-3596 3662-4171;
charset Subset4 = 5248-5451 5519-5655 5722-5865 1947-2075;
charset Subset5 = 5656-5721 5452-5518 4172-4248 3597-3661;

partition PartitionFinder = 5:Subset1, Subset2, Subset3, Subset4, Subset5;
set partition=PartitionFinder;

lset applyto=(1) nst=6 rates=invgamma;
lset applyto=(2) nst=6 rates=invgamma;
lset applyto=(3) nst=6 rates=invgamma;
prset applyto=(3) statefreqpr=fixed(equal);
lset applyto=(4) nst=6 rates=invgamma;
lset applyto=(5) nst=2 rates=invgamma;

prset applyto=(all) ratepr=variable;

#### other user-specified MrBayes commands here
end;

I have not used protein models myself so can't provide an example for this use case. Is someone else able to amend an example for proteins please?

roblanf commented 3 months ago

This is good enough to start with, thanks @eosthusiast. If you can get 10 thumbs up, we'll add to our list of features to consider. (No promises though, sorry! There's a long list of features people would like, and there's no way we can do all of them; and we have our own research to do too. Nevertheless, we really do want to support users, and this approach seems like a reasonably fair way to do so).

roblanf commented 3 months ago

I'll also just add here for my own notes that in https://github.com/iqtree/iqtree2/discussions/191#discussioncomment-9412801 I outline the relevant functions of PartitionFinder2 that can be adapted, because they already do a lot of the translation into MrBayes' input format.

protokute commented 3 months ago

This is an example of a MrBayes block that I have with all important priors and parameters:

begin mrbayes;

charset part1 = 1-269 427-691; charset part2 = 270-426 1626-1807\3 2286-2356\3 3198-3521\3; charset part3 = 692-1625 1628-1807\3 1870-2006\3 1871-2006\3 2072-2217\3 2288-2356\3 2898-3143\3 2899-3143\3 3197-3521\3; charset part4 = 1627-1807\3 1872-2006\3 2071-2217\3 2287-2356\3 2897-3143\3; charset part5 = 1808-1869 2007-2069; charset part6 = 2070-2217\3 2357-2896; charset part7 = 2218-2285 3144-3196 3199-3521\3; partition Favolus = 7: part1, part2, part3, part4, part5, part6, part7; set partition = Favolus; outgroup Trametes_conchifer_FP106793sp;

lset applyto=(1) nst=2 rates=gamma ngammacat=4; prset applyto=(1) revmatpr=fixed(1, 3.345048737, 1, 1, 3.345048737, 1) shapepr=fixed(0.477033) statefreqpr=dirichlet(0.224266,0.226241,0.237902,0.311591);

lset applyto=(2) nst=2 rates=propinv ngammacat=4; prset applyto=(2) revmatpr=fixed(1, 9.142173815, 1, 1, 9.142173815, 1) pinvarpr=fixed(0.930324);

lset applyto=(3) nst=6 rates=invgamma ngammacat=4; prset applyto=(3) revmatpr=fixed(1.299875016, 3.606095304, 1.996952633, 0.9918935694, 11.66345025, 1) statefreqpr=dirichlet(0.272392,0.205812,0.304472,0.217324) pinvarpr=fixed(0.700745);

lset applyto=(4) nst=2 rates=invgamma ngammacat=4; prset applyto=(4) revmatpr=fixed(1, 7.965182054, 1, 1, 7.965182054, 1) shapepr=fixed(1.77888) statefreqpr=dirichlet(0.117995,0.396033,0.257864,0.228108) pinvarpr=fixed(0.195018);

lset applyto=(5) nst=2 rates=propinv ngammacat=4; prset applyto=(5) revmatpr=fixed(1, 2.875739469, 1, 1, 2.875739469, 1) statefreqpr=dirichlet(0.222328,0.214478,0.206865,0.35633) pinvarpr=fixed(0.106448);

lset applyto=(6) nst=2 rates=gamma ngammacat=4; prset applyto=(6) revmatpr=fixed(1, 4.363429921, 1, 1, 4.363429921, 1) statefreqpr=dirichlet(0.174006,0.290131,0.270597,0.265266) shapepr=fixed(0.368305);

lset applyto=(7) nst=6 rates=gamma ngammacat=4; prset applyto=(7) revmatpr=fixed(4.672533274, 12.22736915, 2.36996032, 2.36696048, 13.49454344, 1) statefreqpr=dirichlet(0.182876,0.299937,0.221483,0.295704) shapepr=fixed(2.64497);

end;

The lset parameters are manually added based on prior knowledge of the substitution models supported by MrBayes and IQTREE doesn't include it in its output, but I don't think it would be hard to implement this.

"fixed" priors means the values for the nucleotide substitutions are defined as constant and "drichlet" means that the prior distribution allows for variability in these parameters, guided by the Dirichlet distribution's shape parameters.

"revmat" refers to the "reversible rate matrix," which is a component of the model used to describe the rates at which one nucleotide changes into another in a nucleotide substitution model. The priors for these rates can be found in the .ckp output file as "ModelDNA!rates" of each partition.

"statefreq" refers to the equilibrium or stationary frequencies of the nucleotide states (A, C, G, T) in a DNA sequence. The priors for these can be found in best_model.nex output file.

"shapepr" refers to the prior distribution for the shape parameter of the gamma distribution. The prior can be found in both .ckp and best_model.nex output files.

"pinvar" refers to the proportion of invariable sites in the sequence alignment. The prior can be found in both .ckp and best_model.nex output files.

roblanf commented 3 months ago

Got it. Surely no true Bayesian would fix parameters? This seems like a dangerous route to go down. Even estimating the structure of the model in a frequentist framework and fixing that is a bit dodgy, and not really advised (though I know people do it, which is why we added that to PartitionFinder).

So, I think in terms of this feature we should avoid implementing anything which fixes parameters.

IntegerLimit commented 2 months ago

Is this MrBayes Block Output needed in a scenario where IQTree is run without partitions?

bqminh commented 2 months ago

Is this MrBayes Block Output needed in a scenario where IQTree is run without partitions?

Yes. It should work with and without partitioning..

roblanf commented 2 months ago

The output then should be a single partition with the appropriate model. One could in principle get the same situation (i.e. one partition) by starting with partitions and running MFP+MERGE, if the MERGE happens to group together all partitions into one.

IntegerLimit commented 2 months ago

Should the block still be outputed when an non-time reversible model is used?

So far, I cannot find any documentation on MrBayes' support of non-time reversible models, but it may be hidden in the documentation somewhere.

IntegerLimit commented 2 months ago

Also, for replacing the +R modifier:

roblanf commented 2 months ago
  1. My opinion is that the block should be output when a non-reversible model is used, or when ANY model is used, regardless of whether the models are in MrBayes. In these cases, one needs to pick a substitute model that IS in MrBayes. My suggestion here is just to default to the GTR+I+G model, and also put a warning in the output (and ideally in the MrBayes block if it's possible to add comments).

  2. As above, if the rate distribution isn't in MrBayes, I think you're right: just replace it with +I+G, print a warning, and ideally add that warning to the MrBayes block too if you can. (to be clear, either +R or +I+R should be replaced with +I+G, because typically selecting any of these models means that +I+G will be the best rate distribution avaialble in MrBayes.

You can see a lot of these substitutions by studying the python code of PartitionFinder too - the relevant functions are noted in https://github.com/iqtree/iqtree2/discussions/191#discussioncomment-9412801