iqtree / iqtree2

NEW location of IQ-TREE software for efficient phylogenomic software by maximum likelihood http://www.iqtree.org
GNU General Public License v2.0
246 stars 57 forks source link

Support Exporting Analysis to MrBayes #267

Open IntegerLimit opened 4 months ago

IntegerLimit commented 4 months ago

Introduction

This PR implements https://github.com/iqtree/iqtree2/issues/195 for all MrBayes supported sequence types. (DNA/RNA, Protein, Binary, Morphological & Codon) This PR is complete, excluding any changes required further from suggestions and bug fixes after testing has been conducted.

General Implementation Details

Mapping of Heterogeneity Rates

+G -> Mapped to shapepr (Category amount mapped to ngammacat) +I -> Mapped to pinvarpr +R -> Mapped to +G and +I, retrieving values from the checkpoint file. Where this information is not available in the checkpoint file, leaves it as the default

DNA Fallbacks

For DNA, MrBayes supports three models: F81 (nst=1), HKY (nst=2) and GTR (nst=6) (excluding their fixed frequency counterparts)

Therefore, when a model is used that is not supported in MrBayes, it will default to GTR, due to the lower impact of increased parameters when using Bayesian Inference. However, on the output file with the optimized parameters, the rate matrix will also be outputted. This effectively sets the initial values for the MrBayes to be correct to the model's restrictions, but this may not be the case after the run.

Protein Fallbacks

For Protein, when a model is used that is not supported by MrBayes, a default of GTR will be used. Then, in both files, there will be a rate and state frequency matrix of the model included. The rate matrix will be set to fixed, unless the model used by IQTree was GTR20, in which case dirichlet will be used.

Binary, Morphological and Codon Data Exclusions

Codon Implementation Details

Codon Models in MrBayes: Introduction

The basic structure for codon models in MrBayes is quite similar to mechanistic codon models in IQTree, following the same formulation of the model by Goldman & Yang 1994 and Muse & Gaut 1994. However, the settings for MrBayes are under different names, and most inputs cannot be ported directly, making it the most difficult model to port to MrBayes format.

Instead of using named models, such as MG or GY, MrBayes uses five main parameters:

(Source: MrBayes Manual (Chapter 6.1.3 & Appendix A), MrBayes help lset and help prset commands, IQTree Documentation on Substitution Models (Section on Codon Models))

Mechanistic Model Output

Nucleotide Substitution Model

For retrieving the nucleotide substitution model that should be used as input into MrBayes, the implemented code does the following:

This implementation means that GTR is not used, appropriate considering the inputs for Mechanistic Codon Models in IQTree (ds/dt ratio + ts/tv ratio).

Note that fix_kappa is only set to true under the Codon Models MGK and GY0K (which are the only models without a ts/tv input ratio). This can be shown through the initCodon function, which only calls initMG94 or initGY94 with fix_kappa as true for those two models. That input is then read into the fix_kappa field here for MG Models and here for GY Models.

TS/TV Rate Ratio

In IQTree, ModelCodon stores two values for the kappa, or the ts/tv rate ratio:

However, for the tratiopr input in MrBayes, there are two types of input available:

Since it is preferable to have variable inputs in Bayesian Inference, the preference would be to use the beta input. In general, the beta input should be in the form beta(ax, x), where a is the ts/tv ratio, and x represents how close to the ratio the data is. (Source 1)

Source 1: (From the MrBayes help prset page)

If you think it is likely that the transition/transversion rate ratio is 2.0, you can use a prior of the type beta(2x,x), where x determines how strongly the prior is concentrated on tratio values near 2.0.

Therefore, the ratio is figured out as below:

(Source: Usage of Kappa and Kappa2 in Four Functions)

DN/DS Rate Ratio

ModelCodon stores the dn/ds rate ratio in omega. Similar to ts/tv, we do not have the non-synonymous rate and the synonymous rate individually.

MrBayes, again similar to ts/tv, has two possible inputs for omegapr.

However, as we don't know the rates themselves, the output would simply be the first case in the ts/tv ratio calculation: dirichlet(omega, 1).

State Frequencies

This is simply retrieved from state_freq. However, the implementation also skips indices of state_freq where phylo_tree->aln->isStopCodon(i) returns true.

Omega Variation

Since IQTree has no values for Omega Variation (which variates the ds/dt ratio across codons, based on how they affect fitness), this has been set to equal. The user can still change this value themselves if they wish to do so.

This could be improved in the future, so that it and its other priors, are sourced from the gamma distribution.

Rate Heterogeneity Modifiers

MrBayes does not support +G or +I (or +R, although it doesn't support that in any sequence type) for Codon Models. This has been excluded, and a warning printed to the file and log.

Rate Matrix

This has also been discarded, as there is no input in MrBayes for the rate matrix.

Empirical Model Output

MrBayes does not support Empirical Codon Models, so when such a model is being used (or a mixture of Empirical + Mechanistic), a warning is printed to the log and file. However, a model is still outputted, with nst = 1, no kappa/omega values, and the provided state frequencies.

Codon Codes

Whilst IQTree uses Number IDs for its Codon Codes (CODON1, CODON2, etc.), MrBayes uses Text IDs. (vertmt, invermt, etc.) There is no clear documentation or description for most of the models, but below shows the final table to transfer from IQTree Codon Codes to MrBayes Codon Codes.

An XXX in the MrBayes column represents a code that MrBayes does not support.

IQTree MrBayes
CODON1 universal
CODON2 vertmt
CODON3 yeast
CODON4 mycoplasma
CODON5 invermt
CODON6 ciliate
CODON9 echinoderm
CODON10 euplotid
CODON11 universal
CODON12 XXX
CODON13 XXX
CODON14 XXX
CODON16 XXX
CODON21 XXX
CODON22 XXX
CODON23 XXX
CODON24 XXX
CODON25 XXX

If a code is used that MrBayes does not support, it defaults to the universal code, and prints a warning to the log and file.

Testing

The base cases, and some edge cases, have been tested across all data types and some models. The test case simply was to check the presence of the warnings, and that the output files import without errors into MrBayes.

However, further tests may need to be conducted on whether the input values into MrBayes produce an acceptable and reliable result.

A non exclusive list of tested scenarios are listed below.

IntegerLimit commented 4 months ago

The compilation issue on windows seems to stem from the workflow itself (as shown by its failure on https://github.com/iqtree/iqtree2/pull/269), possibly an update of a tool or library installed by the workflow (which hasn't been running on the master branch since https://github.com/iqtree/iqtree2/commit/4c9fd180e3210f93b9d1bc5fef2c2b1dc2f10496 for some reason).

I am investigating it now.

roblanf commented 4 months ago

Folks I think we need to meet about the implementation here. I am concerned that there seems to be a misunderstanding about what the parameters mean in MrBayes, and how Bayesian analysis works.

A quick example, the PR says

+G -> Mapped to shapepr

But this doesn't make sense. First, +G is a parameter estimated by ML, but shapepr is a prior on that parameter. So it's not clear what this mapping is doing. Second, in a Bayesian setting you cannot set your priors after looking at the data you're analysing. Then they're not priors.

Another thing you can't do in a Bayesian analysis is set any variables after looking at the data. To be on the safe side, I would even avoid initialising variables. Initial values of variables should be drawn straight from the priors, and since we have no business setting priors by looking the data, there is no output from IQ-TREE that should ever be used to establish priors for a Bayesian analysis.

Even the initial notion of setting the structure of the model isn't all that great of an idea, but I think that's OK since MrBayes doesn't allow for RJMCMC to integrate over model structures, so one has to pick a structure somehow, and ML seems like an OK method to do that (noting though, that even this isn't necessarily guaranteed to be good - ML with AIC/BIC is trying to choose a 'good' number of parameters to avoid overparameterisation, but Bayesian methods are much much more robust to overparameterisation).

IntegerLimit commented 4 months ago

Folks I think we need to meet about the implementation here. I am concerned that there seems to be a misunderstanding about what the parameters mean in MrBayes, and how Bayesian analysis works.

A quick example, the PR says

+G -> Mapped to shapepr

But this doesn't make sense. First, +G is a parameter estimated by ML, but shapepr is a prior on that parameter. So it's not clear what this mapping is doing. Second, in a Bayesian setting you cannot set your priors after looking at the data you're analysing. Then they're not priors.

Another thing you can't do in a Bayesian analysis is set any variables after looking at the data. To be on the safe side, I would even avoid initialising variables. Initial values of variables should be drawn straight from the priors, and since we have no business setting priors by looking the data, there is no output from IQ-TREE that should ever be used to establish priors for a Bayesian analysis.

Even the initial notion of setting the structure of the model isn't all that great of an idea, but I think that's OK since MrBayes doesn't allow for RJMCMC to integrate over model structures, so one has to pick a structure somehow, and ML seems like an OK method to do that (noting though, that even this isn't necessarily guaranteed to be good - ML with AIC/BIC is trying to choose a 'good' number of parameters to avoid overparameterisation, but Bayesian methods are much much more robust to overparameterisation).

Sorry, I didn't realise that the priors of Bayesian Inference models were so different to the values produced by ModelFinder and IQTree. Unfortunately, I can't be involved in any in-person meetings, but I can make any changes that are decided upon.

Thank you for your detailed response.