Gamma discretization in AliSim not consistent with tree estimation

During my research I have encountered what might be a bug or perhaps a problem in the model description in AliSim. The problem arises when simulating sequences specifying a discretized gamma distribution.

The following example is run on Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-148-generic x86_64) with

iqtree2 –version
IQ-TREE multicore version 2.3.6 for Linux x86 64-bit built Aug  1 2024
Developed by Bui Quang Minh, Nguyen Lam Tung, Olga Chernomor, Heiko Schmidt,
Dominik Schrempf, Michael Woodhams, Ly Trong Nhan, Thomas Wong

The following analysis uses the tree: (((Reference_1:0.5,Reference_2:0.5):0.49,Sample_1:0.79):0.01,Reference_3:1.0);

When simulating sequences using

iqtree2 --alisim ./alignment -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25}+G4{0.1} --length 100000 -af fasta -t ./tree.nwk -seed 1234

and then feeding those sequences into IQ-TREE and estimating branch lengths only with the following command

iqtree2 -s ./alignment.fa -te ./tree.nwk -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25}+G4{0.1} -seed 1234

I obtained a log likelihood ratio of 510.878 when comparing the likelihood under the true model (initial log-likelihood) to that obtained when estimating branch lengths (optimal log-likelihood). I can reject the true simulation model with strong statistical significance when using a likelihood ratio test. Furthermore, the branch length estimates appear biased in repeated simulations.

I have confirmed that the problem relates to the discretization of the gamma distribution, because if I instead simulate directly from the discretized distribution for alpha = 0.1 using the following

iqtree2 --alisim ./alignment_cat1 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 5.265193e-07 -seed 1234
iqtree2 --alisim ./alignment_cat2 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 1.078089e-03 -seed 1234
iqtree2 --alisim ./alignment_cat3 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 9.375338e-02 -seed 1234
iqtree2 --alisim ./alignment_cat4 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 3.905168e+00 -seed 1234

and concatenate the sequences, the likelihood ratio test is no longer significant (4.114). The simulation model then seems to match the estimation model in IQ-TREE.

The estimation in IQ-TREE seems to work well as I have also compared that to other programs such as RAxML and phangorn. I also tested alpha = 1 and found the log likelihood ratio to be significant (408.518), but at alpha = 10 the log likelihood ratio was no longer rejected (9.454). By reducing the rate variation by increasing the alpha parameter, the models between sequence simulation and tree estimation start to match each other better. The observations also seem to hold when I scale the tree to shorter branch lengths.

It seems that AliSim either has some error here or perhaps is not using the discretization using the mean as described in the documentation and stated in the log.

Thank you for your time. Please let me know if there are any other questions/things that need to be clarified.

Dear Maya, Thank you for reporting the issue. I'll have a look and will get back to you once I find something.

Nhan

On Thu, Oct 17, 2024 at 5:55 AM Maya Lemmon-Kishi @.***> wrote:

During my research I have encountered what might be a bug or perhaps a problem in the model description in AliSim. The problem arises when simulating sequences specifying a discretized gamma distribution.

The following example is run on Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-148-generic x86_64) with

iqtree2 –version IQ-TREE multicore version 2.3.6 for Linux x86 64-bit built Aug 1 2024 Developed by Bui Quang Minh, Nguyen Lam Tung, Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams, Ly Trong Nhan, Thomas Wong

The following analysis uses the tree:

(((Reference_1:0.5,Reference_2:0.5):0.49,Sample_1:0.79):0.01,Reference_3:1.0);

When simulating sequences using

iqtree2 --alisim ./alignment -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25}+G4{0.1} --length 100000 -af fasta -t ./tree.nwk -seed 1234

and then feeding those sequences into IQ-TREE and estimating branch lengths only with the following command

iqtree2 -s ./alignment.fa -te ./tree.nwk -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25}+G4{0.1} -seed 1234

I obtained a log likelihood ratio of 510.878 when comparing the likelihood under the true model (initial log-likelihood) to that obtained when estimating branch lengths (optimal log-likelihood). I can reject the true simulation model with strong statistical significance when using a likelihood ratio test. Furthermore, the branch length estimates appear biased in repeated simulations.

I have confirmed that the problem relates to the discretization of the gamma distribution, because if I instead simulate directly from the discretized distribution for alpha = 0.1 using the following

iqtree2 --alisim ./alignment_cat1 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 5.265193e-07 -seed 1234 iqtree2 --alisim ./alignment_cat2 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 1.078089e-03 -seed 1234 iqtree2 --alisim ./alignment_cat3 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 9.375338e-02 -seed 1234 iqtree2 --alisim ./alignment_cat4 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 3.905168e+00 -seed 1234

and concatenate the sequences, the likelihood ratio test is no longer significant (4.114). The simulation model then seems to match the estimation model in IQ-TREE.

The estimation in IQ-TREE seems to work well as I have also compared that to other programs such as RAxML and phangorn. I also tested alpha = 1 and found the log likelihood ratio to be significant (408.518), but at alpha = 10 the log likelihood ratio was no longer rejected (9.454). By reducing the rate variation by increasing the alpha parameter, the models between sequence simulation and tree estimation start to match each other better. The observations also seem to hold when I scale the tree to shorter branch lengths.

It seems that AliSim either has some error here or perhaps is not using the discretization using the mean as described in the documentation and stated in the log.

Thank you for your time. Please let me know if there are any other questions/things that need to be clarified.

— Reply to this email directly, view it on GitHub https://github.com/iqtree/iqtree2/issues/333, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZPMLGALZ7VN2AYR4XSM73Z32ZBDAVCNFSM6AAAAABQCDX6DSVHI2DSMVQWIX3LMV43ASLTON2WKOZSGU4TENZXHA2TEMA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

Hi @lemmonquiche, I figured out the issue is due to the Pseudo random generators as I was using the same random seed for both the random assignment of sequence sites to one of four rate categories and the simulation of substitutions. This bug has been fixed and will be included in the next release of IQ-TREE. In the meantime, could you kindly check if you encounter the same issue using this pre-release version? Many thanks!

Hi Nhan,

Thanks for the quick response and fix to AliSim. The initial results look much better. I will test it out some more in my simulation pipeline and will let you know if something comes up.

Maya

iqtree / iqtree2

Gamma discretization in AliSim not consistent with tree estimation #333