iqtree / iqtree2

NEW location of IQ-TREE software for efficient phylogenomic software by maximum likelihood http://www.iqtree.org
GNU General Public License v2.0
245 stars 57 forks source link

Composition test in IQ-TREE2 (and IQ-TREE) misleading #78

Closed lsjermiin closed 5 months ago

lsjermiin commented 2 years ago

I am concerned that the composition test in IQ-TREE and IQ-TREE2 misleads users because the p-value reported for each sequence is biased towards higher values. To follow what I mean, consider two sequences, S1 and S2, with the following compositions of nucleotides:

S1 S2 A 120 80 C 80 120 G 120 80 T 80 120

The expected values are

A 100 C 100 G 100 T 100

IQ-TREE2's chi2 test of compositional homogeneity uses the following equation for each sequence:

chi2 = \sum_{i=1}^k (O_i - E_i)^2 / E_i

In this case, chi2 = 16 for both S1 and S2. With df = 3 in each case, IQ-TREE2 returns a p-value of 0.0011 (or 0.11%) for each sequence.

The test is similar to those of Preparata and Saccone (1987) and Pearson (1900), in the sense that they all consider a difference between the observed and expected values.

Now using Pearson's (1900) chi2 test of homogeneity gave a different result (because it compares S1 to S2, bearing in mind the expected values). Pearson's (1900) test obtained chi2 = 32. With df = 3, we obtain a p-value of 5.233466e-07.

Given the difference between these two p-values, there is likely to be many cases where IQ-TREE2 will report a pass for the test, which should have been a fail.

To understand why there is a difference, we simply have to see how many comparisons are done for each chi2 test. For Pearson's (1900) test, there are four comparisons: S1_A vs E_A S1_C vs E_C S1_G vs E_G S1_T vs E_T S2_A vs E_A S2_C vs E_C S2_G vs E_G S2_T vs E_T

That is, chi2 is a sum of eight comparisons. For the test in IQ-TREE2 there are four comparison for each sequence:

S1_A vs E_A S1_C vs E_C S1_G vs E_G S1_T vs E_T That is, chi2 for S1 is a sum of four comparisons, and:

S2_A vs E_A S2_C vs E_C S2_G vs E_G S2_T vs E_T That is, chi2 for S2 is a sum of four comparisons.

In other words, we now know where the problem arise. We use df = 3 in both tests, and that is not appropriate for the test in IQ-TREE2.

I am not yet sure what the solution is, but in this case multiplying the chi2 value by 2 would solve the problem. I have no idea whether that solution is generally applicable but I doubt it.

References: Pearson, K. 1900. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Phil. Mag. Series 5. 50:157–175.

Preparata G., Saccone C. 1987. A simple quantitative model of the molecular clock. J. Mol. Evol. 26, 7–15.

bqminh commented 5 months ago

Sorry, but I'm not convinced to change the test. Your understanding is right, and it's explained in the FAQ (http://www.iqtree.org/doc/Frequently-Asked-Questions#what-is-the-purpose-of-composition-test). IQ-TREE is testing whether S1 composition is significantly different the from the average of S1 and S2 (in this case, or the average of all sequences in general case). Whereas you are suggesting to test whether S1 composition is different from S2 composition. But, how does it work with a multiple sequence alignment with many S? I'd just say that you have a different test than us.

Moreover, we wrote in the FAQ that:

This test should be regarded as an explorative tool which might help to nail down problems in a dataset. One would typically not remove failing sequences by default. But if the tree shows unexpected topology the test might point in direction of the origin of the problem....

So I don't see a problem with this test.