iqtree / iqtree2

NEW location of IQ-TREE software for efficient phylogenomic software by maximum likelihood http://www.iqtree.org
GNU General Public License v2.0
229 stars 55 forks source link

New bug in parsinomy code? #28

Closed roblanf closed 3 months ago

roblanf commented 3 years ago

@JamesBarbetti I think something untoward is happening in the parsimony code.

Using exactly the same data and commandline, I get very different outputs from the code compiled on 27 Apr versus 15 May.

Logs are pasted below, but the tl;dr is that I have two worries:

  1. New version doesn't tell me it's doing any optimisation (maybe you just removed print statements though)
  2. New version tells me a patently incorrect parsimony score (8207) when I know from previous versions of IQ-TREE and independent analyses (e.g. UShER) that the score is >500,000

A bug?

Rob

Log Files

27 Apr is as expected:

IQ-TREE multicore version 2.1.2 COVID-edition for Linux 64-bit built Apr 27 2021                                                            
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host:    c109762 (AVX2, FMA3, 1007 GB RAM)
Command: ../iqtree -n 0 -no-ml-dist -m JC -t usher/placed/binary-tree.nh -s aln_sampled.fa -parsimony-spr 100 -parsimony-nni 100 -spr-radius 40 --suppress-list-of-sequences -blfix -nt 100 -fast -pre iqtree_parsimony2
Seed:    328716 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Sat May 15 20:06:41 2021
Kernel:  AVX+FMA - 100 threads (256 CPU cores detected)

Reading alignment file aln_sampled.fa ... Fasta format detected
Alignment most likely contains DNA/RNA sequences
Alignment has 410000 sequences with 29628 columns, 29518 distinct patterns
23198 parsimony-informative, 2213 singleton sites, 4217 constant sites
Reading input tree file usher/placed/binary-tree.nh ... rooted tree
Before doing (up to) 100 rounds of parsimony SPR, parsimony score was 587794
Applied 1418 moves (out of 4030) (2009 still possible) in iteration 1 (parsimony now 586169) after 28 min 22 sec
Applied 132 moves (out of 297) (148 still possible) in iteration 2 (parsimony now 586019) after 46 min 48 sec
Applied 29 moves (out of 61) (31 still possible) in iteration 3 (parsimony now 585983) after 1 hrs 2 min 30 sec
Applied 3 moves (out of 7) (3 still possible) in iteration 4 (parsimony now 585980) after 1 hrs 17 min 45 sec
Applied 0 moves (out of 0) (0 still possible) in last iteration  (parsimony now 585980) (total SPR moves examined 17550096376)
Before doing (up to) 100 rounds of parsimony NNI, parsimony score was 585980
Applied 0 moves (out of 0) (0 still possible) in last iteration  (parsimony now 585980) (total NNI moves examined 819996)

NOTE: 392499 MB RAM (383 GB) is required!
Estimate model parameters (epsilon = 0.05)
1. Initial log-likelihood: -8.43037e+09
Optimal log-likelihood: -8.43037e+09
Rate parameters:  A-C: 1.00000  A-G: 1.00000  A-T: 1.00000  C-G: 1.00000  C-T: 1.00000  G-T: 1.00000
Base frequencies:  A: 0.250  C: 0.250  G: 0.250  T: 0.250
Parameters optimization took 1 rounds (19.3878 sec)
NOTE: 3.264 seconds to dump checkpoint file, increase to 66.000
BEST SCORE FOUND : -8430370020.728
Total tree length: 574019.970

Total number of iterations: 0
CPU time used for tree search: 0.282 sec (0h:0m:0s)
Wall-clock time used for tree search: 0.283 sec (0h:0m:0s)
Total CPU time used: 254665.764 sec (70h:44m:25s)
Total wall-clock time used: 6026.242 sec (1h:40m:26s)

Analysis results written to: 
  IQ-TREE report:                iqtree_parsimony2.iqtree
  Maximum-likelihood tree:       iqtree_parsimony2.treefile
  Screen log file:               iqtree_parsimony2.log

NOTE: 4.895 seconds to dump checkpoint file, increase to 98.000
Date and Time: Sat May 15 22:00:48 2021

But the very same command on the latest version (May 15th) gives me a parsimony score that is insane, and doesn't seem to do any optimisation at all (though maybe the last bit is just you removing print statemets?).

IQ-TREE multicore version 2.1.2 COVID-edition for Linux 64-bit built May 15 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host:    c109762 (AVX2, FMA3, 1007 GB RAM)
Command: ../iqtree2 -n 0 -no-ml-dist -m JC -t usher/placed/binary-tree.nh -s aln_sampled.fa -parsimony-spr 100 -parsimony-nni 100 -spr-radiu
s 40 --suppress-list-of-sequences -blfix -nt 100 -fast -pre iqtree_parsimony
Seed:    442874 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Sat May 15 17:48:43 2021
Kernel:  AVX+FMA - 100 threads (256 CPU cores detected)

Reading alignment file aln_sampled.fa ... Fasta format detected
Alignment most likely contains DNA/RNA sequences
Alignment has 410000 sequences with 29628 columns, 29518 distinct patterns
23198 parsimony-informative, 2213 singleton sites, 4217 constant sites
Reading input tree file usher/placed/binary-tree.nh ... rooted tree

NOTE: 392499 MB RAM (383 GB) is required!
Estimate model parameters (epsilon = 0.05)
1. Initial log-likelihood: -8.42783e+09
Optimal log-likelihood: -8.42783e+09
Rate parameters:  A-C: 1.00000  A-G: 1.00000  A-T: 1.00000  C-G: 1.00000  C-T: 1.00000  G-T: 1.00000
Base frequencies:  A: 0.250  C: 0.250  G: 0.250  T: 0.250
Parameters optimization took 1 rounds (23.3431 sec)
NOTE: 3.241 seconds to dump checkpoint file, increase to 65.000
Parsimony score of initial tree: 8671
BEST SCORE FOUND : -8427833300.853
Total tree length: 20.290

Total number of iterations: 0
CPU time used for tree search: 0.269 sec (0h:0m:0s)
Wall-clock time used for tree search: 0.269 sec (0h:0m:0s)
Total CPU time used: 384370.068 sec (106h:46m:10s)
Total wall-clock time used: 5422.123 sec (1h:30m:22s)

Analysis results written to: 
  IQ-TREE report:                iqtree_parsimony.iqtree
  Maximum-likelihood tree:       iqtree_parsimony.treefile
  Screen log file:               iqtree_parsimony.log

NOTE: 5.381 seconds to dump checkpoint file, increase to 108.000
Date and Time: Sat May 15 19:33:39 2021

Can you double check on some simple test cases?

Rob

roblanf commented 3 years ago

Just confirming that the tree from the new version seems not be optimised with parsimony at all. The branchlengths are ML branchlengths:

# Most common branchlengths from the May 15th version
> head(sort(a1, decreasing = TRUE))

       1e-06  3.37526e-05  6.75068e-05 0.0001012624 0.0001350196 0.0001687783 
      510267       183840        63644        28247        14205         7798 
# Most common branchlengths from the Apr 27th version
> head(sort(a2, decreasing = TRUE))

 1e-06      1      2      3      4      5 
509419 181840  62675  27831  13876   7583 
> 
JamesBarbetti commented 3 years ago

Hi Rob.

Yes, that looks awfully wrong. I’ll look at the code that executes when the initial tree is read from a file, And see if there’s something, that it used to do that it is now skipping over. Because it looks like it isn’t recalculating the tree parsimony properly. (I suspect that some parsimony code that needs to get called... isn’t getting called at all) (Sorry for slow reply: I was visiting Mum & Shopping & moving lots of pots to their winter spots!. I’ll get back to you (n Slack) later today.

Regards, James

On 16 May 2021, at 9:23 am, roblanf @. @.>> wrote:

@JamesBarbetti https://github.com/JamesBarbetti I think something untoward is happening in the parsimony code.

Using exactly the same data and commandline, I get very different outputs from the code compiled on 27 Apr versus 15 May.

Logs are pasted below, but the tl;dr is that I have two worries:

New version doesn't tell me it's doing any optimisation (maybe you just removed print statements though) New version tells me a patently incorrect parsimony score (8207) when I know from previous versions of IQ-TREE and independent analyses (e.g. UShER) that the score is >500,000 A bug?

Rob

Log Files

27 Apr is as expected:

IQ-TREE multicore version 2.1.2 COVID-edition for Linux 64-bit built Apr 27 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung, Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host: c109762 (AVX2, FMA3, 1007 GB RAM) Command: ../iqtree -n 0 -no-ml-dist -m JC -t usher/placed/binary-tree.nh -s aln_sampled.fa -parsimony-spr 100 -parsimony-nni 100 -spr-radius 40 --suppress-list-of-sequences -blfix -nt 100 -fast -pre iqtree_parsimony2 Seed: 328716 (Using SPRNG - Scalable Parallel Random Number Generator) Time: Sat May 15 20:06:41 2021 Kernel: AVX+FMA - 100 threads (256 CPU cores detected)

Reading alignment file aln_sampled.fa ... Fasta format detected Alignment most likely contains DNA/RNA sequences Alignment has 410000 sequences with 29628 columns, 29518 distinct patterns 23198 parsimony-informative, 2213 singleton sites, 4217 constant sites Reading input tree file usher/placed/binary-tree.nh ... rooted tree Before doing (up to) 100 rounds of parsimony SPR, parsimony score was 587794 Applied 1418 moves (out of 4030) (2009 still possible) in iteration 1 (parsimony now 586169) after 28 min 22 sec Applied 132 moves (out of 297) (148 still possible) in iteration 2 (parsimony now 586019) after 46 min 48 sec Applied 29 moves (out of 61) (31 still possible) in iteration 3 (parsimony now 585983) after 1 hrs 2 min 30 sec Applied 3 moves (out of 7) (3 still possible) in iteration 4 (parsimony now 585980) after 1 hrs 17 min 45 sec Applied 0 moves (out of 0) (0 still possible) in last iteration (parsimony now 585980) (total SPR moves examined 17550096376) Before doing (up to) 100 rounds of parsimony NNI, parsimony score was 585980 Applied 0 moves (out of 0) (0 still possible) in last iteration (parsimony now 585980) (total NNI moves examined 819996)

NOTE: 392499 MB RAM (383 GB) is required! Estimate model parameters (epsilon = 0.05)

  1. Initial log-likelihood: -8.43037e+09 Optimal log-likelihood: -8.43037e+09 Rate parameters: A-C: 1.00000 A-G: 1.00000 A-T: 1.00000 C-G: 1.00000 C-T: 1.00000 G-T: 1.00000 Base frequencies: A: 0.250 C: 0.250 G: 0.250 T: 0.250 Parameters optimization took 1 rounds (19.3878 sec) NOTE: 3.264 seconds to dump checkpoint file, increase to 66.000 BEST SCORE FOUND : -8430370020.728 Total tree length: 574019.970

Total number of iterations: 0 CPU time used for tree search: 0.282 sec (0h:0m:0s) Wall-clock time used for tree search: 0.283 sec (0h:0m:0s) Total CPU time used: 254665.764 sec (70h:44m:25s) Total wall-clock time used: 6026.242 sec (1h:40m:26s)

Analysis results written to: IQ-TREE report: iqtree_parsimony2.iqtree Maximum-likelihood tree: iqtree_parsimony2.treefile Screen log file: iqtree_parsimony2.log

NOTE: 4.895 seconds to dump checkpoint file, increase to 98.000 Date and Time: Sat May 15 22:00:48 2021 But the very same command on the latest version (May 15th) gives me a parsimony score that is insane, and doesn't seem to do any optimisation at all (though maybe the last bit is just you removing print statemets?).

IQ-TREE multicore version 2.1.2 COVID-edition for Linux 64-bit built May 15 2021 Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung, Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host: c109762 (AVX2, FMA3, 1007 GB RAM) Command: ../iqtree2 -n 0 -no-ml-dist -m JC -t usher/placed/binary-tree.nh -s aln_sampled.fa -parsimony-spr 100 -parsimony-nni 100 -spr-radiu s 40 --suppress-list-of-sequences -blfix -nt 100 -fast -pre iqtree_parsimony Seed: 442874 (Using SPRNG - Scalable Parallel Random Number Generator) Time: Sat May 15 17:48:43 2021 Kernel: AVX+FMA - 100 threads (256 CPU cores detected)

Reading alignment file aln_sampled.fa ... Fasta format detected Alignment most likely contains DNA/RNA sequences Alignment has 410000 sequences with 29628 columns, 29518 distinct patterns 23198 parsimony-informative, 2213 singleton sites, 4217 constant sites Reading input tree file usher/placed/binary-tree.nh ... rooted tree

NOTE: 392499 MB RAM (383 GB) is required! Estimate model parameters (epsilon = 0.05)

  1. Initial log-likelihood: -8.42783e+09 Optimal log-likelihood: -8.42783e+09 Rate parameters: A-C: 1.00000 A-G: 1.00000 A-T: 1.00000 C-G: 1.00000 C-T: 1.00000 G-T: 1.00000 Base frequencies: A: 0.250 C: 0.250 G: 0.250 T: 0.250 Parameters optimization took 1 rounds (23.3431 sec) NOTE: 3.241 seconds to dump checkpoint file, increase to 65.000 Parsimony score of initial tree: 8671 BEST SCORE FOUND : -8427833300.853 Total tree length: 20.290

Total number of iterations: 0 CPU time used for tree search: 0.269 sec (0h:0m:0s) Wall-clock time used for tree search: 0.269 sec (0h:0m:0s) Total CPU time used: 384370.068 sec (106h:46m:10s) Total wall-clock time used: 5422.123 sec (1h:30m:22s)

Analysis results written to: IQ-TREE report: iqtree_parsimony.iqtree Maximum-likelihood tree: iqtree_parsimony.treefile Screen log file: iqtree_parsimony.log

NOTE: 5.381 seconds to dump checkpoint file, increase to 108.000 Date and Time: Sat May 15 19:33:39 2021 Can you double check on some simple test cases?

Rob

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/iqtree/iqtree2/issues/28, or unsubscribe https://github.com/notifications/unsubscribe-auth/AP7O4DXMOB635HR363ZKSGTTN37AHANCNFSM446LHJLQ.

roblanf commented 3 years ago

@JamesBarbetti thanks!

The other odd thing I notice in the new code is that I thought you'd mentioned that you'd turned off all the ML stuff, but it's still calculating ML parameters and branch lengths. So a bug in which pathway (ML vs. Parsimony) is chosen seems likely?

Regardless, if the ML stuff can be turned off completely, it would definitely help. Right now the parsimony code for ~400K sequences uses ~130GB memory, but as soon as we shift to the likelihood part it increases to ~550GB. So switching off likelihood will buy us a lot more scale!

Rob

JamesBarbetti commented 3 years ago

Hi Rob.

You can turn off the ML stuff completely, by passing -no-ml and not passing in a -m parameter. (-no-ml-dist only turns off the calculation of the ML distance matrix, but -no-ml turns it all off entirely). Probably IQTree should complain (a warning, at least) if you’ve passed it -no-ml and a -m [model] parameter, but for now it doesn’t.

I’m gradually setting things up so you can still have the tree search (with perturbation, etc.) even with ML turned off. But there’s a lot of code that assumes ML is turned on (and doesn’t check), so that has been slow going. Feel free to experiment (with -no-ml, which ought to work even if -n isn’t zero, … but mightn’t...).

Regards, James

On 17 May 2021, at 9:54 am, roblanf @.***> wrote:

@JamesBarbetti https://github.com/JamesBarbetti thanks!

The other odd thing I notice in the new code is that I thought you'd mentioned that you'd turned off all the ML stuff, but it's still calculating ML parameters and branch lengths. So a bug in which pathway (ML vs. Parsimony) is chosen seems likely?

Regardless, if the ML stuff can be turned off completely, it would definitely help. Right now the parsimony code for ~400K sequences uses ~130GB memory, but as soon as we shift to the likelihood part it increases to ~550GB. So switching off likelihood will buy us a lot more scale!

Rob

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/iqtree/iqtree2/issues/28#issuecomment-841898003, or unsubscribe https://github.com/notifications/unsubscribe-auth/AP7O4DRRPVSE4EQXA7NBJT3TOBLL3ANCNFSM446LHJLQ.

bqminh commented 3 months ago

We won't continue support such high alignments (Alignment has 410000 sequences with 29628 columns, 29518 distinct patterns) in the main code of IQ-TREE. New version 2.3.4.cmaple should solve it.