iqtree / iqtree2

NEW location of IQ-TREE software for efficient phylogenomic software by maximum likelihood http://www.iqtree.org
GNU General Public License v2.0
227 stars 54 forks source link

IQtree uses an overly stringent constraint tree parsers #226

Open corneliusroemer opened 1 month ago

corneliusroemer commented 1 month ago

For my SARS-CoV-2 Nextclade trees I make extensive use of constraint trees (see e.g. previous issues #102, #144, . The following are some thoughts on making constraint tree validation throw a more helpful error message.

I was struggling to debug the error ERROR: ERROR: Taxon in constraint tree does not appear in full tree:

iqtree2 -ntmax 4 -s builds/BA.2.86/masked_masked_without_recombinants-delim.fasta -m GTR -ninit 2 -n 2 -me 0.05 -nt AUTO -redo -czb -g c.nwk -ninit 1 -n 1 > builds/BA.2.86/masked_masked_without_recombinants-delim.iqtree.log
ERROR: ERROR: Taxon  in constraint tree does not appear in full tree
ERROR: Bad constraint tree (see above)

As my constraint tree has around a thousand external and internal nodes the error message is hard to action. It seems that I trigger an edge case that isn't considered in the code, as there's a double space ` afterTaxoninERROR: ERROR: Taxon in constraint tree does not appear in full tree`, it looks like in my case something is empty that shouldn't be empty.

I figured out a minimal reproducible example of a constraint tree that triggers this error:

(
  JN.1.18.5,
  (
    LS.1
  )something
);

When looking at it like this it's obvious: there's an unnecessary paren around LS.1, I should remove the ( ... )something. But as far as I know this is still valid newick.

In any case, interestingly, you throw a helpful error message when I get rid of the something.

Using this:

(
  JN.1.18.5,
  (
    LS.1
  )
);

throws ERROR: Redundant double-bracket ‘((…))’ with closing bracket ending at (line 5 column 1)

So there are two things I suggest:

a) throw a more helpful error message, similar to the redundancy one, in case of an internal node that is redundant and labelled (not redundant and unlabelled, this case you've got covered) b) consider allowing redundant internal nodes. All newick definitions/grammars I've come across do not forbid redundancy. E.g. https://phylipweb.github.io/phylip/newick_doc.html or https://en.wikipedia.org/wiki/Newick_format

Note: I'm using IQ-TREE multicore version 2.3.4 COVID-edition for MacOS Intel 64-bit built May 23 2024

corneliusroemer commented 1 month ago

I've added a rule to catch this issue early in my Newick formatting/linting utility: https://github.com/corneliusroemer/nwk-formatter (which helps me explain why certain constraint trees fail with IQtree)

bqminh commented 1 month ago

I see, thanks for pointing this out. Can you give us some simple input files to help debug this?