nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 129 forks source link

UFBoot values are parsed incorrectly by refine and interfere with ancestral resulting in odd annotations #856

Open ammaraziz opened 2 years ago

ammaraziz commented 2 years ago

Current Behavior
Augur version 14 introduced --tree-builder-args which allows the user to create a tree with bootstrap values. IQTree has two methods bootstrapping trees, traditional bootstrap and UltraFast Bootstrap (UFBoot). UFBoot is recommended to be run with the SH-aLRT test, in practice these two are nearly always run together. IQTree will annotate the tree with UFBoot/SH-aLRT:branch length, eg: 75.6/72:0.00056165.

This causes two issues:

  1. UFBoot values are replacing node names.
  2. UFBoot values are not being parsed by refine and not annotated in the branch_lengths.json

Number 1 is bad because of the silent nature of the issue, it interferes with ancestral resulting in repeating branch annotations that can be hard to notice.

As an example, refine output from a tree with UFBoot:

"nodes": {
    "0/45": {
      "branch_length": 1e-06
    },
    "0/58": {
      "branch_length": 0.00623938

And the same for ancestral:

  "nodes": {
    "0/45": {
      "muts": [],
      "sequence": "........truncated"
    },
    "0/58": {
      "muts": [
        "C137T",
        ....truncated
      ],

Note that the node names are replaced with UFBoot values.

Expected behavior
UFBoot values should be parsed like standard bootstrapping values and not interfere with node names.

How to reproduce

Augur tree was run with tree_builder_args = "-B 1000 -alrt 1000 -bnni" to produce the UFBoot annotated newick tree that is used in refine/ancestral steps.

Possible solution
I think the issue is that UFBoot values are treated as strings while standard bootstrap values are integers. Correctly parsing the UFBoot values would partially solve the issue.

Your environment: if running Nextstrain locally

ammaraziz commented 2 years ago

@huddlej Is there anything I can help with triaging this bug? I can dig deeper into the code to see where it fails if it helps.

tsibley commented 2 years ago

@ammaraziz Could you share a example input tree, i.e. what you're passing for the --tree option to augur refine or augur ancestral?

ammaraziz commented 1 month ago

Coming back to this .... 2 years later. My apologies for dropping the ball on replying. This issue still exists, and can cause very strange nextstrain visualisation: image

I'll generate a reproducible example.

ammaraziz commented 2 weeks ago

Hi @tsibley here is a repex:

augur align --sequence sequences.fasta --output aligned.fasta --reference-name "NC_007362.1" --nthreads 10

augur tree --alignment aligned.fasta --output tree.ufboot.nwk --nthreads 10 --override-default-args --tree-builder-args="-B 1000 -alrt 1000 -bnni" --nthreads 10

augur tree --alignment aligned.fasta --output tree.boot.nwk --nthreads 10 --override-default-args --tree-builder-args="-b 100" --nthreads 10

augur refine --tree tree.boot.nwk --alignment aligned.fasta --metadata metadata.csv --output-tree tree.boot.refined.nwk --output-node-data nodedata.boot.json --keep-root

augur refine --tree tree.ufboot.nwk --alignment aligned.fasta --metadata metadata.csv --output-tree tree.ufboot.refined.nwk --output-node-data nodedata.ufboot.json --keep-root

Inputs:

Outputs of interest:

Comparing the nodedata outputs. With traditional bootstrapping (correct parsing):

  "nodes": {
    "NC_007362.1": {
      "branch_length": 0.04489627
    },
    "NODE_0000000": {
      "branch_length": 0.001
    },
...

With ufboot (incorrect parsing, note the node name is the bootstrap value):

 "nodes": {
    "0/15": {
      "branch_length": 1e-06
    },
    "0/29": {
      "branch_length": 1e-06
    },
...

Digging into the code of refine.py is this chunk of code which reads the attributes into a dict: https://github.com/nextstrain/augur/blob/da1c89d4b3232aa977b69fb8df33d666532c9a56/augur/refine.py#L88-L96

The T object is created by read_tree function in utils.py which it's a wrapper around biopython: https://github.com/nextstrain/augur/blob/master/augur/utils.py#L68

Printing the Tree object from the biopython function identifies the issue - biopython is not correctly parsing nhx trees:

Tree(rooted=False, weight=1.0)
    Clade(branch_length=0.0)
        Clade(branch_length=0.04490861, name='NC_007362.1')
        Clade(branch_length=0.05721486, name='100/100')
            Clade(branch_length=0.02063323, name='100/100')
  ...

My conclusion is this is a biopython. Checking biopython github issues, I find this from 2017: https://github.com/biopython/biopython/issues/1464

I think the best thing to add documentation that ufboot is not supported in augur refine.

I'll try to get biopython to fix this issue.