nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

augur clade only partially assigns clade information #1239

Open cimendes opened 1 year ago

cimendes commented 1 year ago

Current Behavior

When running the augur clades command, the JSON file produced contains only a partial list of assigned clades, with the remaining nodes showing as "unassigned". When using the --reference option, all branches are set to "unassigned".

Expected behavior

All branches should be assigned the correct clade information.

How to reproduce

I'm using the following docker container: quay.io/biocontainers/augur:22.0.2--pyhdfd78af_0

With the following command: augur clades --tree kilifi_H3N2_new_docker_timetree.nwk --mutations kilifi_H3N2_new_docker_nt_muts.json kilifi_H3N2_new_docker_aa_muts.json --clades clades_h3n2_ha.tsv --output-node-data test_clades.json

Here are all the input and output files: augur_clade_input_output.zip

with the test_clades.json having the following content:

{
  "branches": {
    "NODE_0000006": {
      "labels": {
        "clade": "3C.2a"
      }
    },
    "SRR11445940_A_HA_H3": {
      "labels": {
        "clade": "3C.2a1"
      }
    }
  },
  "generated_by": {
    "program": "augur",
    "version": "22.0.2"
  },
  "nodes": {
    "100734_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "100954_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "109275_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "109292_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "109342_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "109562_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "109630_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "109974_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "110108_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "115485_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "115722_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "115833_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "115863_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "116143_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "116165_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "116225_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "116281_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "116354_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "116389_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "124408_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "124728_A_HA_H3": {
      "clade_membership": "3C.2a"
    },
    "133124_A_HA_H3": {
      "clade_membership": "3C.2a"
    },
    "133619_A_HA_H3": {
      "clade_membership": "3C.2a"
    },
    "134526_A_HA_H3": {
      "clade_membership": "3C.2a"
    },
    "134927_A_HA_H3": {
      "clade_membership": "3C.2a"
    },
    "135010_A_HA_H3": {
      "clade_membership": "3C.2a"
    },
    "135156_A_HA_H3": {
      "clade_membership": "3C.2a"
    },
    "135379_A_HA_H3": {
      "clade_membership": "3C.2a"
    },
    "135553_A_HA_H3": {
      "clade_membership": "3C.2a"
    },
    "135676_A_HA_H3": {
      "clade_membership": "3C.2a"
    },
    "92804_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "93547_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "94414_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "99056_A_HA_H3": {
      "clade_membership": "unassigned"
    },
    "NODE_0000000": {
      "clade_membership": "unassigned"
    },
    "NODE_0000002": {
      "clade_membership": "unassigned"
    },
    "NODE_0000003": {
      "clade_membership": "unassigned"
    },
    "NODE_0000005": {
      "clade_membership": "unassigned"
    },
    "NODE_0000006": {
      "clade_membership": "3C.2a"
    },
    "NODE_0000007": {
      "clade_membership": "3C.2a"
    },
    "NODE_0000008": {
      "clade_membership": "3C.2a"
    },
    "NODE_0000010": {
      "clade_membership": "3C.2a"
    },
    "NODE_0000011": {
      "clade_membership": "3C.2a"
    },
    "NODE_0000012": {
      "clade_membership": "3C.2a"
    },
    "NODE_0000013": {
      "clade_membership": "3C.2a"
    },
    "NODE_0000016": {
      "clade_membership": "3C.2a"
    },
    "NODE_0000017": {
      "clade_membership": "3C.2a"
    },
    "NODE_0000018": {
      "clade_membership": "unassigned"
    },
    "NODE_0000019": {
      "clade_membership": "unassigned"
    },
    "NODE_0000020": {
      "clade_membership": "unassigned"
    },
    "NODE_0000021": {
      "clade_membership": "unassigned"
    },
    "NODE_0000023": {
      "clade_membership": "unassigned"
    },
    "NODE_0000025": {
      "clade_membership": "unassigned"
    },
    "NODE_0000028": {
      "clade_membership": "unassigned"
    },
    "NODE_0000029": {
      "clade_membership": "unassigned"
    },
    "NODE_0000030": {
      "clade_membership": "unassigned"
    },
    "NODE_0000032": {
      "clade_membership": "unassigned"
    },
    "NODE_0000033": {
      "clade_membership": "unassigned"
    },
    "NODE_0000034": {
      "clade_membership": "unassigned"
    },
    "NODE_0000035": {
      "clade_membership": "unassigned"
    },
    "SRR11445892_A_HA_H3": {
      "clade_membership": "3C.2a"
    },
    "SRR11445940_A_HA_H3": {
      "clade_membership": "3C.2a1"
    },
    "SRR11445941_A_HA_H3": {
      "clade_membership": "3C.2a"
    },
    "SRR13443360_A_HA_H3": {
      "clade_membership": "unassigned"
    }
  }
}
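
For a quick overview of how many nodes ended up in each clade, the node-data JSON can be tallied with a few lines of Python. This is a minimal sketch that assumes only the "nodes" / "clade_membership" layout shown above; the helper name summarize_clades is hypothetical, and the inlined excerpt stands in for reading test_clades.json from disk.

```python
import json
from collections import Counter


def summarize_clades(node_data):
    """Count nodes per clade_membership value in a node-data dict."""
    return Counter(
        attrs.get("clade_membership", "missing")
        for attrs in node_data.get("nodes", {}).values()
    )


# A small excerpt of the output above, inlined so the sketch is self-contained.
# For the real file, use json.load(open("test_clades.json")) instead.
example = json.loads("""
{
  "nodes": {
    "100734_A_HA_H3": {"clade_membership": "unassigned"},
    "124728_A_HA_H3": {"clade_membership": "3C.2a"},
    "NODE_0000006": {"clade_membership": "3C.2a"},
    "SRR11445940_A_HA_H3": {"clade_membership": "3C.2a1"}
  }
}
""")

counts = summarize_clades(example)
for clade, n in counts.most_common():
    print(f"{clade}\t{n}")
```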

joverlee521 commented 1 year ago

Hi @cimendes,

This is expected behavior of augur clades when the node does not have the amino acid and nucleotide mutations that match your clade definitions.

I suspect you need to update the coordinates within clades_h3n2_ha.tsv. Currently, it is an exact copy of the H3N2 clades.tsv from the seasonal-flu repo, which was created based on the seasonal-flu repo's reference.fasta and genemap.gff.

If you look at seasonal-flu's genemap.gff, its start/end coordinates differ from the coordinates listed for your reference in reference_h3n2_ha.gb.
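
To illustrate why mismatched coordinates lead to "unassigned", here is a deliberately simplified sketch. It is not augur's implementation: the clade definition, gene name, sequence, and the assign_clade helper are all hypothetical. It only shows that a clade defined by states at fixed sites stops matching once the sites are counted from a different start position.

```python
# Hypothetical clade definition: (gene, 1-based site) -> required state.
CLADE_DEFS = {
    "toy_clade": {("HA1", 3): "K", ("HA1", 5): "S"},
}


def assign_clade(gene_seqs, clade_defs, site_offset=0):
    """Return the first clade whose defining states all match, else 'unassigned'.

    site_offset mimics a reference whose gene starts at a different
    coordinate than the one the clade definitions were written against.
    """
    for clade, states in clade_defs.items():
        matched = True
        for (gene, site), alt in states.items():
            seq = gene_seqs.get(gene, "")
            idx = site - 1 + site_offset
            if not (0 <= idx < len(seq)) or seq[idx] != alt:
                matched = False
                break
        if matched:
            return clade
    return "unassigned"


seqs = {"HA1": "QAKLSTN"}  # made-up peptide, for illustration only

same = assign_clade(seqs, CLADE_DEFS)                     # coordinates agree
shifted = assign_clade(seqs, CLADE_DEFS, site_offset=2)   # coordinates shifted
print(same, shifted)
```

With matching coordinates the toy sequence is assigned toy_clade; with the shifted coordinates the same sequence falls through to "unassigned".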

joverlee521 commented 1 year ago

Also note that the --reference option is not a supported feature yet. You should have seen this warning when you tried to use this option.

That said, it is unexpected that using the --reference option affected your output. That sounds like a bug that should be fixed!

jrotieno commented 1 year ago

Just coming back to this issue:

  1. The samples we have are older H3N2s (2009-2015), and are just for training purposes. We wanted a good study, with raw reads available and some metadata.
  2. Here is a sample HA sequence: 109342_HA.fasta.zip
  3. From an explanation by @corneliusroemer, these older sequences should get the clade "unassigned", which is what happens when I use the Nextclade web version with the reference "CY163680".
  4. However, when I use the reference "EPI1857216", I get a 3C clade for that sample, which should be incorrect as the original paper reports clade 7.
  5. Shouldn't both references give the same clade output, or in which cases should one be used over the other?

joverlee521 commented 1 year ago

Hi @jrotieno, the issue you are running into is slightly different. Nextclade uses its own clade assignment algorithm, which is separate from the augur clades command.

As noted in the Clade assignment section:

Nextclade assigns the clade of the nearest reference node found during the Phylogenetic placement step.

Since the two references use different reference trees, they could potentially assign different clades to the same sample.
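
As a toy illustration of that point (not Nextclade's actual placement algorithm), the sketch below picks the reference node whose mutation set differs least from the query and returns that node's clade. The distance measure, node names, mutations, and clade labels are all assumptions made up for the example.

```python
def nearest_node_clade(query_muts, reference_nodes):
    """Return the clade of the node with the fewest differing mutations.

    Distance here is the size of the symmetric difference between mutation
    sets, a simplification chosen for illustration only.
    """
    best = min(reference_nodes, key=lambda node: len(query_muts ^ node["mutations"]))
    return best["clade"]


# Two hypothetical reference trees, flattened to lists of nodes.
tree_a = [
    {"name": "a1", "mutations": frozenset({"A10T"}), "clade": "unassigned"},
    {"name": "a2", "mutations": frozenset({"A10T", "G20C"}), "clade": "3C"},
]
tree_b = [
    {"name": "b1", "mutations": frozenset({"G20C"}), "clade": "3C"},
    {"name": "b2", "mutations": frozenset({"G20C", "C30T"}), "clade": "3C.2a"},
]

query = frozenset({"A10T"})  # made-up sample

clade_a = nearest_node_clade(query, tree_a)
clade_b = nearest_node_clade(query, tree_b)
print(clade_a, clade_b)
```

The same query lands on an "unassigned" node in the first tree and on a "3C" node in the second, simply because the second tree has no closer node to offer.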


in which cases should one be used over the other?

Others will definitely have more insight here, but older samples would require an older reference since they are aligned against the reference for mutation calling.