nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

`augur translate` produces genome annotations that fail validation in `augur export` #1205

Open joverlee521 opened 1 year ago

joverlee521 commented 1 year ago

Current Behavior

If the reference sequence provided to `augur translate` contains a gene name with invalid characters (e.g. spaces), the resulting output later fails schema validation in `augur export v2`.

The error message from `augur export v2` is not very informative:

Validating schema of 'auspice/zika.json'...
  .meta.genome_annotations {"nuc": {"end": 10769, "start": 1, "strand": "+"…} failed additionalProperties validation for false
  .tree {"name": "NODE_0000000", "node_attrs": {"div": 0…} failed oneOf validation for [{"$ref": "#/$defs/tree"}, {"type": "array", "minItems": 1, "items": {"$ref": "#/$defs/tree"}}]
    validation for arm 0: {"$ref": "#/$defs/tree"}
      .tree.children[…].branch_attrs.mutations {"nuc": ["T329C", "C1209G"], "Capsid Protein": […} failed additionalProperties validation for false
      .tree.children[…].branch_attrs.mutations {"nuc": ["G318T", "G438T", "C1233T", "C1416T", "…} failed additionalProperties validation for false
      .tree.children[…].branch_attrs.mutations {"nuc": ["G406A"], "Capsid Protein": ["A106T"]} failed additionalProperties validation for false
      .tree.children[…].branch_attrs.mutations {"nuc": ["T329C", "T762C", "G1170T", "G1458A", "…} failed additionalProperties validation for false
      .tree.children[…].branch_attrs.mutations {"nuc": ["A3C", "T411A", "T738C", "C858T", "G864…} failed additionalProperties validation for false
      .tree.children[…].branch_attrs.mutations {"nuc": ["T249C", "G416A", "C789T", "T2032C", "T…} failed additionalProperties validation for false
    validation for arm 1: {"type": "array", "minItems": 1, "items": {"$ref": "#/$defs/tree"}}
      .tree {"name": "NODE_0000000", "node_attrs": {"div": 0…} failed type validation for "array"
Validation of 'auspice/zika.json' failed.

------------------------
Validation of auspice/zika.json failed. Please check this in a local instance of `auspice`, as it is not expected to display correctly. 
------------------------
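
For readers puzzling over the `additionalProperties` lines in the log above: in JSON Schema, a key that matches neither a named property nor any `patternProperties` regex counts as an additional property, and `additionalProperties: false` rejects it. The sketch below mimics that rule using only the standard library. The regex is an assumption about which gene-name characters the export schema permits and has not been checked against the schema file; `rejected_keys` is a hypothetical helper, not part of Augur.

```python
import re

# Assumed pattern for permitted gene/CDS keys; the real export schema may differ.
ASSUMED_KEY_PATTERN = re.compile(r"^[a-zA-Z0-9*_-]+$")


def rejected_keys(obj, pattern=ASSUMED_KEY_PATTERN, fixed_keys=("nuc",)):
    """Return keys that match neither a fixed property nor the pattern.

    Under `additionalProperties: false`, these are the keys that fail validation.
    """
    return [k for k in obj if k not in fixed_keys and not pattern.search(k)]


# Mutations object taken from the validation log above.
mutations = {"nuc": ["G406A"], "Capsid Protein": ["A106T"]}
print(rejected_keys(mutations))  # ['Capsid Protein']
```

Under this assumed pattern the space in "Capsid Protein" is the only offending character, which is consistent with every failing object in the log carrying that key.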

Expected behavior

The aa-muts.json file produced from augur translate should be valid for augur export v2.
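
Until that holds, one way to catch the problem before the export step is to scan the node-data JSON for gene names that would not pass validation. This is a minimal sketch, not Augur code: it assumes the file has an `annotations` object keyed by gene name and a `nodes` object whose entries carry an `aa_muts` object keyed by gene name, and it reuses an assumed gene-name pattern. The function name and the example data are hypothetical.

```python
import json
import re

# Assumption: gene names must match this pattern to pass export validation.
ASSUMED_GENE_PATTERN = re.compile(r"^[a-zA-Z0-9*_-]+$")


def invalid_gene_names(node_data):
    """Collect gene names in translate-style node data that break the assumed pattern.

    Assumed layout:
    {"annotations": {gene: ...}, "nodes": {name: {"aa_muts": {gene: [...]}}}}
    """
    genes = set(node_data.get("annotations", {}))
    for node in node_data.get("nodes", {}).values():
        genes.update(node.get("aa_muts", {}))
    return sorted(g for g in genes if not ASSUMED_GENE_PATTERN.search(g))


# Hypothetical, hand-written stand-in for an aa-muts.json file.
example = json.loads("""
{
  "annotations": {
    "nuc": {"start": 1, "end": 10769, "strand": "+"},
    "Capsid Protein": {"strand": "+"}
  },
  "nodes": {"NODE_0000001": {"aa_muts": {"Capsid Protein": ["A106T"]}}}
}
""")
print(invalid_gene_names(example))  # ['Capsid Protein']
```

In practice the dictionary would come from `json.load` on the real file. An empty list means no gene name conflicts with the assumed pattern.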

How to reproduce

Steps to reproduce the current behavior:

  1. Add a space in a gene name for the zika tutorial reference.
  2. Run the zika tutorial build.
  3. See error in final export step.

Possible solution

Additional context

First saw this issue during Nextstrain office hours on 2023-04-27.

jameshadfield commented 1 year ago

Gene names really shouldn't have spaces in them as per general guidelines,

> Symbols contain only uppercase Latin letters and Arabic numerals, and punctuation is avoided, with an exception for hyphens in specific groups,

but Auspice can display them, so I don't see a problem with relaxing the schema. We should still strongly recommend short names without spaces, as Auspice only displays a gene name when there is enough space to draw it on top of the rendered CDS. We will shortly have the ability to export a display name and/or description for each gene/CDS, which may help with this.

jameshadfield commented 1 year ago

See also https://github.com/nextstrain/augur/pull/955