nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

Validate annotations produced from ancestral + translate #951

Open corneliusroemer opened 2 years ago

corneliusroemer commented 2 years ago

I ran into a bug that took me a long time to figure out. augur export reported the following error:

Validating schema of 'auspice/monkeypox_global.json'...
        ERROR: 'nuc' is a required property. Trace: properties - meta - properties - genome_annotations - required
Validation of 'auspice/monkeypox_global.json' failed.

------------------------
Validation of auspice/monkeypox_global.json failed. Please check this in a local instance of `auspice`, as it is not expected to display correctly. 
------------------------

It turns out that export requires nuc annotations, which usually come in through aa_mut.json from augur translate.

I was reading annotations from a .gff file into translate, which is supported in theory. However, it's not actually possible to read in a nuc annotation in the current implementation.

It would have sped up debugging considerably if augur translate had warned me (or even errored) when it realised that the nuc annotation was missing.

I'd propose raising an error if nuc is not written to aa_mut.json:

[Error] Could not read in `nuc` annotations. Please check the annotation in your input file. For `.gff` the line needs to look like this:
MT903344.1  Genbank source  1   197233  .   +   .   locus_tag=nuc

Related to #881
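
A minimal sketch of the kind of check proposed above. It assumes the node-data JSON stores annotations as a dict keyed by feature name, each with integer start and end fields. The function name check_nuc_annotation is hypothetical and not part of augur:

```python
def check_nuc_annotation(node_data):
    """Return the 'nuc' annotation, or raise ValueError if it is missing.

    Hypothetical helper for illustration; not an augur function.
    """
    nuc = node_data.get("annotations", {}).get("nuc")
    if not isinstance(nuc, dict):
        raise ValueError(
            "Could not read in `nuc` annotations. "
            "Please check the annotation in your input file."
        )
    # Assumption: a usable entry carries integer 'start' and 'end' coordinates.
    for key in ("start", "end"):
        if not isinstance(nuc.get(key), int):
            raise ValueError(f"`nuc` annotation is missing an integer '{key}'.")
    return nuc


# Example: a node-data dict shaped like the aa_mut.json discussed here.
node_data = {"annotations": {"nuc": {"start": 1, "end": 197233, "strand": "+"}}}
check_nuc_annotation(node_data)
```

Running a check like this at the end of translate would surface the problem at the step that caused it, rather than later in export.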

huddlej commented 2 years ago

I think this issue arose as part of this Slack conversation. @corneliusroemer, am I correct in this?

jameshadfield commented 1 year ago

(1 year later...)

The annotations schema now requires 'nuc' to be present (d6246ca052478446f7179e230e842a34f93e4cd4); however, neither augur ancestral nor augur translate validates its output. Reading any node-data file (via NodeDataReader) with an "annotations" block will also validate it against the schema, although in this case the problem is still going to be first encountered in augur export v2.

Conceptually, we could have the annotations from ancestral define 'nuc' and those from translate define the CDSs, with the two merged in augur export. However, I think it's sensible to require translate to add a 'nuc' block, which is why I made it a required property. If augur export sees multiple annotations.nuc entries, it should really ensure they are the same length! (The JSON merging happens within NodeDataReader.)
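
The length check could look roughly like the sketch below. It assumes each node-data file has already been loaded as a dict and that nuc coordinates are 1-based and inclusive. check_consistent_nuc is a hypothetical name and does not reflect how NodeDataReader actually merges files:

```python
def check_consistent_nuc(node_data_list):
    """Return the shared nuc length, or None if no file defines 'nuc'.

    Raises ValueError if the files disagree. Hypothetical helper.
    """
    lengths = set()
    for node_data in node_data_list:
        nuc = node_data.get("annotations", {}).get("nuc")
        if nuc is not None:
            # Assumption: 1-based, inclusive coordinates.
            lengths.add(nuc["end"] - nuc["start"] + 1)
    if len(lengths) > 1:
        raise ValueError(f"Conflicting annotations.nuc lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else None
```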

mazeller commented 10 months ago

Just a note, I ran into this issue working on my PRRSV dataset (https://github.com/mazeller/NextClade_Datasets/tree/main/prrsv_yimim_v3). I needed to append the following line to my GFF manually.

DQ478308.1 Genbank source 1 603 . + . locus_tag=nuc

jameshadfield commented 10 months ago

> however I think it's sensible to require translate to add a 'nuc' block, which is why I made it a required property

As of 1d17699e960d3805a0a586d7ccf3e9a550d53ac9 (in master, but not yet released), augur translate will always export this. (I missed this issue when scanning; it's very similar to #953.)

> Just a note, I ran into this issue working on my PRRSV dataset (https://github.com/mazeller/NextClade_Datasets/tree/main/prrsv_yimim_v3). I needed to append the following line to my GFF manually.

P.S. Recent augur PRs (merged but not yet released) will fix this: we'll now read the nuc coords from the sequence-region pragma in your GFF ("##sequence-region DQ478308.1 1 603").
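
For illustration, reading coordinates from that pragma can be sketched as below. This is not the code from the augur PRs. nuc_from_sequence_region is a hypothetical name, and the returned dict shape (including the "+" strand) is an assumption:

```python
def nuc_from_sequence_region(gff_lines):
    """Return nuc coords from the first '##sequence-region' pragma, else None.

    Hypothetical helper; the GFF3 pragma has the form
    '##sequence-region seqid start end'.
    """
    for line in gff_lines:
        if line.startswith("##sequence-region"):
            parts = line.split()
            if len(parts) == 4:
                _, seqid, start, end = parts
                # Assumption: the whole genome is reported on the + strand.
                return {"seqid": seqid, "start": int(start), "end": int(end), "strand": "+"}
    return None


gff = [
    "##gff-version 3",
    "##sequence-region DQ478308.1 1 603",
]
nuc = nuc_from_sequence_region(gff)
```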