ncbi / vadr

Viral Annotation DefineR: classification and annotation of viral sequences based on RefSeq annotation

VADR produces qualifier with invalid value #53

Open taltman opened 2 years ago

taltman commented 2 years ago

VADR produces features with the /exception qualifier, to specify ribosomal slippage:

>Feature NODE_1_length_19663_cov_257.252269
<1      12777   gene
                        gene    ORF1ab
<1      5687    CDS
5687    12777
                        product ORF1ab polyprotein
                        exception       ribosomal slippage
                        codon_start     3
                        protein_id      NODE_1_length_19663_cov_257.252269_1

But according to the INSDC specs, this is an invalid value:

https://www.insdc.org/documents/feature_table.html#7.3

Qualifier       /exception=
Definition      indicates that the coding region cannot be translated using
                standard biological rules
...
                - must not be used for ribosomal slippage, instead use join operator, 
                  e.g.: CDS   join(486..1784,1787..4810)
                              /note="ribosomal slip on tttt sequence at 1784..1787"

This causes problems when trying to submit genomes annotated with VADR to ENA.

taltman commented 2 years ago

This is what is desired:

Qualifier       /ribosomal_slippage
Definition      during protein translation, certain sequences can program
                ribosomes to change to an alternative reading frame by a 
                mechanism known as ribosomal slippage 
Value format    none 
Example         /ribosomal_slippage 
Comment         a join operator,e.g.: [join(486..1784,1787..4810)] should be used 
                in the CDS spans to indicate the location of ribosomal_slippage 

nawrockie commented 2 years ago

That's actually a different format than the .tbl file that vadr creates, despite them both being (confusingly) called feature tables. The format of vadr output 'feature tables' is described here: https://www.ncbi.nlm.nih.gov/genbank/feature_table/

The vadr format is a useful file format for submissions to GenBank. There may be ways to convert it to a format that ENA accepts for submissions, but I'm not sure what those formats are. The vadr .ftr output files may also be relatively easy to parse and reformat into an accepted ENA format.
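As a rough illustration of the kind of reformatting discussed above, here is a minimal Python sketch that reads a 5-column .tbl block (rather than the .ftr files), joins multi-interval features with the INSDC `join` operator, and rewrites `exception ribosomal slippage` as `/ribosomal_slippage`. The function names `parse_tbl` and `to_insdc` are hypothetical and are not part of VADR or DARTH. The sketch assumes plus-strand features only, and its output is a list of location and qualifier strings, not a complete file that ENA would accept as-is.

```python
# Abbreviated sample based on the feature table in this issue (tab-separated).
SAMPLE_TBL = "\n".join([
    ">Feature NODE_1_length_19663_cov_257.252269",
    "<1\t12777\tgene",
    "\t\t\tgene\tORF1ab",
    "<1\t5687\tCDS",
    "5687\t12777",
    "\t\t\tproduct\tORF1ab polyprotein",
    "\t\t\texception\tribosomal slippage",
    "\t\t\tcodon_start\t3",
])


def parse_tbl(text):
    """Parse one 5-column feature table block into a list of feature dicts."""
    features = []
    for raw in text.splitlines():
        if not raw.strip() or raw.startswith(">Feature"):
            continue
        if raw[0].isspace():
            # Indented line: a qualifier belonging to the most recent feature.
            parts = raw.split(None, 1)
            value = parts[1].strip() if len(parts) > 1 else None
            features[-1]["qualifiers"].append((parts[0], value))
        else:
            fields = raw.split()
            if len(fields) >= 3:
                # start, stop, feature key: begins a new feature.
                features.append({
                    "key": fields[2],
                    "intervals": [(fields[0], fields[1])],
                    "qualifiers": [],
                })
            else:
                # start, stop only: an extra interval of the current feature.
                features[-1]["intervals"].append((fields[0], fields[1]))
    return features


def to_insdc(feature):
    """Return an INSDC-style location line followed by qualifier lines."""
    spans = []
    for start, stop in feature["intervals"]:
        if int(start.lstrip("<>")) > int(stop.lstrip("<>")):
            raise ValueError("minus-strand intervals are not handled in this sketch")
        spans.append(f"{start}..{stop}")
    location = spans[0] if len(spans) == 1 else "join(" + ",".join(spans) + ")"
    lines = [f"{feature['key']}  {location}"]
    for key, value in feature["qualifiers"]:
        if key == "exception" and value == "ribosomal slippage":
            # INSDC forbids /exception for slippage; use the dedicated qualifier.
            lines.append("/ribosomal_slippage")
        elif value is None:
            lines.append(f"/{key}")
        elif value.isdigit():
            lines.append(f"/{key}={value}")
        else:
            lines.append(f'/{key}="{value}"')
    return lines


for feature in parse_tbl(SAMPLE_TBL):
    print("\n".join(to_insdc(feature)))
```

For the CDS above, this yields the location `join(<1..5687,5687..12777)` with a valueless `/ribosomal_slippage` qualifier, which matches the form the INSDC definition asks for.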

taltman commented 2 years ago

Hi @nawrockie ,

So far I've used VADR to submit nine CoV genomes with remote homology to SARS-CoV-2 to ENA. The latest version of DARTH will have the scripts to turn VADR output into a format that can be submitted to ENA using their Webin-CLI tool.

I looked at the GenBank feature_table page, but it doesn't discuss ribosomal slippage or how to encode it correctly (the page seems more concerned with syntax). INSDC is a collaboration between GenBank, DDBJ, and EMBL, so why wouldn't the semantics defined for INSDC apply to GenBank?

nawrockie commented 2 years ago

VADR outputs the .tbl file format because that was preferred by the GenBank indexers for viral submissions at the time of development. In the GenBank submission pipeline, the vadr .tbl file is then converted to 'asn' format using tbl2asn, and the 'asn' file is used to load the data into the GenBank database.

> The latest version of DARTH will have the scripts to turn VADR output into a format that can be submitted to ENA using their Webin-CLI tool.

That sounds good. Have you finished developing the format conversion tool/script, or do you still have a problem with ribosomal_slippage?