ncbi / vadr

Viral Annotation DefineR: classification and annotation of viral sequences based on RefSeq annotation

VADR predicted nested genes, prevents submission to ENA #54

Open taltman opened 2 years ago

taltman commented 2 years ago

This seemed to anger the validation guards at ENA:

19094   20750   gene
                        gene    N
19094   20750   CDS
                        product nucleocapsid phosphoprotein
                        protein_id      NODE_1_length_10623_cov_925.238_7
19115   19838   gene
                        gene    N2
19115   19838   CDS
                        product nucleocapsid phosphoprotein 2
                        protein_id      NODE_1_length_10623_cov_925.238_8

Is this the intended behavior for VADR?

nawrockie commented 2 years ago

What exactly was the issue? Was it the protein_id values? If so, there's a --noprotid option that will get rid of them. If it's not that, let me know what the problem is; there may be a way around it.

nawrockie commented 2 years ago

Ah, I see from the title of the issue the problem is that they are nested. Can you send me the .minfo file used with v-annotate.pl?

taltman commented 2 years ago

Hi @nawrockie, I'm using the pan-Coronavirus model, version 1.3:

Please let me know if I misunderstood what you were asking for. Thanks!

nawrockie commented 2 years ago

It looks like the best-matching model for your sequence must be the NC_006577 model, because that is the only model with an N2 gene. The NC_006577 RefSeq has N2 nested within N, as shown in the .minfo file, so that's why vadr is annotating it in your sequence:

FEATURE NC_006577 type:"gene" coords:"28320..29645:+" parent_idx_str:"GBNULL" gene:"N"
FEATURE NC_006577 type:"CDS" coords:"28320..29645:+" parent_idx_str:"GBNULL" gene:"N" product:"nucleocapsid phosphoprotein"
FEATURE NC_006577 type:"gene" coords:"28342..28959:+" parent_idx_str:"GBNULL" gene:"N2"
FEATURE NC_006577 type:"CDS" coords:"28342..28959:+" parent_idx_str:"GBNULL" gene:"N2" product:"nucleocapsid phosphoprotein 2"
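
For anyone who wants to check a .minfo file for this kind of nesting before submitting, here is a minimal Python sketch. It is not part of VADR: the function names (`parse_minfo`, `find_nested`) are made up for illustration, it assumes FEATURE lines use the key:"value" layout shown above, and it treats a multi-segment coords value as a single span from its lowest to its highest position, which is an assumption about the format.

```python
import re

# The four FEATURE lines quoted above, copied verbatim.
MINFO_TEXT = '''\
FEATURE NC_006577 type:"gene" coords:"28320..29645:+" parent_idx_str:"GBNULL" gene:"N"
FEATURE NC_006577 type:"CDS" coords:"28320..29645:+" parent_idx_str:"GBNULL" gene:"N" product:"nucleocapsid phosphoprotein"
FEATURE NC_006577 type:"gene" coords:"28342..28959:+" parent_idx_str:"GBNULL" gene:"N2"
FEATURE NC_006577 type:"CDS" coords:"28342..28959:+" parent_idx_str:"GBNULL" gene:"N2" product:"nucleocapsid phosphoprotein 2"
'''

KEY_VALUE = re.compile(r'(\w+):"([^"]*)"')


def parse_minfo(text):
    """Return one dict per FEATURE line (hypothetical helper)."""
    features = []
    for line in text.splitlines():
        if not line.startswith("FEATURE "):
            continue
        fields = dict(KEY_VALUE.findall(line))
        # Assumption: every number in coords is a position; use the overall span.
        positions = [int(p) for p in re.findall(r"\d+", fields["coords"])]
        features.append({
            "type": fields["type"],
            "name": fields.get("gene", fields.get("product", "?")),
            "start": min(positions),
            "end": max(positions),
        })
    return features


def find_nested(features):
    """Return (type, inner_name, outer_name) for same-type features
    whose span lies inside another feature's span."""
    nested = []
    for inner in features:
        for outer in features:
            if inner is outer or inner["type"] != outer["type"]:
                continue
            same_span = (inner["start"], inner["end"]) == (outer["start"], outer["end"])
            if (outer["start"] <= inner["start"]
                    and inner["end"] <= outer["end"]
                    and not same_span):
                nested.append((inner["type"], inner["name"], outer["name"]))
    return nested


for ftype, inner_name, outer_name in find_nested(parse_minfo(MINFO_TEXT)):
    print(f"{ftype} {inner_name} is nested inside {ftype} {outer_name}")
```

Features are only compared with others of the same type, so a CDS sitting inside its own gene is not reported.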

If nested CDS and gene features are not allowed by ENA for submission purposes, you can just remove the N2 annotations manually from your .tbl file, or you can make a new .minfo file for vadr that has N2 removed and use that to redo the annotation, whichever is easier.
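
The first option (removing N2 from the .tbl file) could be scripted roughly as below. This is only a sketch, not a VADR tool: `split_blocks` and `remove_gene` are hypothetical names, the input is assumed to be a tab-delimited feature table like the excerpt at the top of this issue, and the CDS is assumed to share the exact coordinates of its gene (true in that excerpt, since the CDS there carries no gene qualifier to match on).

```python
# Feature table excerpt from this issue, written with explicit tabs.
TBL_TEXT = (
    "19094\t20750\tgene\n"
    "\t\t\tgene\tN\n"
    "19094\t20750\tCDS\n"
    "\t\t\tproduct\tnucleocapsid phosphoprotein\n"
    "\t\t\tprotein_id\tNODE_1_length_10623_cov_925.238_7\n"
    "19115\t19838\tgene\n"
    "\t\t\tgene\tN2\n"
    "19115\t19838\tCDS\n"
    "\t\t\tproduct\tnucleocapsid phosphoprotein 2\n"
    "\t\t\tprotein_id\tNODE_1_length_10623_cov_925.238_8\n"
)


def split_blocks(lines):
    """Group each feature line with the qualifier lines that follow it.
    Header and blank lines become standalone blocks with coords=None."""
    blocks = []
    for line in lines:
        fields = line.split()
        is_header = line.startswith(">Feature")
        # A new feature starts at column 0 and has start, stop, type.
        starts_feature = (bool(fields) and not line[0].isspace()
                          and not is_header and len(fields) == 3)
        if starts_feature:
            blocks.append({"coords": (fields[0], fields[1]),
                           "type": fields[2], "lines": [line]})
        elif fields and not is_header and blocks and blocks[-1]["coords"]:
            # Qualifier or extra-interval line: belongs to the current feature.
            blocks[-1]["lines"].append(line)
        else:
            blocks.append({"coords": None, "type": None, "lines": [line]})
    return blocks


def has_gene_qualifier(block, gene_name):
    """True if the block contains an indented 'gene <gene_name>' qualifier."""
    for line in block["lines"][1:]:
        fields = line.split()
        if line[0].isspace() and fields[:2] == ["gene", gene_name]:
            return True
    return False


def remove_gene(tbl_text, gene_name):
    """Drop the named gene and any feature with identical coordinates."""
    blocks = split_blocks(tbl_text.splitlines())
    drop_coords = {b["coords"] for b in blocks
                   if b["type"] == "gene" and has_gene_qualifier(b, gene_name)}
    kept = [line for b in blocks
            if b["coords"] is None or b["coords"] not in drop_coords
            for line in b["lines"]]
    return "\n".join(kept) + "\n"


cleaned = remove_gene(TBL_TEXT, "N2")
print(cleaned, end="")
```

Matching on coordinates would also drop any unrelated feature that happens to have the same start and stop as N2, so it is worth diffing the output against the original .tbl before submitting.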

Let me know if that addresses your question or not.