ncbi / vadr

Viral Annotation DefineR: classification and annotation of viral sequences based on RefSeq annotation
Other
99 stars 23 forks source link

bug: vadr annotation erroneous coordinates/matchs for avian coronavirus #89

Open FTouzain opened 5 days ago

FTouzain commented 5 days ago

Thank you for your work.

When I run vadr 1.6.3 on this virus: https://www.ncbi.nlm.nih.gov/nuccore/KY933089 (fasta file)

I obtain nothing in the 'pass' tsv file. In the 'fail' tsv file, I get this:

>Feature consensus-from_reference_KY933089
435 20317   gene
            gene    1ab
435 12254   CDS
12254   20317
            product ORF1ab polyprotein
            product frameshift product
            protein_id  consensus-from_reference_KY933089_1
435 12284   CDS
            product ORF1a polyprotein
            protein_id  consensus-from_reference_KY933089_2
435 2453    mat_peptide
            product leader protein p87
            protein_id  consensus-from_reference_KY933089_1
2454    8762    mat_peptide
            product coronavirus nsp1 [HD1]
            product hydrophobic domain 1
            protein_id  consensus-from_reference_KY933089_1
8763    9683    mat_peptide
            product coronavirus nsp2 [3CL-Pro]
            protein_id  consensus-from_reference_KY933089_1
9684    10565   mat_peptide
            product coronavirus nsp3 [HD2]
            product hydrophobic domain 2
            protein_id  consensus-from_reference_KY933089_1
10566   10814   mat_peptide
            product coronavirus nsp4
            protein_id  consensus-from_reference_KY933089_1
10815   11444   mat_peptide
            product coronavirus nsp5
            protein_id  consensus-from_reference_KY933089_1
11445   11777   mat_peptide
            product coronavirus nsp6
            protein_id  consensus-from_reference_KY933089_1
11778   12212   mat_peptide
            product coronavirus nsp7 [GLF]
            product growth factor-like protein
            protein_id  consensus-from_reference_KY933089_1
12213   12254   mat_peptide
12254   15031
            product RNA-dependent RNA polymerase
            product coronavirus nsp9 [RdRp]
            protein_id  consensus-from_reference_KY933089_1
12213   12281   mat_peptide
            product coronavirus nsp8
            protein_id  consensus-from_reference_KY933089_2
15032   16831   mat_peptide
            product coronavirus nsp10
            product metal-binding NTPase/helicase
            protein_id  consensus-from_reference_KY933089_1
16832   18394   mat_peptide
            product coronavirus nsp11
            protein_id  consensus-from_reference_KY933089_1
18395   19408   mat_peptide
            product coronavirus nsp12
            protein_id  consensus-from_reference_KY933089_1
19409   20314   mat_peptide
            product coronavirus nsp13
            protein_id  consensus-from_reference_KY933089_1
20268   23765   gene
            gene    2
20268   23765   misc_feature
            note    similar to spike protein
23765   24462   gene
            gene    3
23765   23938   CDS
            product 3a protein
            protein_id  consensus-from_reference_KY933089_3
23938   24129   CDS
            product 3b protein
            protein_id  consensus-from_reference_KY933089_4
24113   24442   CDS
            product small virion-associated protein
            protein_id  consensus-from_reference_KY933089_5
24411   25091   gene
            gene    M
24411   25091   CDS
            product membrane protein
            protein_id  consensus-from_reference_KY933089_6
25455   25897   gene
            gene    5
25455   25652   CDS
            product 5a protein
            protein_id  consensus-from_reference_KY933089_7
25649   25897   CDS
            product 5b protein
            protein_id  consensus-from_reference_KY933089_8
25840   27068   gene
            gene    N
25840   27068   misc_feature
            note    similar to nucleocapsid protein

Additional note(s) to submitter:
ERROR: CDS_HAS_STOP_CODON: (CDS:spike protein) in-frame stop codon exists 5' of stop position predicted by homology to reference [TAA, shifted S:27,M:27]; seq-coords:23736..23738:+; mdl-coords:23827..23829:+; mdl:NC_001451;
ERROR: CDS_HAS_STOP_CODON: (CDS:spike protein) stop codon in protein-based alignment [-]; seq-coords:23736..23738:+; mdl-coords:23827..23829:+; mdl:NC_001451;
ERROR: POSSIBLE_FRAMESHIFT: (CDS:spike protein) possible frameshift in CDS (frame restored before end) [cause:delete,S:20527,M:20625(1); restore:insert,S:20536(1),M:20633; frame:1(3)1; length:260:(9):3229;]; seq-coords:20528..20536:+; mdl-coords:20625..20633:+; mdl:NC_001451;
ERROR: INDEFINITE_ANNOTATION_START: (CDS:spike protein) protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint [99>5]; seq-coords:20268..20366:+; mdl-coords:20368..20368:+; mdl:NC_001451;
ERROR: POSSIBLE_FRAMESHIFT: (CDS:nucleocapsid protein) possible frameshift in CDS (frame not restored before end) [cause:delete,S:27056,M:27090(1); frame:1(3); length:1217:(12);]; seq-coords:27057..27068:+; mdl-coords:27090..27102:+; mdl:NC_001451;
ERROR: INDEFINITE_ANNOTATION_END: (CDS:nucleocapsid protein) protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [11>8]; seq-coords:27058..27068:+; mdl-coords:27102..27102:+; mdl:NC_001451;
ERROR: MUTATION_AT_END: (CDS:nucleocapsid protein) expected stop codon could not be identified, first in-frame stop codon exists 3' of predicted stop position [TAA]; seq-coords:27070..27072:+; mdl-coords:27104..27106:+; mdl:NC_001451;
ERROR: UNEXPECTED_LENGTH: (CDS:nucleocapsid protein) length of complete coding (CDS or mat_peptide) feature is not a multiple of 3 [1229]; seq-coords:25840..27068:+; mdl-coords:25873..27102:+; mdl:NC_001451;

One of the problems is this:

15032   16831   mat_peptide
            product coronavirus nsp10
            product metal-binding NTPase/helicase
            protein_id  consensus-from_reference_KY933089_1

It point the reference genome I use (KY933089.1), but at these positions, it is not the nsp10 pointed by vadr but the nsp13 (in my case, I focus on the position 15707) as viewed here: https://www.ncbi.nlm.nih.gov/nuccore/KY933089.1?report=graph (on this page, nsp10 is located near 11700 12000 positions)

How can vadr report similarities to a genome with coordinates that do not correspond to the reference genome that is mentioned please? (I need vadr predictions because I use sometimes assemblies, not ref genomes)

Thank you in advance Fabrice

nawrockie commented 5 days ago

@FTouzain : I'd be happy to look into this, but it would be helpful if you can please provide some more information. If you provide your vadr model file and the v-annotate.pl command you used, I should be able to reproduce your results and look into it further. Or if you tell me what commands you ran to build your model, I can build it myself.

FTouzain commented 5 days ago

@nawrockie Oh yes, I am sorry, I forgot the most important. I use your classical corona model (25/05/2022) with the command:

v-annotate.pl  --mkey corona  --mdir /db/vadr_db/vadr-models-corona/   --glsearch   --split  -f  --cpu 4  consensus_on_KY933089.fasta    out_dir_vadr/

(in the realease note of the model, it is written: vadr-models-55-1.0.2-dev-5: [Apr 2020] for the model part) I hope I do not forget info, otherwise tell me. Thank you.

nawrockie commented 4 days ago

@FTouzain : I was able to reproduce the annotation of nsp10 from 15032 to 16831 by vadr for KY993089.1. The alignment of NC_001451 (the reference model VADR uses), which has nsp10 at positions 15132..16931 (https://www.ncbi.nlm.nih.gov/nuccore/NC_001451) and KY993089 clearly shows that vadr's annotation from 15032 to 16831 is correct. This alignment will be output from v-annotate.pl if you use the '--out_stk' option. The alignment file name will end with '.NC_001451.align.stk'. I'm not sure why the alignment graph you provided the link for shows that nsp10 is between 11700 and 12000, but in KY993089.1 it is clearly 15032..16831 as vadr reports.

I'm attaching a screenshot of blastx output I get for NC_001451/15132..16931 (fasta below) and KY993089 showing that the vadr coords are correct.

Also, based on the model file you are using, I think that you must not be using the latest version of vadr (1.6.4). When I ran vadr 1.6.4 using the latest coronavirus model library v1.3.3 https://ftp.ncbi.nlm.nih.gov/pub/nawrocki/vadr-models/coronaviridae/CURRENT/vadr-models-corona-1.3-3.tar.gz I found that there were less alerts output from v-annotate.pl for KY993089 than what you reported. Based on this, I think that upgrading to the latest versions will give you better results.

Screenshot 2024-11-21 at 11 04 25 AM
nawrockie commented 4 days ago
>NC_001451.1/15132-16931                                                                                                                                                                                                                    
TCTTGTGGCGTTTGTGTAGTTTGTAATAGTCAAACTATACTACGCTGCGGTAATTGTATT                                                                                                                                                                                
CGTAAACCGTTTTTGTGTTGTAAGTGTTGCTATGACCACGTCATGCATACGGACCACAAA                                                                                                                                                                                
AATGTTTTATCTATAAATCCTTATATTTGCTCACAGCTAGGTTGCGGTGAAGCAGATGTT                                                                                                                                                                                
ACTAAATTGTACCTCGGGGGTATGTCGTACTTCTGTGGTAATCATAAACCGAAATTGTCA                                                                                                                                                                                
ATACCGTTAGTATCTAATGGTACTGTTTTTGGAATTTACAGGGCTAATTGTGCTGGTAGT                                                                                                                                                                                
GAAAATGTTGATGATTTTAATCAACTAGCTACTACTAATTGGTCCATTGTCGAACCTTAT                                                                                                                                                                                
ATTTTAGCAAATCGCTGTAGTGATTCATTGAGACGTTTTGCTGCAGAGACAGTAAAAGCC                                                                                                                                                                                
ACAGAAGAATTACATAAGCAACAATTTGCTAGTGCAGAAGTGCGAGAAGTATTCTCAGAT                                                                                                                                                                                
CGTGAATTGATTCTATCATGGGAACCAGGAAAAACCAGGCCGCCATTGAATAGAAATTAT                                                                                                                                                                                
GTTTTCACAGGTTATCACTTTACAAGAACTAGTAAGGTGCAGCTTGGTGATTTTACATTT                                                                                                                                                                                
GAAAAAGGTGAAGGTAAGGATGTTGTCTATTATAAAGCAACGTCTACTGCTAAATTGTCT                                                                                                                                                                                
GTAGGAGACATTTTTGTTTTAACCTCACACAATGTTGTTTCTCTCGTAGCGCCAACATTG                                                                                                                                                                                
TGTCCACAACAAACCTTTTCTAGGTTTGTAAATTTAAGACCTAATGTAATGGTACCTGAA                                                                                                                                                                                
TGTTTTGTAAATAACATTCCACTTTACCATTTAGTAGGTAAACAGAAGCGTACTACAGTA                                                                                                                                                                                
CAAGGTCCTCCTGGCAGTGGTAAATCCCACTTTGCTATAGGCCTTGCAGTATACTTTAGT                                                                                                                                                                                
AGCGCTCGTGTTGTTTTTACTGCATGTTCTCATGCAGCTGTTGATGCTTTATGTGAAAAA                                                                                                                                                                                
GCTTTTAAGTTTCTTAAAGTTGATGATTGCACTCGTATAGTACCCCAAAGGACTACTGTC                                                                                                                                                                                
GATTGCTTCTCAAAATTTAAAGCTAATGACACAGGCAAAAAGTACATTTTTAGTACTATT                                                                                                                                                                                
AATGCCTTGCCGGAAGTTAGTTGTGATATTCTTTTGGTTGACGAGGTTAGTATGTTGACC                                                                                                                                                                                
AATTACGAATTGTCCTTTATTAATGGTAAGATAAATTACCAATATGTTGTGTATGTAGGT                                                                                                                                                                                
GATCCGGCTCAATTACCGGCACCCCGCACTTTACTTAATGGTTCACTTTCTCCAAAGGAT                                                                                                                                                                                
TATAATGTTGTCACAAACCTTATGGTTTGTGTTAAACCTGATATTTTCCTTGCAAAGTGT                                                                                                                                                                                
TATCGTTGTCCTAAGGAAATTGTAGACACTGTGTCTACTCTTGTTTATGATGGAAAGTTT                                                                                                                                                                                
ATTGCAAATAACCCAGAATCACGTGAGTGTTTCAAGGTTATAGTTAATAATGGCAATTCT                                                                                                                                                                                
GATGTAGGACATGAAAGTGGTTCAGCCTACAACACAACACAATTGGAATTTGTGAAAGAC                                                                                                                                                                                
TTTGTTTGTCGCAATAAACAATGGCGGGAAGCAATATTTATTTCACCTTACAATGCTATG                                                                                                                                                                                
AACCAGAGAGCTTACCGTATGCTTGGACTTAATGTTCAAACAGTAGATTCTTCTCAAGGT                                                                                                                                                                                
TCAGAGTATGATTATGTCATCTTCTGTGTTACTGCAGATTCGCAGCATGCACTGAATATT                                                                                                                                                                                
AATAGATTTAATGTGGCGCTTACAAGAGCTAAGCGTGGTATACTAGTTGTCATGCGCCAG                                                                                                                                                                                
CGTGATGAATTGTATTCTGCTCTTAAGTTTACAGAGCTAGATAGTGAAACAAGTCTGCAA 
nawrockie commented 4 days ago

@FTouzain : actually the best way to see that the vadr coords for nsp10 are correct is to run blastx with query of KY933089, then click 'Align two or more sequences' and enter NP_740630.1 as the subject (NP_740630.1 is the nsp10 protein db entry from NC_001451).

blastx server: https://blast.ncbi.nlm.nih.gov/Blast.cgi?LINK_LOC=blasthome&PAGE_TYPE=BlastSearch&PROGRAM=blastx