ncbi / vadr

Viral Annotation DefineR: classification and annotation of viral sequences based on RefSeq annotation
Other
99 stars 23 forks source link

Unable to build a model(NC_006273) #77

Closed voidbag closed 4 months ago

voidbag commented 6 months ago

I am building models for herpes virus(NC_006273)

3c6eca300124#  ./v-build.pl --forcelong NC_006273 NC_006273
# v-build.pl :: build homology model of a single sequence for feature annotation
# VADR 1.6.3 (Dec 2023)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:              Mon Apr 22 00:10:12 2024
# $VADRBLASTDIR:     /opt/vadr/ncbi-blast/bin
# $VADREASELDIR:     /opt/vadr/infernal/binaries
# $VADRINFERNALDIR:  /opt/vadr/infernal/binaries
# $VADRSCRIPTSDIR:   /opt/vadr/vadr
#
# accession/model name:                NC_006273
# output directory:                    NC_006273
# allow long models > 25Kb in length:  yes [--forcelong]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Fetching FASTA file                                          ... done. [    7.1 seconds]
# Parsing FASTA file                                           ... done. [    0.0 seconds]
# Fetching feature table file                                  ... done. [    7.5 seconds]
# Parsing feature table file                                   ... done. [    0.0 seconds]
# Fetching and parsing protein feature table file(s)           ... done. [  910.8 seconds]
# Pruning data read from GenBank                               ... done. [    0.0 seconds]
# Reformatting FASTA file to Stockholm file                    ... done. [    0.0 seconds]
# Building BLAST nucleotide database                           ... done. [    1.1 seconds]
# Finalizing feature information                               ... done. [    0.2 seconds]
# Translating CDS                                              ...
ERROR in sqf_EslTranslateCdsToFastaFile, problem translating CDS feature, unable to find expected translated sequence in NC_006273/NC_006273.vadr.cds.esl-translate.2.fa:
        seq: NC_006273.2/38133..37894:-
        expected sequence:source=NC_006273.2/38133..37894:-,coords=1..237

I suspect the contents in tbl file(NC_006273.vadr.tbl).

38133   37894   CDS
                        product protein UL30A
                        transl_except   (pos:complement(38131..38133),aa:Met)
                        protein_id      ref|YP_007947989.1|
                        note    UL30 family
                        exception       alternative start codon
                        experiment      EXISTENCE:ribosome profiling[23180859]

Could anyone help this issue?

For better handling, I uploaded .fa and .tbl files.

NC_006273.vadr.fa.txt NC_006273.vadr.tbl.txt

nawrockie commented 5 months ago

@voidbag : Apologies for my slow response. The reason for the failure is that there is a translation exception in the genbank record, and an alternative start codon is used for the UL30A gene's CDS. Relevant excerpt from the GenBank annotation:

CDS complement(37894..38133)
   /gene="UL30A"
   /exception="alternative start codon"
   /transl_except=(pos:complement(38131..38133),aa:Met)

The v-build.pl program validates all CDS sequences by checking that a corresponding ORF exists in the genome sequence, but it does not allow for alternative start codons like this.

The way around this is to modify the genome sequence to include a valid start, and supply that modified sequence as input to the v-build.pl program using the --infa option. This model will still work fine for annotation. I'm attaching a fasta file called (NC_006273.38132GtoA.fa.txt) which includes the NC_006273 sequence with a single nucleotide change of position 38132 from a G to an A.

You can use this with v-build.pl using the following command:

$VDIR/v-build.pl -f --skipbuild --infa NC_006273.38132GtoA.fa.txt --forcelong NC_006273 NC_006273

I recommend using the --skipbuild option also, which will prevent a CM file from being built. The NC_006273 genome, which is more than 200Kb, is too long to use a CM for anyway. When you go to annotate sequences using this model with v-annotate.pl I recommend you follow the instructions I've written for monkeypox virus (which is about 200Kb) here:

https://github.com/ncbi/vadr/wiki/Mpox-virus-annotation