Unable to build a model(NC_006273)

ncbi / vadr

Viral Annotation DefineR: classification and annotation of viral sequences based on RefSeq annotation

Other

99 stars 23 forks source link

3c6eca300124# ./v-build.pl --forcelong NC_006273 NC_006273 # v-build.pl :: build homology model of a single sequence for feature annotation # VADR 1.6.3 (Dec 2023) # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - # date: Mon Apr 22 00:10:12 2024 # $VADRBLASTDIR: /opt/vadr/ncbi-blast/bin # $VADREASELDIR: /opt/vadr/infernal/binaries # $VADRINFERNALDIR: /opt/vadr/infernal/binaries # $VADRSCRIPTSDIR: /opt/vadr/vadr # # accession/model name: NC_006273 # output directory: NC_006273 # allow long models > 25Kb in length: yes [--forcelong] # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - # Fetching FASTA file ... done. [ 7.1 seconds] # Parsing FASTA file ... done. [ 0.0 seconds] # Fetching feature table file ... done. [ 7.5 seconds] # Parsing feature table file ... done. [ 0.0 seconds] # Fetching and parsing protein feature table file(s) ... done. [ 910.8 seconds] # Pruning data read from GenBank ... done. [ 0.0 seconds] # Reformatting FASTA file to Stockholm file ... done. [ 0.0 seconds] # Building BLAST nucleotide database ... done. [ 1.1 seconds] # Finalizing feature information ... done. [ 0.2 seconds] # Translating CDS ... ERROR in sqf_EslTranslateCdsToFastaFile, problem translating CDS feature, unable to find expected translated sequence in NC_006273/NC_006273.vadr.cds.esl-translate.2.fa: seq: NC_006273.2/38133..37894:- expected sequence:source=NC_006273.2/38133..37894:-,coords=1..237

38133 37894 CDS product protein UL30A transl_except (pos:complement(38131..38133),aa:Met) protein_id ref|YP_007947989.1| note UL30 family exception alternative start codon experiment EXISTENCE:ribosome profiling[23180859]

@voidbag : Apologies for my slow response. The reason for the failure is that there is a translation exception in the genbank record, and an alternative start codon is used for the UL30A gene's CDS. Relevant excerpt from the GenBank annotation:

CDS complement(37894..38133)
   /gene="UL30A"
   /exception="alternative start codon"
   /transl_except=(pos:complement(38131..38133),aa:Met)

The v-build.pl program validates all CDS sequences by checking that a corresponding ORF exists in the genome sequence, but it does not allow for alternative start codons like this.

The way around this is to modify the genome sequence to include a valid start, and supply that modified sequence as input to the v-build.pl program using the --infa option. This model will still work fine for annotation. I'm attaching a fasta file called (NC_006273.38132GtoA.fa.txt) which includes the NC_006273 sequence with a single nucleotide change of position 38132 from a G to an A.

You can use this with v-build.pl using the following command:

$VDIR/v-build.pl -f --skipbuild --infa NC_006273.38132GtoA.fa.txt --forcelong NC_006273 NC_006273

I recommend using the --skipbuild option also, which will prevent a CM file from being built. The NC_006273 genome, which is more than 200Kb, is too long to use a CM for anyway. When you go to annotate sequences using this model with v-annotate.pl I recommend you follow the instructions I've written for monkeypox virus (which is about 200Kb) here:

https://github.com/ncbi/vadr/wiki/Mpox-virus-annotation

ncbi / vadr

Unable to build a model(NC_006273) #77