oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0
431 stars 53 forks source link

Error on GenBank file with CompoundLocation overlapping sequence origin #324

Open aekazakov opened 4 days ago

aekazakov commented 4 days ago

Hi! Thank you for the great tool!

On a genome that has a coding gene overlapping circular sequence origin, Bakta terminates with error.

I run Bakta version 1.9.4 installed with conda, database version 5.1.

The command executed is: bakta --debug --db /mnt/data/ref/Bakta/v5.1/db --output test_bakta --prefix NZ_CP0128315.bakta --threads 8 --regions sequence.gb sequence.fasta

Input files sequence.gb and sequence.fasta for the NCBI sequence NZ_CP012831.1 were downloaded from https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP012831.1 and contain full Genbank record and FASTA-formatted nucleotide sequence.

This sequence has a gene and CDS features overlapping sequence origin:

     gene            complement(join(7118734..7119102,1..405))
                     /locus_tag="AO356_RS00005"
                     /old_locus_tag="AO356_00005"
     CDS             complement(join(7118734..7119102,1..405))
                     /locus_tag="AO356_RS00005"
                     /old_locus_tag="AO356_00005"
                     /EC_number="[1.1.1.14](https://enzyme.expasy.org/EC/1.1.1.14)"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:WP_004415564.1"
                     /GO_function="[GO:0016491 - oxidoreductase activity](http://amigo.geneontology.org/amigo/term/GO:0016491)
                     [Evidence IEA]"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=[11](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c#SG11)
                     /product="L-iditol 2-dehydrogenase"
                     /protein_id="[WP_013692788.1](https://www.ncbi.nlm.nih.gov/protein/503458127)"
                     /translation="MKRLEGKSALITGSARGIGRSFAQAYIGEGATVAIADINLERAQ
                     ATASELGPQAYAVEMDVTRQASIDAAIAEVVAHAGKLDILVNNAALFDLAPITEITRE
                     SYDKLFAINVSGTLFTLQAAAKQMISQGHGGKIINMASQAGRRGEALVGVYCATKAAV
                     ISLTQSAGLNLIKHKINVNAIAPGVVDGEHWDGVDALFAKHENRPLGEKKRLVGLEVP
                     YGRMGTAEDLVGMAIFLAGSESDYIVAQTYNVDGGNWMS"

Error message from NZ_CP0128315.bakta.log is:


11:32:33.190 - DEBUG - MAIN - revise translational exceptions
11:32:33.202 - DEBUG - MAIN - import user-provided CDS regions
11:32:34.399 - ERROR - CDS - user-provided CDS: CDS could not be translated into a valid amino acid sequence! contig=contig_1, start=1, stop=7119102, cds=ATCAACATGG...[7119092 symbols deleted]
11:32:34.428 - ERROR - CDS - user-provided CDS: regions/features file GenBank format not valid!
Traceback (most recent call last):
  File "/home/aekazakov/tools/anaconda3/envs/bakta/lib/python3.10/site-packages/bakta/features/cds.py", line 290, in import_user_cdss
    aa = str(Seq(nt).translate(table=cfg.translation_table, cds=True))
  File "/home/aekazakov/tools/anaconda3/envs/bakta/lib/python3.10/site-packages/Bio/Seq.py", line 1620, in translate
    _translate_str(str(self), table, stop_symbol, to_stop, cds, gap=gap)
  File "/home/aekazakov/tools/anaconda3/envs/bakta/lib/python3.10/site-packages/Bio/Seq.py", line 2871, in _translate_str
    raise CodonTable.TranslationError(
Bio.Data.CodonTable.TranslationError: Final codon 'ATC' is not a stop codon

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aekazakov/tools/anaconda3/envs/bakta/lib/python3.10/site-packages/bakta/features/cds.py", line 295, in import_user_cdss
    raise ValueError(f"User-provided CDS could not be translated into a valid amino acid sequence! contig={user_cds['contig']}, start={user_cds['start']}, stop={user_cds['stop']}, cds={nt}")
ValueError: User-provided CDS could not be translated into a valid amino acid sequence! contig=contig_1, start=1, stop=7119102, cds=ATCAACATGG...[7119092 symbols deleted]

This error occurs because the CDS sequence extracted by extract_feature_sequence (bakta/utils.py) is wrong. It is 7.1 Mbp long instead of 774 bp.

oschwengers commented 3 days ago

Hi @aekazakov and thanks for the detailed report! I'll have a look at it and try to include this into the upcoming v1.10.0 release.