oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0
448 stars 55 forks source link

Error on GenBank file with CompoundLocation overlapping sequence origin #324

Closed aekazakov closed 1 month ago

aekazakov commented 1 month ago

Hi! Thank you for the great tool!

On a genome that has a coding gene overlapping circular sequence origin, Bakta terminates with error.

I run Bakta version 1.9.4 installed with conda, database version 5.1.

The command executed is: bakta --debug --db /mnt/data/ref/Bakta/v5.1/db --output test_bakta --prefix NZ_CP0128315.bakta --threads 8 --regions sequence.gb sequence.fasta

Input files sequence.gb and sequence.fasta for the NCBI sequence NZ_CP012831.1 were downloaded from https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP012831.1 and contain full Genbank record and FASTA-formatted nucleotide sequence.

This sequence has a gene and CDS features overlapping sequence origin:

     gene            complement(join(7118734..7119102,1..405))
                     /locus_tag="AO356_RS00005"
                     /old_locus_tag="AO356_00005"
     CDS             complement(join(7118734..7119102,1..405))
                     /locus_tag="AO356_RS00005"
                     /old_locus_tag="AO356_00005"
                     /EC_number="[1.1.1.14](https://enzyme.expasy.org/EC/1.1.1.14)"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:WP_004415564.1"
                     /GO_function="[GO:0016491 - oxidoreductase activity](http://amigo.geneontology.org/amigo/term/GO:0016491)
                     [Evidence IEA]"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=[11](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c#SG11)
                     /product="L-iditol 2-dehydrogenase"
                     /protein_id="[WP_013692788.1](https://www.ncbi.nlm.nih.gov/protein/503458127)"
                     /translation="MKRLEGKSALITGSARGIGRSFAQAYIGEGATVAIADINLERAQ
                     ATASELGPQAYAVEMDVTRQASIDAAIAEVVAHAGKLDILVNNAALFDLAPITEITRE
                     SYDKLFAINVSGTLFTLQAAAKQMISQGHGGKIINMASQAGRRGEALVGVYCATKAAV
                     ISLTQSAGLNLIKHKINVNAIAPGVVDGEHWDGVDALFAKHENRPLGEKKRLVGLEVP
                     YGRMGTAEDLVGMAIFLAGSESDYIVAQTYNVDGGNWMS"

Error message from NZ_CP0128315.bakta.log is:


11:32:33.190 - DEBUG - MAIN - revise translational exceptions
11:32:33.202 - DEBUG - MAIN - import user-provided CDS regions
11:32:34.399 - ERROR - CDS - user-provided CDS: CDS could not be translated into a valid amino acid sequence! contig=contig_1, start=1, stop=7119102, cds=ATCAACATGG...[7119092 symbols deleted]
11:32:34.428 - ERROR - CDS - user-provided CDS: regions/features file GenBank format not valid!
Traceback (most recent call last):
  File "/home/aekazakov/tools/anaconda3/envs/bakta/lib/python3.10/site-packages/bakta/features/cds.py", line 290, in import_user_cdss
    aa = str(Seq(nt).translate(table=cfg.translation_table, cds=True))
  File "/home/aekazakov/tools/anaconda3/envs/bakta/lib/python3.10/site-packages/Bio/Seq.py", line 1620, in translate
    _translate_str(str(self), table, stop_symbol, to_stop, cds, gap=gap)
  File "/home/aekazakov/tools/anaconda3/envs/bakta/lib/python3.10/site-packages/Bio/Seq.py", line 2871, in _translate_str
    raise CodonTable.TranslationError(
Bio.Data.CodonTable.TranslationError: Final codon 'ATC' is not a stop codon

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aekazakov/tools/anaconda3/envs/bakta/lib/python3.10/site-packages/bakta/features/cds.py", line 295, in import_user_cdss
    raise ValueError(f"User-provided CDS could not be translated into a valid amino acid sequence! contig={user_cds['contig']}, start={user_cds['start']}, stop={user_cds['stop']}, cds={nt}")
ValueError: User-provided CDS could not be translated into a valid amino acid sequence! contig=contig_1, start=1, stop=7119102, cds=ATCAACATGG...[7119092 symbols deleted]

This error occurs because the CDS sequence extracted by extract_feature_sequence (bakta/utils.py) is wrong. It is 7.1 Mbp long instead of 774 bp.

oschwengers commented 1 month ago

Hi @aekazakov and thanks for the detailed report! I'll have a look at it and try to include this into the upcoming v1.10.0 release.

oschwengers commented 1 month ago

OK, this should be fixed by #332.

@aekazakov If you like, and feel comfortable with being a very-early-tester, you could checkout the fix-edge-user-proteins branch at https://github.com/oschwengers/bakta/tree/fix-edge-user-proteins and give it a try.

aekazakov commented 1 month ago

I tested the fix-edge-user-proteins branch with two RefSeq genomes (NZ_CP012831.1 and NZ_CP015511.1) that have gene and CDS features overlapping sequence origin on - and + strand, respectively. For both genomes, the GenBank files were imported without errors, and the genes overlapping origin were reported correctly in Bakta output files. Thank you for fixing this bug!

oschwengers commented 1 month ago

Thanks for testing and reporting back. With that, I'll close this. If there are any further things to discuss or issues arising from this, please do not hesitate to re-open this or a new one. Thanks.