Different expected and conceptual translations for some partial CDS

tpillone commented 1 month ago

Hi,

I'm trying to submit embl flatfiles generated by bakta to the EBI but I am confronted with multiple errors related to the translation of some partial CDS. It seems that some partial CDS have unexpected translated sequences. The message raise by the webin validation tool is:

ERROR: Expected and conceptual translations are different. [ line: 167841 of 4_GEN12896.embl.gz]

If I translate those CDS with biopython, I systematically get one extra amino acid as compared to the translation reported by bakta. Here are few examples (from this embl file):

contig_113 <0:1356 biopython LRNVESDVDALFAKTDITIGQSLTLLNNEITKFVGEAGKGSGAAQVLAGSVQTLASNLDLIADGALVVGIGYITRAILMKSAAIKEGMASTLASRQASVLNAQAEYAEATAALNAAKAHLANVRATNAETQAKFGATAAATRYAQAQAAVTAATNAQTAAQIKLNTATSIAGRLAKGAFGLIGGWAGVATLGVMGLAAAYSYFNNKAEGAKQKLAEQAKVAEKADEELKKLTGNDKAKAVNDLTTAFNAQNKALEKSSRAVGSALIDIENYARGNREVEKISQEARTGTISYTEAIERLNKIKLPTDLYENLKKQAAQYDDNASKASLSAEKLKLLRVEVKLGGNEAQNAAIQHQKQADALGNTATEAEKATKALQDYQAKQKDSVIDSIYKSGWLDKGYTVAQANAILELQKAKGMSAILSKDEIDSALRNLKIIEEQQEREDKLTEAKRK bakta LRNVESDVDALFAKTDITIGQSLTLLNNEITKFVGEAGKGSGAAQVLAGSVQTLASNLDLIADGALVVGIGYITRAILMKSAAIKEGMASTLASRQASVLNAQAEYAEATAALNAAKAHLANVRATNAETQAKFGATAAATRYAQAQAAVTAATNAQTAAQIKLNTATSIAGRLAKGAFGLIGGWAGVATLGVMGLAAAYSYFNNKAEGAKQKLAEQAKVAEKADEELKKLTGNDKAKAVNDLTTAFNAQNKALEKSSRAVGSALIDIENYARGNREVEKISQEARTGTISYTEAIERLNKIKLPTDLYENLKKQAAQYDDNASKASLSAEKLKLLRVEVKLGGNEAQNAAIQHQKQADALGNTATEAEKATKALQDYQAKQKDSVIDSIYKSGWLDKGYTVAQANAILELQKAKGMSAILSKDEIDSALRNLKIIEEQQEREDKLTEAKR contig_127 <0:831 biopython LQKQQELGLLKLAQEQRLFQAEQFMLGEMERIKKRYALEYDEISKITDLEERRRKMSAFQADFIRNGVGNPTIDQYDTSSQFLKSTNYTKPKQTNMQVLDEDYAQTYQKLKDNLAAVLESEKASYQERLEAERVFKEARQQMDNEYHLKAIDARKADHDSQLQLYSQMISSASSTWGGLTQIVKDARGENSRSFKAMFIAQQSFAIASAIISAHLAATQVAADATIPFFGAKIAASTAMLAMGYANAGLIAGQTIAGFSDGGFTGSGGKYQPAGIVH bakta LQKQQELGLLKLAQEQRLFQAEQFMLGEMERIKKRYALEYDEISKITDLEERRRKMSAFQADFIRNGVGNPTIDQYDTSSQFLKSTNYTKPKQTNMQVLDEDYAQTYQKLKDNLAAVLESEKASYQERLEAERVFKEARQQMDNEYHLKAIDARKADHDSQLQLYSQMISSASSTWGGLTQIVKDARGENSRSFKAMFIAQQSFAIASAIISAHLAATQVAADATIPFFGAKIAASTAMLAMGYANAGLIAGQTIAGFSDGGFTGSGGKYQPAGIV contig_129 <1:808 biopython TQEIEKQAKLTKRLVGISGQSGIGTGPHLDVRYGGSLSGQKVSNEHLARLQAGGKPLTSYKISSNYGPRKAPTKGASSFHKGIDFSMPEGTPITTNVAVKDIKTWYDSKGGGYVSEVIFEDGVSLKLLHQSPKMQSKVKGGASKGSDKAAGDIQSQLERQQDLQRSLENEVASEVGRINNNRKARLEDVDKANFSPERTAEIKAEINRRADNDIAIAKQALRTKLEDYKEFQKTEEQLLEESFNRKKFNAAHDLELSKFEQKQAVELLE bakta TQEIEKQAKLTKRLVGISGQSGIGTGPHLDVRYGGSLSGQKVSNEHLARLQAGGKPLTSYKISSNYGPRKAPTKGASSFHKGIDFSMPEGTPITTNVAVKDIKTWYDSKGGGYVSEVIFEDGVSLKLLHQSPKMQSKVKGGASKGSDKAAGDIQSQLERQQDLQRSLENEVASEVGRINNNRKARLEDVDKANFSPERTAEIKAEINRRADNDIAIAKQALRTKLEDYKEFQKTEEQLLEESFNRKKFNAAHDLELSKFEQKQAVELL contig_133 <2:638 biopython QNWGGIQADMNGTGEFFRQDQERFSRLNAANDLADSQFAATDLNEQNSLDGLNAQFEAGLIKQQDYENQKTAIIQAAQDQRNQIAAEYAQNAQDIEDKYQQDRLNTIIAFGGNMMGSLTSMFGSMFGEQSKAYKIMFAADKAYAIAAAGIAIQQNIAAASKVGFPLNLPLIAGAVAQGASIIANIRAIKDQGFAEGGYTGRGGKYEVAGAVH bakta QNWGGIQADMNGTGEFFRQDQERFSRLNAANDLADSQFAATDLNEQNSLDGLNAQFEAGLIKQQDYENQKTAIIQAAQDQRNQIAAEYAQNAQDIEDKYQQDRLNTIIAFGGNMMGSLTSMFGSMFGEQSKAYKIMFAADKAYAIAAAGIAIQQNIAAASKVGFPLNLPLIAGAVAQGASIIANIRAIKDQGFAEGGYTGRGGKYEVAGAV

Biopython code to get the translation:

trans = feature.translate(record.seq, cds=False, stop_symbol="")

This genome was annotated with bakta 1.9.4 with but I had similar issues with older versions.
They are many other partial CDS that do not exhibit this problem

Any idea what might be causing this problem?

Best, Trestan

oschwengers commented 1 month ago

Hi Trestan, thanks for reporting. This looks both odd and interesting, and I'd like to take a deeper look at this. Could you please run this setting the --debug flag (if not set already) and provide also the log and json file?

tpillone commented 1 month ago

Here are the log and json files in debug mode:

GEN12896.json GEN12896.log

Thank you, Trestan

oschwengers commented 3 weeks ago

OK, I took a deeper look into this. Indeed, for partial CDS that are truncated at both sides, the trailing amino acids is accidentally clipped as this is perceived as a trailing * which normally encodes a stop codon. Of course, in these cases, there is no such stop codon which causes the too short aa sequences.

I think this is a crucial bug, but occurs only in such rare edge cases that for most applications might not be that relevant (at least I hope so), as most analyses probably do not take into account annotated protein sequences that are truncated at both sides, anyway.

I committed a fix which will be available soon with the upcoming version 1.10.0.

Again, thank you very much for reporting this! I'll close this for now, but please do not hesitate to re-open it if you have any further questions or feedback.

tpillone commented 2 weeks ago

Hi @oschwengers Thank you for the fix! Looking more closely at those genomes, I finally understood why this rare edge case happened several times. It turns out that I re-annotated fasta files previously annotated with an older version of Bakta with a wrong --complete flag. Bakta fasta headers had the following format: >contig_1 [completeness=complete] [topology=linear] [gcode=11] As Bakta considers sequences as circular when the keyword "complete" is detected in the fasta header (https://github.com/oschwengers/bakta/blob/518253abc7afe0fbd8a19051256f4181801d33dd/bakta/utils.py#L406-L408), all contigs were considered as circular, leading to the identification or a lot of CDS spanning contigs ends, including CDS truncated on both ends. The origin of that specific problem was the wrong "`--complete" flag, but chromosomes and plasmids can be linear, is it really safe to try to guess the topology from the content of the header? Best

oschwengers / bakta

Different expected and conceptual translations for some partial CDS #340