oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids

GNU General Public License v3.0

KEGG annotations - the number is much lower than expected #285

Status: Closed (tvtv195 closed this 1 month ago)

tvtv195 commented 5 months ago

Dear bakta team, we ran bakta on several bacterial genomes (estimated completeness >50%, contamination <10%) to obtain KEGG annotations. Bakta ran just fine, without any issues, e.g.:

bakta --db /gpfs/gpfs1/scratch/cb761220/databases/bakta_db_2024/db \
  -o /scratch/cb761203/02.analysis/13.module13/01.bakta/ -v \
  /scratch/cb761203/02.analysis/12.module12/bins_hq/sample.bin.35.fa

However, we only got between 0 and 17 KEGG annotations (KO IDs) per genome. For example, a very small bacterial genome of 531,276 bp had 517 CDS but not a single KEGG annotation.

Is there something wrong with the way bakta assigns KEGG annotations? Best, Chris

oschwengers commented 5 months ago

Hi Chris, and thanks for reaching out. Based on the command line above, I guess you're working on a MAG. Depending on the species, it could simply be the case that there are only few to no genes similar to those stored in KEGG. Could you provide some information about how many UniRef90-annotated genes and how many hypotheticals you get?
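One possible way to collect these numbers is a short script over Bakta's tab-separated annotation table. This is only a sketch under assumptions that are not stated in this thread: that header and comment lines start with `#`, that the columns are sequence ID, type, start, stop, strand, locus tag, gene, product and DbXrefs, that CDS rows have the type `cds`, and that cross-references are comma-separated entries such as `KEGG:K02313` or `UniRef:UniRef90_...`. The function name `summarize_bakta_tsv` is made up for illustration.

```python
from collections import Counter


def summarize_bakta_tsv(lines):
    """Count CDS rows and how many carry KEGG / UniRef90 xrefs or are hypothetical.

    Assumes a Bakta-style TSV: '#'-prefixed header lines, then one feature per
    line with the product in column 8 and DbXrefs in column 9 (assumed layout).
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8 or fields[1].lower() != "cds":
            continue  # only protein-coding features are of interest here
        counts["cds"] += 1
        product = fields[7]
        dbxrefs = fields[8] if len(fields) > 8 else ""
        xrefs = [x.strip() for x in dbxrefs.split(",") if x.strip()]
        if "hypothetical protein" in product.lower():
            counts["hypothetical"] += 1
        if any(x.startswith("UniRef:UniRef90_") for x in xrefs):
            counts["uniref90"] += 1
        if any(x.startswith("KEGG:") for x in xrefs):
            counts["kegg"] += 1
    return counts
```

It could be called as `summarize_bakta_tsv(open("sample.bin.35.tsv"))`, where the file name is a guess based on the input FASTA above. A high share of hypotheticals together with few UniRef90 hits would point to a poorly characterized organism rather than to a problem with the KEGG assignment itself.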

cpauvert commented 2 months ago

Just wanted to add that not all the KEGG annotations are included in the database (see the line below), which could also explain the discrepancy between the results and your expectations @tvtv195

https://github.com/oschwengers/bakta/blob/c93c3f144282146b89e5d372f91f0e3cf60d968e/db-scripts/annotate-kofams.py#L58

Best,

tvtv195 commented 2 months ago

Thanks, that's good to know. We have found a workaround: we run the gene calling through bakta and then submit the output to BlastKOALA (KEGG Orthology And Links Annotation; https://www.kegg.jp/blastkoala/). This way we went from, e.g., 1 KEGG annotation (bakta) to >100 (BlastKOALA). Our downstream checks (gene neighborhood and operon analysis, blast, phylogeny, etc.) confirmed the BlastKOALA annotations.
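For anyone wanting to tally the BlastKOALA side of such a comparison, the following is a minimal sketch. It assumes the downloadable BlastKOALA result is a two-column, tab-separated list of gene ID and K number, with the second column missing or empty for genes that received no assignment. That layout and the function name `count_ko_assignments` are assumptions, not something confirmed in this thread.

```python
def count_ko_assignments(lines):
    """Return (total genes, genes with a KO, set of distinct KOs).

    Assumes one gene per line: 'gene_id<TAB>K number', where the K number
    may be absent for unassigned genes (assumed BlastKOALA output layout).
    """
    total = 0
    assigned = 0
    distinct_kos = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0].strip():
            continue  # ignore blank lines
        total += 1
        if len(fields) > 1 and fields[1].strip():
            assigned += 1
            distinct_kos.add(fields[1].strip())
    return total, assigned, distinct_kos
```

Comparing `assigned` with the number of CDS that carry a KEGG cross-reference in the Bakta output gives the kind of before/after figure quoted above.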

oschwengers commented 2 months ago

Thanks @cpauvert for bringing this up - 100% correct!

The Bakta database integrates annotation information from various external databases. It tries to rank them so that larger, more comprehensive DBs come first and smaller, often more specific and higher-quality DBs come later. This way, we try to exploit the potentially more specific, higher-quality information from the smaller DBs. However, it is far from trivial to formalize these rankings. Hence, I decided to only take into account the upper 90% of all BlastKOALA annotations, as @cpauvert mentioned.
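The ranking idea can be illustrated with a small sketch. This is not Bakta's actual database code: the function `merge_ranked_annotations`, the dictionary layout and the example values are all hypothetical, and the sketch only shows the general principle that sources applied later override earlier ones wherever they provide a value.

```python
def merge_ranked_annotations(ranked_sources):
    """Merge per-protein annotations from sources ordered broad -> specific.

    Later (more specific) sources overwrite earlier values, but only for
    fields where they actually provide something.
    """
    merged = {}
    for source in ranked_sources:
        for field, value in source.items():
            if value:  # empty or missing values never erase earlier information
                merged[field] = value
    return merged


# Hypothetical example: a broad source followed by a smaller, curated one.
broad = {"product": "ATPase", "kegg": "K02313"}
specific = {"product": "chromosomal replication initiator DnaA", "kegg": None}
merged = merge_ranked_annotations([broad, specific])
```

In this toy case the curated product name wins, while the KEGG identifier from the broad source is kept because the specific source has nothing to say about it.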

I hope that, over time, more and more additional information, like dbxrefs, ECs, etc., will make it into the Bakta annotation database. Needless to say, we're always open to and thankful for any ideas and feedback on how to improve these things.