Closed tvtv195 closed 1 month ago
Hi Chris, and thanks for reaching out. Based on the command line above, I guess you're working on a MAG. Depending on the species, it could simply be that few or no genes are similar to those stored in KEGG. Could you provide some information about how many UniRef90-annotated genes and how many hypotheticals you get?
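In case it helps to get those numbers quickly, here is a minimal sketch that tallies them from a Bakta-style TSV. It assumes the column order `Sequence Id, Type, Start, Stop, Strand, Locus Tag, Gene, Product, DbXrefs`, that comment/header lines start with `#`, and that cross-references look like `KEGG:K…` and `UniRef:UniRef90_…`; please check these against your own output file. The function name and the sample rows are made up for illustration.

```python
import io


def summarize_bakta_tsv(handle):
    """Count CDS rows and a few annotation categories in a Bakta-style TSV."""
    counts = {"cds": 0, "kegg": 0, "uniref90": 0, "hypothetical": 0}
    for line in handle:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: column 2 is the feature type, column 8 the product,
        # column 9 (optional) the comma-separated DbXrefs.
        if len(fields) < 8 or fields[1].lower() != "cds":
            continue
        product = fields[7]
        dbxrefs = fields[8] if len(fields) > 8 else ""
        counts["cds"] += 1
        if "KEGG:" in dbxrefs:
            counts["kegg"] += 1
        if "UniRef90_" in dbxrefs:
            counts["uniref90"] += 1
        if product.lower().startswith("hypothetical protein"):
            counts["hypothetical"] += 1
    return counts


# Made-up rows, only to show the expected shape of the input.
sample = (
    "#Sequence Id\tType\tStart\tStop\tStrand\tLocus Tag\tGene\tProduct\tDbXrefs\n"
    "contig_1\tcds\t1\t300\t+\tTAG_1\tdnaA\t"
    "chromosomal replication initiator protein DnaA\t"
    "KEGG:K02313, UniRef:UniRef90_X1\n"
    "contig_1\tcds\t400\t700\t-\tTAG_2\t\thypothetical protein\t\n"
    "contig_1\ttRNA\t800\t875\t+\tTAG_3\t\ttRNA-Ala\t\n"
)
print(summarize_bakta_tsv(io.StringIO(sample)))
```

For a real run, open the `.tsv` file from the output directory and pass the file handle instead of the `io.StringIO` sample.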
Just wanted to add that not all the KEGG annotations are included in the database (see the line below), which could also explain the discrepancy between the results and your expectations @tvtv195
Best,
Thanks, that's good to know. We have found a workaround: we run the gene calling through bakta and then submit the output to BlastKOALA (KEGG Orthology And Links Annotation, https://www.kegg.jp/blastkoala/). This way we went from, e.g., 1 KEGG annotation (bakta) to >100 (BlastKOALA). Our downstream checks (gene neighborhood and operon analysis, BLAST, phylogeny, etc.) confirmed the BlastKOALA annotations.
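For anyone wanting to make the same comparison, the BlastKOALA side can be counted with a rough sketch like the one below. It assumes the downloaded result is a tab-separated text file with the gene ID in the first column and an optional K number in the second (empty when nothing was assigned); the function name and sample lines are hypothetical.

```python
import io


def count_ko_assignments(handle):
    """Return (genes_with_ko, total_genes) for a two-column gene-to-KO table."""
    assigned = 0
    total = 0
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        total += 1
        # A gene counts as annotated only if the second column holds a K number.
        if len(fields) > 1 and fields[1].strip().startswith("K"):
            assigned += 1
    return assigned, total


# Made-up example: two genes with a K number, one without.
ko_sample = "gene_1\tK02313\ngene_2\ngene_3\tK03086\n"
print(count_ko_assignments(io.StringIO(ko_sample)))
```

Comparing the first number with the per-genome KEGG count from the bakta output gives the kind of "1 vs. >100" figure mentioned above.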
Thanks @cpauvert for bringing this up - 100% correct!
The Bakta database integrates annotation information from various external databases, trying to rank them so that larger, more comprehensive DBs come first and smaller, often more specific, higher-quality databases come later. This way, we try to exploit the potentially more specific, higher-quality information from smaller DBs. However, it is far from trivial to formalize these rankings. Hence, I decided to only take into account the upper 90% of all BlastKOALA annotations, as @cpauvert mentioned.
I hope that, over time, more and more additional information, like dbxrefs, ECs, etc., will make it into the Bakta annotation database. Needless to say, we're always open to and thankful for any ideas and feedback on how to improve these things.
Dear bakta team, we ran bakta on several bacterial genomes, with >50% estimated completeness and <10% contamination, to obtain KEGG annotations. Bakta ran just fine, without any issues, e.g.:
bakta --db /gpfs/gpfs1/scratch/cb761220/databases/bakta_db_2024/db \
  -o /scratch/cb761203/02.analysis/13.module13/01.bakta/ -v \
  /scratch/cb761203/02.analysis/12.module12/bins_hq/sample.bin.35.fa
However, we only got between 0 and 17 KEGG annotations (KO IDs) per genome. For example, a very small bacterial genome of 531,276 bp had 517 CDS but not a single KEGG annotation.
Is there something wrong with the way bakta assigns KEGG annotations? Best, Chris