Closed Dx-wmc closed 1 month ago
Also, another point of confusion for me: is it not possible to reuse a gbk file via the --region parameter for a different genome? I have now manually annotated a gbk file, but when I try to apply it to another genome, the --region parameter reports an error.
Hi, and sorry for responding late.
First: To take a deeper look into the first issue, could you provide me with an example, i.e. the genome in Fasta format and both the raw and the fixed gbk files?
Second: I'm not sure I understand correctly what you're trying to achieve. You would like to re-use a Genbank file via --region for another genome?
Hi, I have prepared the example data as requested:
s5.fasta: The genome sequence annotated by the PGAP pipeline.
s5.gbk: The raw GenBank file generated by PGAP for s5.fasta; it contains genes whose length is not a multiple of 3.
s5_edited.gbk: The manually modified GenBank file from which I removed the problematic genes so that Bakta's --region parameter works correctly. Directly using --region s5.gbk results in errors.
1011.fasta: Another genome sequence for which I would like to reuse the annotations from s5_edited.gbk. However, this fails unless I manually change the first line of s5_edited.gbk from s5 to 1011, which is cumbersome.

Based on these observations, I would like to propose the following enhancements to Bakta:
--region parameter.

I have attached the example data for your review. test.tar.gz
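The manual step described above (changing the record name in the first line of the gbk file from s5 to 1011) could be scripted. The following is only a minimal sketch of that workaround, not a Bakta feature: the function name `rename_locus` and the mapping argument are hypothetical, and it assumes the record name is the second whitespace-separated token of each LOCUS line. It only makes the sequence identifiers match; it does not check that the gene coordinates are still valid in the other genome.

```python
import re

# Matches the record name in a GenBank LOCUS line,
# e.g. "LOCUS       s5    5000 bp    DNA ..."
LOCUS_RE = re.compile(r"^(LOCUS\s+)(\S+)", re.MULTILINE)


def rename_locus(gbk_text: str, mapping: dict) -> str:
    """Return gbk_text with LOCUS names replaced according to mapping.

    Record names that do not appear in the mapping are left unchanged.
    Note: the fixed-column alignment of the LOCUS line is not preserved
    if the old and new names differ in length.
    """
    def _swap(match):
        prefix, old_name = match.group(1), match.group(2)
        return prefix + mapping.get(old_name, old_name)

    return LOCUS_RE.sub(_swap, gbk_text)
```

For the files in this thread, one would read s5_edited.gbk as text, call `rename_locus(text, {"s5": "1011"})`, and write the result to a new file before passing it to --region.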
Hi @Dx-wmc,
I just pushed a fix to skip a couple of cases where a CDS does not consist of complete triplets, i.e. pseudogenes, partial genes and programmed frameshifts. This should fix 1.
Regarding 2.: Either I still don't quite get this use case, or I'm not convinced that this is a good idea. In which cases can we just map the predicted CDS regions of genome A to genome B?
I agree with you that my previous considerations were somewhat lacking (especially in the case of large genomic differences). However, my intention was mainly to apply carefully prepared annotations (e.g., ones that I had manually modified) to other, highly similar genomes, which would save me from manually modifying them one by one.
OK, I see your totally valid point and this is certainly something that we should somehow try to address. However, we need to separate two distinct approaches:
For 1. we can use the --region feature, which requires exact gene locations for each genome. These can either be provided manually (which is very tedious) or by an automated process. Currently, we have a student in our lab working on a helper script that tries to automate this.
For 2. you can already use the --proteins feature to prioritize your manually curated gene symbols, protein descriptions, etc. as described here
I have tried the second point you mentioned, --protein, but it has certain limitations: it only works for the CDS predicted by Bakta. So I think the first point is what I want to achieve. I suggest that this function be applied to shorter sequence comparisons first (for example, regions within 20k, which seems more suitable for comparative genomic analysis).
I also tried another tool, LiftOn (https://github.com/Kuanhao-Chao/LiftOn), which fits my idea of transferring annotations from one genome to another, but it is designed for eukaryotes. Would it be possible to borrow ideas from it and apply them to Bakta?
Sorry for the late reply.
> The second point you mentioned --protein I have tried, it has certain limitations, that is, it only works for the cds predicted by bakta
Actually, user-provided protein annotations via --proteins should be applied to user-provided CDS regions via --region, as well.
> I tried another software, lifton(https://github.com/Kuanhao-Chao/LiftOn), which fits my idea of being able to do migration from annotation of one genome to another, but it is designed for eukaryotes, is it possible to borrow ideas from it to apply to bakta?
Yes, in principle, one could adopt these ideas. However, Bakta was designed for automated de-novo annotations. Hence, I'm a bit reluctant to add the substantial additional complexity this would imply. In some cases, it might be better to use these external annotation lifting tools.
I guess, the original topic of this issue is addressed and solved. Hence, I'd close this for now. Otherwise, please do not hesitate to re-open this one, or file a new one regarding the lifting of annotations (just to keep issues lean and more understandable). Thanks!
While using Bakta, I realized that one of the known genes was not predicted. However, both Prokka and PGAP accurately predicted the gene. I tried using the --region option for this. Although Prokka's gbk file is read without problems, PGAP's file is not. Upon examining the gbk file generated by PGAP, I found that some of its genes have lengths that are not multiples of 3 (these genes are labeled as pseudogenes). After removing these genes, it ran successfully. However, manually modifying these files is too costly when applying PGAP annotations to Bakta in bulk. Therefore, I would like --region to automatically identify and skip these problematic genes, which would significantly improve efficiency.
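The check requested above can be illustrated with a small, standalone sketch. This is not Bakta's actual implementation: the helper names `location_length` and `is_usable_cds` are hypothetical, and the code assumes the CDS location is given as a GenBank location string (e.g. `complement(join(100..200,202..300))`) and the qualifiers as a plain dict. A CDS is skipped if it is flagged as a pseudogene, is partial (`<` or `>` markers), or has a total length that is not a multiple of 3.

```python
import re

# Matches one "start..end" span inside a GenBank location string,
# tolerating the partial markers "<" and ">".
SPAN_RE = re.compile(r"[<>]?(\d+)\.\.[<>]?(\d+)")


def location_length(location: str) -> int:
    """Total number of bases covered by all spans of a location string."""
    return sum(int(end) - int(start) + 1
               for start, end in SPAN_RE.findall(location))


def is_usable_cds(location: str, qualifiers: dict) -> bool:
    """True if the CDS looks like an intact coding region."""
    # Pseudogenes are flagged via /pseudo or /pseudogene qualifiers.
    if "pseudo" in qualifiers or "pseudogene" in qualifiers:
        return False
    # "<" or ">" marks a partial feature whose true end is unknown.
    if "<" in location or ">" in location:
        return False
    # An intact CDS consists of complete codons.
    return location_length(location) % 3 == 0
```

Filtering a list of (location, qualifiers) pairs with `is_usable_cds` before handing the regions to an annotation step would drop the problematic entries while keeping the rest of the file untouched.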