Make the --region parameter better match the output of PGAP

Dx-wmc commented 6 months ago

While using Bakta, I noticed that one known gene was not predicted, although both Prokka and PGAP predicted it correctly. I tried to work around this with the --region option. Prokka's gbk file is read without problems, but PGAP's is not. Examining the gbk file generated by PGAP, I found that some of its gene fragments have lengths that are not multiples of 3 (these genes are labeled as pseudogenes). After I removed these genes, Bakta ran successfully. However, editing the files by hand is too costly when applying PGAP annotations to Bakta in bulk. I would therefore like --region to automatically identify and skip these problematic genes, which would significantly improve efficiency.
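For illustration, the length check described above can be sketched in a few lines of Python. This is only a rough sketch working on raw GenBank location strings; the function names `location_length` and `is_whole_codons` are made up for this example and are not part of Bakta or PGAP. It assumes simple locations (plain spans, `join(...)`, `complement(...)`, and `<`/`>` partial markers) and does not handle remote references such as `J00194.1:100..202`.

```python
import re

# One "start..end" span or a single-base position; '<' and '>' mark partial ends.
_SPAN = re.compile(r"[<>]?(\d+)(?:\.\.[<>]?(\d+))?")


def location_length(location: str) -> int:
    """Total number of bases covered by a simple GenBank location string."""
    total = 0
    for match in _SPAN.finditer(location):
        start = int(match.group(1))
        # A single-base position has no end coordinate.
        end = int(match.group(2)) if match.group(2) else start
        total += end - start + 1
    return total


def is_whole_codons(location: str) -> bool:
    """True if the feature length is a multiple of 3."""
    return location_length(location) % 3 == 0


if __name__ == "__main__":
    for loc in ("1..300", "complement(join(100..200,300..400))", "<1..>299"):
        print(loc, location_length(loc), is_whole_codons(loc))
```

Features for which `is_whole_codons` returns `False` would be the candidates to drop before passing the file to --region.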

Dx-wmc commented 6 months ago

Another point of confusion for me: is it not possible to use a gbk file with the --region parameter for a different genome? I have manually annotated a gbk file, but when I try to apply it to another genome, the --region parameter reports an error.

oschwengers commented 4 months ago

Hi, and sorry for responding late.

First: to take a deeper look into the first issue, could you provide me with an example, i.e. the genome in Fasta and both the raw and the fixed gbk files?

Second: I'm not sure if I understand correctly what you're trying to achieve. You would like to re-use a Genbank file via --region for another genome?

Dx-wmc commented 4 months ago

Hi, I have prepared the example data as requested:

s5.fasta: The genome sequence annotated by PGAP pipeline.
s5.gbk: The raw GenBank file generated by PGAP for s5.fasta; it contains genes whose lengths are not multiples of 3.
s5_edited.gbk: The manually modified GenBank file where I removed the problematic genes so that Bakta's --region parameter works correctly. Directly using --region s5.gbk results in errors.
1011.fasta: Another genome sequence for which I would like to reuse the annotations from s5_edited.gbk. However, this fails unless I manually change the first line of s5_edited.gbk from s5 to 1011, which is cumbersome.
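The manual edit of the first line described above could be scripted along the following lines. This is a minimal sketch that only rewrites the name on each `LOCUS` line; `rename_loci` and the `mapping` argument are hypothetical names for this example. It assumes that the sequence name on the `LOCUS` line is what has to match, as in the workaround above, and it does not check whether the coordinates in the file are actually valid for the other genome. A longer name also shifts the remaining columns of the `LOCUS` line, which strict parsers may not accept.

```python
import re

# The LOCUS keyword, the whitespace after it, and the sequence name.
_LOCUS = re.compile(r"^(LOCUS\s+)(\S+)", flags=re.MULTILINE)


def rename_loci(gbk_text: str, mapping: dict) -> str:
    """Replace LOCUS names according to mapping (old name -> new name)."""

    def _swap(match):
        old_name = match.group(2)
        # Names that are not in the mapping are left unchanged.
        return match.group(1) + mapping.get(old_name, old_name)

    return _LOCUS.sub(_swap, gbk_text)


if __name__ == "__main__":
    example = "LOCUS       s5   5000 bp    DNA     linear   BCT 01-JAN-2024\n//\n"
    print(rename_loci(example, {"s5": "1011"}))
```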

Based on these observations, I would like to propose the following enhancements to Bakta:

1. Automatically exclude or skip genes that are not multiples of 3 in length when using the --region parameter.
2. Reuse an annotated GenBank file for different genomes without manual modification of the file.

I have attached the example data for your review. test.tar.gz

oschwengers commented 4 months ago

Hi @Dx-wmc, I just pushed a fix to skip a couple of cases where a CDS does not consist of complete triplets, i.e. pseudogenes, partial genes and programmed frameshifts. This should fix 1.
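To make the three cases concrete, here is a rough sketch of such a skip decision. It is not Bakta's actual implementation; the `Cds` class and `skip_reason` function are hypothetical and only illustrate how the cases could be told apart using the standard GenBank qualifiers `/pseudo`, `/pseudogene` and `/ribosomal_slippage` and the `<`/`>` partial markers in the location.

```python
from dataclasses import dataclass, field


@dataclass
class Cds:
    """Hypothetical, simplified view of one CDS feature from a GenBank file."""
    location: str                                  # raw location string
    length: int                                    # total bases covered
    qualifiers: set = field(default_factory=set)   # qualifier keys present


def skip_reason(cds):
    """Return why a CDS should be skipped, or None if it can be used."""
    if "pseudo" in cds.qualifiers or "pseudogene" in cds.qualifiers:
        return "pseudogene"
    if "<" in cds.location or ">" in cds.location:
        return "partial"
    if "ribosomal_slippage" in cds.qualifiers:
        return "programmed frameshift"
    # Catch-all for anything else that is not made of complete triplets.
    if cds.length % 3 != 0:
        return "length not a multiple of 3"
    return None


if __name__ == "__main__":
    print(skip_reason(Cds("100..400", 301, {"pseudo"})))
    print(skip_reason(Cds("1..300", 300)))
```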

Regarding 2.: Either I still don't quite get this use case, or I'm not convinced that this is a good idea. In which cases can we just map the predicted CDS regions of genome A to genome B?

Dx-wmc commented 4 months ago

I agree that my previous considerations were somewhat lacking, especially where the genomes differ substantially. My intention was mainly to apply carefully prepared annotations (e.g., ones that I had manually modified) to other, highly similar genomes, so that I would not have to modify them manually one by one.

oschwengers commented 4 months ago

OK, I see your totally valid point and this is certainly something that we should somehow try to address. However, we need to separate two distinct approaches:

1. refining the structural gene prediction
2. refining the functional gene annotation

For 1. we can use the --region feature, which requires exact gene locations for each genome. These can be provided either manually (which is very tedious) or by an automated process. Currently, we have a student in our lab working on a helper script that tries to automate this.

For 2. you can already use the --proteins feature to prioritize your manually curated gene symbols, protein descriptions, etc., as described here
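As a very rough idea of what an automated process for point 1 could look like for highly similar genomes, the sketch below searches for a curated gene sequence as an exact match in the target genome, on both strands. This is not the helper script mentioned above; `lift_exact` is a hypothetical name, and it assumes the gene occurs unchanged and exactly once in the target. Real annotation lifting tools rely on alignment to tolerate differences.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def lift_exact(gene_seq, target_genome):
    """Return (start, end, strand) with 1-based inclusive coordinates
    if the gene matches the target exactly once, otherwise None."""
    gene = gene_seq.upper()
    genome = target_genome.upper()
    hits = []
    for strand, query in (("+", gene), ("-", reverse_complement(gene))):
        pos = genome.find(query)
        while pos != -1:
            hits.append((pos + 1, pos + len(query), strand))
            pos = genome.find(query, pos + 1)
    # Missing or ambiguous matches are not transferred.
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    gene = "ATGAAACCCTAG"
    print(lift_exact(gene, "TTT" + gene + "GGG"))
```

The coordinates found this way would still have to be written to a GFF3 or GenBank file for the target genome before they could be passed to --region.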

Dx-wmc commented 4 months ago

I have tried the --proteins option you mentioned in your second point, but it has a limitation: it only works for the CDSs predicted by Bakta. So I think the first point is what I want to achieve. I suggest applying this function to shorter sequence comparisons first (for example, regions within 20k), which seems more conducive to comparative genomic analysis.

Dx-wmc commented 4 months ago

I tried another tool, LiftOn (https://github.com/Kuanhao-Chao/LiftOn), which fits my idea of migrating annotations from one genome to another, but it is designed for eukaryotes. Would it be possible to borrow ideas from it and apply them to Bakta?

oschwengers commented 1 month ago

Sorry for the late reply.

> The second point you mentioned --protein I have tried, it has certain limitations, that is, it only works for the cds predicted by bakta

Actually, user-provided protein annotations via --proteins should also be applied to user-provided CDS regions via --region.

> I tried another software, lifton(https://github.com/Kuanhao-Chao/LiftOn), which fits my idea of being able to do migration from annotation of one genome to another, but it is designed for eukaryotes, is it possible to borrow ideas from it to apply to bakta?

Yes, in principle, one could adopt these ideas. However, Bakta was designed for automated de novo annotation, so I'm a bit reluctant to add all the considerable complexity that this would imply. In some cases, it might be better to use these external annotation lifting tools.

oschwengers commented 1 month ago

I guess the original topic of this issue is addressed and solved, so I'll close this for now. Otherwise, please do not hesitate to re-open this one or file a new one regarding the lifting of annotations (just to keep issues lean and more understandable). Thanks!

oschwengers / bakta

Make the --region parameter better match the output of PGAP #288