oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

Make the --region parameter better match the output of PGAP #288

Open Dx-wmc opened 2 months ago

Dx-wmc commented 2 months ago

While using Bakta, I noticed that one known gene was not predicted, although both Prokka and PGAP predicted it correctly. I tried to work around this with the --region option. Prokka's gbk file is read without problems, but PGAP's file is not. Looking at the gbk file generated by PGAP, I found that some of its gene features have lengths that are not multiples of 3 (these genes are labeled as pseudogenes). After removing these genes, Bakta ran successfully. However, editing these files by hand is too costly when applying PGAP annotations to Bakta in bulk. I would therefore like --region to automatically detect and skip these problematic genes, which would make this workflow much more efficient.
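As an interim workaround, the length check described above can be sketched in plain Python. This is only an illustration, not Bakta's code: the function names `cds_length` and `split_cds_by_frame` are hypothetical, and the sketch assumes each CDS location is given as a GenBank location string made of `start..end` spans (single-base positions are ignored).

```python
import re

# One coordinate pair such as "10..21" or "<10..>21" inside a GenBank location string.
_SPAN = re.compile(r"<?(\d+)\.\.>?(\d+)")


def cds_length(location: str) -> int:
    """Total length in bp of a location such as 'complement(join(1..6,10..15))'."""
    return sum(int(end) - int(start) + 1 for start, end in _SPAN.findall(location))


def split_cds_by_frame(locations):
    """Split CDS locations into (multiple of 3, not a multiple of 3)."""
    keep, skip = [], []
    for loc in locations:
        (keep if cds_length(loc) % 3 == 0 else skip).append(loc)
    return keep, skip


if __name__ == "__main__":
    keep, skip = split_cds_by_frame(["10..21", "complement(100..110)", "join(1..6,10..15)"])
    print("usable:", keep)
    print("skipped:", skip)
```

The locations in `skip` are the ones that would have to be removed from the gbk file before passing it to --region.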

Dx-wmc commented 2 months ago

Another point of confusion for me: is it not possible to use a gbk file with the --region parameter for a different genome? I have manually annotated a gbk file, but when I try to apply it to another genome, the --region parameter reports an error.

oschwengers commented 3 weeks ago

Hi, and sorry for responding late.

First: to take a deeper look into the first issue, could you provide me with an example, i.e. the genome in FASTA format and both the raw and fixed gbk files?

Second: I'm not sure if I understand correctly what you're trying to achieve. You would like to re-use a Genbank file via --region for another genome?

Dx-wmc commented 3 weeks ago

Hi, I have prepared the example data as requested:

  1. s5.fasta: The genome sequence annotated by the PGAP pipeline.
  2. s5.gbk: The raw GenBank file generated by PGAP for s5.fasta, which contains genes whose lengths are not multiples of 3.
  3. s5_edited.gbk: The manually modified GenBank file where I removed the problematic genes so that Bakta's --region parameter works correctly. Directly using --region s5.gbk results in errors.
  4. 1011.fasta: Another genome sequence for which I would like to reuse the annotations from s5_edited.gbk. However, this fails unless I manually change the first line of s5_edited.gbk from s5 to 1011, which is cumbersome.
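The manual edit in item 4 could be scripted roughly as follows. This is a minimal sketch under two assumptions: that the "first line" is the GenBank LOCUS line, and that the error comes from the LOCUS name not matching the sequence ID in the FASTA file. The helper name `rename_locus` is hypothetical, and only the first record of the file is changed.

```python
def rename_locus(gbk_text: str, new_name: str) -> str:
    """Return gbk_text with the name on the first LOCUS line replaced by new_name."""
    lines = gbk_text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.startswith("LOCUS"):
            # Split into keyword, old name, and the untouched remainder of the line.
            parts = line.split(None, 2)
            if len(parts) == 3:
                lines[i] = f"LOCUS       {new_name} {parts[2]}"
            break  # only the first record is renamed in this sketch
    return "".join(lines)


if __name__ == "__main__":
    example = "LOCUS       s5    1200 bp    DNA     linear   BCT 01-JAN-2024\nDEFINITION  example\n"
    print(rename_locus(example, "1011"), end="")
```

Renaming the record only makes the file load; the gene coordinates are still those of s5 and are only meaningful for 1011 if both genomes are collinear at those positions.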

Based on these observations, I would like to propose the following enhancements to Bakta:

  1. Automatically exclude or skip genes that are not multiples of 3 in length when using the --region parameter.
  2. Reuse an annotated GenBank file for different genomes without manual modification of the file.

I have attached the example data for your review. test.tar.gz

oschwengers commented 4 days ago

Hi @Dx-wmc, I just pushed a fix to skip a couple of cases where a CDS does not consist of complete triplets, i.e. pseudogenes, partial genes and programmed frameshifts. This should fix 1.

Regarding 2.: Either I still don't quite get this use case, or I'm not convinced that this is a good idea. In which cases can we just map the predicted CDS regions of genome A to genome B?

Dx-wmc commented 3 days ago

I agree that my previous considerations were somewhat lacking (especially for genomes that differ substantially). However, my intention was mainly to apply carefully prepared annotations (e.g. ones I had manually modified) to other, highly similar genomes, which would save me from modifying them manually one by one.

oschwengers commented 3 days ago

OK, I see your totally valid point and this is certainly something that we should somehow try to address. However, we need to separate two distinct approaches:

  1. refining the structural gene prediction
  2. refining the functional gene annotation

For 1. we can use the --region feature, which requires exact gene locations for each genome. These can be provided either manually (which is very tedious) or by an automated process. Currently, we have a student in our lab working on a helper script that tries to automate this.
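To illustrate what "exact gene locations for each genome" implies, here is a deliberately naive sketch that transfers one CDS from a source genome to a target genome by exact sequence matching. It is not the helper script mentioned above, whose design is unknown here; the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and a single mismatch is enough to lose the gene, so a real tool would use alignment instead.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def _find_all(haystack: str, needle: str):
    """Yield every 0-based start index of needle in haystack (overlaps included)."""
    pos = haystack.find(needle)
    while pos != -1:
        yield pos
        pos = haystack.find(needle, pos + 1)


def transfer_cds(source: str, target: str, start: int, end: int, strand: str):
    """Map a source CDS (1-based, inclusive) onto target; None unless exactly one exact hit."""
    fragment = source[start - 1:end].upper()
    target = target.upper()
    flipped = "-" if strand == "+" else "+"
    hits = [(pos, strand) for pos in _find_all(target, fragment)]
    rc = reverse_complement(fragment)
    if rc != fragment:  # avoid counting palindromic fragments twice
        hits += [(pos, flipped) for pos in _find_all(target, rc)]
    if len(hits) != 1:
        return None  # absent or ambiguous: leave this gene to the regular prediction
    pos, new_strand = hits[0]
    return pos + 1, pos + len(fragment), new_strand


if __name__ == "__main__":
    src = "TTTATGAAACCCTAGGG"
    print(transfer_cds(src, "GGCCATGAAACCCTAGTT", 4, 15, "+"))
```

Returning `None` for missing or repeated hits keeps the sketch conservative: only unambiguous locations would be written to a regions file.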

For 2. you can already use the --protein feature to prioritize your manually curated gene symbols, protein descriptions, etc., as described here

Dx-wmc commented 3 days ago

I have tried the --protein option you mentioned in your second point, but it has a limitation: it only works for CDSs predicted by Bakta. So I think the first point is what I want to achieve. I suggest applying this function to shorter sequences first (for example, regions of up to 20 kb, which seems better suited to comparative genomic analysis).

Dx-wmc commented 2 days ago

I tried another tool, LiftOn (https://github.com/Kuanhao-Chao/LiftOn), which matches my idea of transferring annotations from one genome to another. However, it is designed for eukaryotes. Would it be possible to borrow ideas from it for Bakta?