oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

gene= missing for some copies of the gene but present in others #329

Closed ebraginngd closed 5 hours ago

ebraginngd commented 1 month ago

Dear @oschwengers, thanks very much for this amazing annotation tool. I wonder whether the following is a bug or whether we are using it wrong:

We tried annotating an E. coli assembly with the latest Docker image (oschwengers/bakta:latest), using the following command:

bakta --db /mnt/db-full/ -o /mnt/out_path -t 2 assembly.fasta

We see that some genes have gene= short names and some don't. Is there a way to enforce short names? Interestingly, for the same gene (Outer membrane porin C), of which there are two slightly different copies, one was annotated with the short name:

contig_104 Prodigal CDS 2607 3707 . - 0 ID=LIIJEP_24530;Name=Outer membrane porin C 2;locus_tag=LIIJEP_24530;product=Outer membrane porin C 2;Dbxref=COG:COG3203,COG:M,GO:0009279,GO:0015288,GO:0034220,GO:0046930,KEGG:K16076,RefSeq:WP_000768393.1,SO:0001217,UniParc:UPI00016A10FE,UniRef:UniRef100_A0A0D8WD33,UniRef:UniRef50_P06996,UniRef:UniRef90_A0A4P7TME1;gene=ompC2

and one without:

contig_2 Prodigal CDS 40084 41178 . + 0 ID=LIIJEP_01180;Name=Outer membrane porin C;locus_tag=LIIJEP_01180;product=Outer membrane porin C;Dbxref=RefSeq:WP_000865539.1,SO:0001217,UniParc:UPI00000B81BE,UniRef:UniRef100_Q9K597,UniRef:UniRef50_Q56828,UniRef:UniRef90_Q9K597

The sample in question is DRR387971.
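To list every CDS that lacks a gene= attribute in a GFF3 file, something like the following could be used. This is only a minimal sketch and not part of Bakta: the function name cds_without_gene is hypothetical, it assumes standard nine-column, tab-separated GFF3, and the sample records are shortened versions of the two features quoted above.

```python
import io


def cds_without_gene(handle):
    """Yield (locus_tag, product) for CDS features that have no gene= attribute."""
    for line in handle:
        if line.startswith("##FASTA"):
            break  # GFF3 may embed sequences after this directive; no features follow
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # Column 9 holds key=value pairs separated by semicolons.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "gene" not in attrs:
            yield attrs.get("locus_tag", attrs.get("ID")), attrs.get("product")


# Shortened sample records (hypothetical stand-in for a real Bakta GFF3 file).
sample = io.StringIO(
    "##gff-version 3\n"
    "contig_104\tProdigal\tCDS\t2607\t3707\t.\t-\t0\t"
    "ID=LIIJEP_24530;locus_tag=LIIJEP_24530;"
    "product=Outer membrane porin C 2;gene=ompC2\n"
    "contig_2\tProdigal\tCDS\t40084\t41178\t.\t+\t0\t"
    "ID=LIIJEP_01180;locus_tag=LIIJEP_01180;"
    "product=Outer membrane porin C\n"
)
missing = list(cds_without_gene(sample))
print(missing)
```

For a real annotation, the io.StringIO sample would be replaced by an opened GFF3 file handle.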

ebraginngd commented 1 month ago

Just to add: when I tried the online web tool, it did annotate this entry with gene=ompC. However, I see that both the software and DB versions are different: https://bakta.computational.bio/job/eyJqb2JJRCI6IjA2YjI2NmZiLWRiZGQtNGY5My05YzFkLTFiYWUwNTEwYWM3YiIsInNlY3JldCI6Ikg3SUhrVDFhRzQ1Qk1vREVKODJybzdMT3ltZ29TSE5ZQVRXTjBNUmdFNmcifQ==

oschwengers commented 1 month ago

Hi @ebraginngd, thanks for reaching out with this. In principle, and based on the information provided above, this is not a bug but just the occurrence of two different genes having very similar functional descriptions.

As you can see in the Dbxrefs, the first is a member of the UniRef50_P06996 protein cluster that is annotated with a gene symbol ompC, while the second is a member of the UniRef50_Q56828 protein cluster without any gene symbol annotation.
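This comparison can be reproduced directly from the GFF3 attribute column. The following is a minimal sketch, not part of Bakta: the helper names parse_attributes and uniref50_cluster are hypothetical, and the attribute strings are shortened from the two records quoted in this issue.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[key] = value
    return attrs


def uniref50_cluster(attrs):
    """Return the UniRef50 cluster ID found in the Dbxref list, or None."""
    for xref in attrs.get("Dbxref", "").split(","):
        if xref.startswith("UniRef:UniRef50_"):
            return xref.split(":", 1)[1]
    return None


# Shortened attribute columns of the two CDS features from this issue.
first = parse_attributes(
    "ID=LIIJEP_24530;product=Outer membrane porin C 2;"
    "Dbxref=RefSeq:WP_000768393.1,UniRef:UniRef50_P06996;gene=ompC2"
)
second = parse_attributes(
    "ID=LIIJEP_01180;product=Outer membrane porin C;"
    "Dbxref=RefSeq:WP_000865539.1,UniRef:UniRef50_Q56828"
)

# Print locus ID, UniRef50 cluster and gene symbol ("-" if absent).
for attrs in (first, second):
    print(attrs["ID"], uniref50_cluster(attrs), attrs.get("gene", "-"))
```

The two features resolve to different UniRef50 cluster IDs, and only the first carries a gene attribute.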

Since these are members of two different UniRef50 clusters, we can be fairly sure that they share a sequence identity of at most about 50%, which is fairly low. Hence, it could simply be the case that these are in fact two different genes.

... OR it could simply be the very common case that one protein cluster in UniRef is better annotated than others.

I hope this helps to clarify this a bit. If not, please do not hesitate to keep asking.