oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

Name and Product don't match Dbxref #287

Open ktmeaton opened 2 months ago

ktmeaton commented 2 months ago

I annotated the S. pyogenes reference with bakta and noticed some oddly named CDS. When I compare bakta's results to the RefSeq annotation, the RefSeq and UniParc accessions are all correct and as expected. But the "Name" and "Product" reported in the bakta GFF output don't seem to match what's in those databases.

For example, locus BEAOJI_08040 should be called M-related protein Enn, based on its RefSeq and UniRef accessions. But the reported name is YSIRK-type signal peptide-containing protein instead. Here are some examples of mismatches in an important sub-typing region:

NCTC12064_contig_1      Prodigal        CDS     1589565 1590701 .       -       0       ID=BEAOJI_08040;Name=YSIRK-type signal peptide-containing protein;locus_tag=BEAOJI_08040;product=YSIRK-type signal peptide-containing protein;Dbxref=RefSeq:WP_111679867.1,SO:0001217,UniParc:UPI000DA29B8C,UniRef:UniRef100_UPI000DA29B8C,UniRef:UniRef50_P50468,UniRef:UniRef90_UPI001CF4D4D8
NCTC12064_contig_1      Prodigal        CDS     1592413 1593579 .       -       0       ID=BEAOJI_08050;Name=Fibrinogen- and Ig-binding protein;locus_tag=BEAOJI_08050;product=Fibrinogen- and Ig-binding protein;Dbxref=GO:0005576,GO:0019864,RefSeq:WP_038431637.1,SO:0001217,UniParc:UPI0004D1BE22,UniRef:UniRef100_UPI0004D1BE22,UniRef:UniRef50_P30141,UniRef:UniRef90_P30141;gene=mrp4
| Locus | Bakta | RefSeq | UniRef | RefSeq Accession | UniRef Accession |
| --- | --- | --- | --- | --- | --- |
| BEAOJI_08040 | YSIRK-type signal peptide-containing protein | M-related protein Enn | M-related protein Enn | WP_111679867.1 | UniRef100_UPI000DA29B8C |
| BEAOJI_08050 | Fibrinogen- and Ig-binding | YSIRK-type signal peptide-containing protein | YSIRK-type signal peptide-containing protein | WP_038431637.1 | UniRef100_UPI0004D1BE22 |
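To make this kind of comparison repeatable, the attribute column of each GFF record can be split into its name and cross-references. The following is a minimal sketch, not part of bakta: the function name `parse_gff3_attributes` is hypothetical, and it assumes standard GFF3 attribute syntax (`key=value` pairs separated by `;`, multiple `Dbxref` values separated by `,`, percent-encoding for reserved characters).

```python
from urllib.parse import unquote


def parse_gff3_attributes(gff_line: str) -> dict:
    """Return the column-9 attributes of one GFF3 record as a dict.

    Hypothetical helper for illustration. Dbxref is returned as a
    mapping from database prefix (RefSeq, UniRef, ...) to accessions.
    """
    # Split on any whitespace at most 8 times, so the ninth column keeps
    # the spaces inside values such as the product name.
    column9 = gff_line.split(None, 8)[8].strip()

    raw = {}
    for field in column9.split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        raw[key] = value

    dbxref = {}
    for xref in raw.get("Dbxref", "").split(","):
        if not xref:
            continue
        database, _, accession = xref.partition(":")
        dbxref.setdefault(database, []).append(unquote(accession))

    attrs = {key: unquote(value) for key, value in raw.items()}
    attrs["Dbxref"] = dbxref
    return attrs


# First record quoted in this issue.
example = (
    "NCTC12064_contig_1      Prodigal        CDS     1589565 1590701 .       -       0       "
    "ID=BEAOJI_08040;Name=YSIRK-type signal peptide-containing protein;"
    "locus_tag=BEAOJI_08040;product=YSIRK-type signal peptide-containing protein;"
    "Dbxref=RefSeq:WP_111679867.1,SO:0001217,UniParc:UPI000DA29B8C,"
    "UniRef:UniRef100_UPI000DA29B8C,UniRef:UniRef50_P50468,UniRef:UniRef90_UPI001CF4D4D8"
)

attrs = parse_gff3_attributes(example)
print(attrs["Name"])
print(attrs["Dbxref"]["RefSeq"], attrs["Dbxref"]["UniRef"])
```

The accessions collected this way can then be looked up by hand in RefSeq and UniProt and compared against `Name` and `product`.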

I found this line in the debug log, which says it's looking up UniRef90_UPI001CF4D4D8:

13:59:41.390 - DEBUG - PSC - lookup: contig=NCTC12064_contig_1, start=1589565, stop=1590701, strand=-, UniRef90=UniRef90_UPI001CF4D4D8, EC=, gene=, product=YSIRK-type signal peptide-containing protein

But it doesn't seem like UniRef90_UPI001CF4D4D8 exists. UniRef100_UPI001CF4D4D8 does exist, and that one is named "YSIRK-type signal peptide-containing protein". But UniRef100_UPI001CF4D4D8 isn't mentioned anywhere in the log or output.
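For anyone who wants to check other loci the same way, the PSC lookup lines can be extracted from a debug log with a regular expression. This is only a sketch: the pattern is inferred from the single log line quoted above, so the field order and separators are assumptions, and `parse_psc_lookup` is a hypothetical name rather than anything bakta provides.

```python
import re

# Pattern inferred from the one debug line quoted in this issue (assumption).
# "product" is matched last and greedily because product names may
# themselves contain commas.
PSC_LOOKUP_RE = re.compile(
    r"DEBUG - PSC - lookup: "
    r"contig=(?P<contig>[^,]*), "
    r"start=(?P<start>\d+), "
    r"stop=(?P<stop>\d+), "
    r"strand=(?P<strand>[^,]*), "
    r"UniRef90=(?P<uniref90>[^,]*), "
    r"EC=(?P<ec>[^,]*), "
    r"gene=(?P<gene>[^,]*), "
    r"product=(?P<product>.*)$"
)


def parse_psc_lookup(log_line: str):
    """Return the lookup fields as a dict, or None if the line does not match."""
    match = PSC_LOOKUP_RE.search(log_line.rstrip("\n"))
    return match.groupdict() if match else None


log_line = (
    "13:59:41.390 - DEBUG - PSC - lookup: contig=NCTC12064_contig_1, "
    "start=1589565, stop=1590701, strand=-, UniRef90=UniRef90_UPI001CF4D4D8, "
    "EC=, gene=, product=YSIRK-type signal peptide-containing protein"
)

record = parse_psc_lookup(log_line)
print(record["uniref90"], "->", record["product"])
```

Applied to every line of the log, this gives one UniRef90 accession and product per CDS, which can be matched to the GFF records by contig, start and stop.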

Versions

I'm using bakta v1.9.2 from the image bakta:1.9.2--pyhdfd78af_0 and the v5.1-full database.

bakta \
    --debug --genus Streptococcus --species pyogenes \
    --threads 9 \
    --prefix NCTC12064 \
    --db 5.1 \
    --locus NCTC12064_contig \
    Streptoccocus_pyogenes_strain_NCTC12064.fasta \
    > NCTC12064.out 2>&1

NCTC12064.log

oschwengers commented 2 months ago

Thanks a lot @ktmeaton for reporting! In principle, UniRef accessions are stable identifiers, but unfortunately they change with each UniProt release, which is unavoidable given the underlying clustering approach. I'd happily take a deeper look into this, but to do so I'd need the complete log file, if that is OK for you.

ktmeaton commented 2 months ago

Thanks for the response! Ah, that makes sense about the UniRef accessions changing compared to when the bakta database was released.

For the complete log file, do you need the NCTC12064.log with the host system information and file paths restored? I had originally redacted them for security purposes.

oschwengers commented 2 weeks ago

OK, I just took a deeper look into this. Indeed, the protein with UniRef100 ID UPI000DA29B8C belongs to the UniRef90 cluster UPI001CF4D4D8, which is/was annotated by UniProt as YSIRK-type signal peptide-containing protein and, as a member of the broader UniRef50 cluster, as M protein, serotype 2.1. So this is not related to Bakta's workflow, but to the database and its underlying source, in this case UniProt. There's not much we can do about this on our side. You could report an improved annotation to UniProt so that they can feed it into their annotation databases.