Open ktmeaton opened 2 months ago
Thanks a lot @ktmeaton for reporting! In principle, UniRef accessions are stable identifiers, but unfortunately they change with each UniProt release, which cannot be avoided due to the underlying cluster approach. I'd happily take a deeper look into that, but in order to do so, I'd need the complete log file, if that is OK for you.
Thanks for the response! Ah, that makes sense about the UniRef accessions changing compared to when the bakta database was released.
For the complete log file, do you need the NCTC12064.log with the host system information/file paths restored? I had originally redacted them for security purposes.
OK, I just took a deeper look into this. Indeed, the protein with UniRef100 ID UPI000DA29B8C belongs to the UniRef90 ID UPI001CF4D4D8, which is/was annotated by UniProt as "YSIRK-type signal peptide-containing protein" and, as a member of the broader UniRef50 cluster, as "M protein, serotype 2.1". So, this is not related to the workflow of Bakta, but to the database and its underlying database source, in this case UniProt. There's not much we could do about this. You could report an improved annotation for this to UniProt so that they could feed it into their annotation databases?
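To make the point concrete, here is a minimal sketch of why the cluster-level name ends up on the protein, assuming the product is taken from the UniRef90 cluster that the UniRef100 entry belongs to. The two dictionaries and the function name are hypothetical stand-ins that hold only the values quoted in this comment; they are not Bakta's actual code or database schema.

```python
# Hypothetical lookup tables filled with the IDs and name quoted in this thread.
# They only illustrate the idea; this is not Bakta's database layout.
UNIREF100_TO_UNIREF90 = {"UPI000DA29B8C": "UPI001CF4D4D8"}
UNIREF90_PRODUCT = {"UPI001CF4D4D8": "YSIRK-type signal peptide-containing protein"}


def product_via_uniref90(uniref100_id):
    """Return the product name of the UniRef90 cluster a UniRef100 ID belongs to."""
    uniref90_id = UNIREF100_TO_UNIREF90.get(uniref100_id)
    if uniref90_id is None:
        # Unknown UniRef100 ID: nothing to inherit a name from
        return None
    return UNIREF90_PRODUCT.get(uniref90_id)


print(product_via_uniref90("UPI000DA29B8C"))
```

Under that assumption, whatever name UniProt assigns to the cluster is what the protein inherits, so a better name has to come from the upstream source.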
I annotated the S. pyogenes reference with bakta and noticed some oddly named CDS. When I compare bakta's results to the RefSeq annotation, the accessions for RefSeq and UniParc are all correct and as expected. But the reported "Name" and "Product" in the bakta gff output don't seem to match what's in those databases.
For example, locus BEAOJI_08040 should be called M-related protein Enn, based on its RefSeq and UniRef accessions. But the reported name is YSIRK-type signal peptide-containing protein instead. Here are some examples of mismatches that are in an important sub-typing region:
I found this line in the debug log that says it's looking up UniRef90_UPI001CF4D4D8:
But it doesn't seem like UniRef90_UPI001CF4D4D8 exists? UniRef100_UPI001CF4D4D8 does exist, and that one is named "YSIRK-type signal peptide-containing protein". But UniRef100_UPI001CF4D4D8 isn't mentioned anywhere in the log or output.
Versions
I'm using bakta v1.9.2 from the image bakta:1.9.2--pyhdfd78af_0 and the v5.1-full database.
NCTC12064.log
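
For anyone who wants to repeat the Name/Product comparison described in this report, a rough sketch of pulling those fields out of a GFF3 attributes column follows. The attribute string is a hypothetical example assembled from values mentioned above, not a line copied from the real output, and the key names (product, Dbxref) are assumed to follow the usual GFF3 conventions.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Split a GFF3 attributes column (key=value pairs joined by ';') into a dict of lists."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        # Multiple values are comma-separated in GFF3; literal commas are
        # percent-encoded, so split first and decode afterwards.
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attributes


# Hypothetical attribute string built from values in this report (not real output)
example = (
    "ID=BEAOJI_08040;"
    "Name=YSIRK-type signal peptide-containing protein;"
    "locus_tag=BEAOJI_08040;"
    "product=YSIRK-type signal peptide-containing protein;"
    "Dbxref=UniRef:UniRef90_UPI001CF4D4D8"
)

attrs = parse_gff3_attributes(example)
expected = "M-related protein Enn"  # name expected from the RefSeq annotation
reported = attrs["product"][0]
print(f"{attrs['locus_tag'][0]}: reported={reported!r}, expected={expected!r}")
```

Running the same parser over every CDS line and comparing against the RefSeq product names would give a full list of mismatches rather than a few hand-picked ones.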