Closed kevinmyers closed 1 month ago
Hi @kevinmyers , thanks a lot for reaching out and reporting these things. It's hard to catch up with all potential submission issues, especially with cluttered-up protein names, but we will do our best.
I will soon add a couple of sanitizing rules and steps so that we can handle as many of them as possible.
Thanks again. I'll keep you updated.
No problem. Bakta is the best annotation tool I've used for annotating our metagenomics samples. I love it and am happy to do whatever I can to help improve it.
OK, I have added a few additional checks and product improvements fixing the following:
However, for the following I need an example, or better, the exact feature entry, e.g. from the tsv or json file. This would help to pinpoint these cases.
Thanks @oschwengers!
I'm attaching one of the discrepancy reports for the tmRNA problem. Here are the associated lines in the GFF file:
LacMBR1_d26_Ctrl_pb1 Prodigal gene 585203 585481 . - . ID=ACE6IH_02570_gene;locus_tag=ACE6IH_02570
LacMBR1_d26_Ctrl_pb1 Prodigal CDS 585203 585481 . - 0 ID=ACE6IH_02570;Name=hypothetical_protein;locus_tag=ACE6IH_02570;product=hypothetical_protein;Parent=ACE6IH_02570_gene;inference=ab initio prediction:Prodigal:2.6;Note=RefSeq:WP_048373218.1,SO:0001217,UniParc:UPI0006533E88,UniRef:UniRef100_A0A0J6JD09,UniRef:UniRef50_A0A7Y1EVI1,UniRef:UniRef90_A0A6A7YFX6
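For anyone who wants to pull the product out of a feature line like the one above, here is a minimal sketch in Python. It assumes a standard tab-separated GFF3 line with nine columns; the function name `parse_gff3_attributes` is hypothetical and is not part of Bakta. The example line reuses a subset of the attributes from the CDS entry above.

```python
from urllib.parse import unquote


def parse_gff3_attributes(line: str) -> dict[str, str]:
    """Return the column-9 attributes of one GFF3 feature line as a dict.

    Hypothetical helper for illustration only, not Bakta's own parser.
    """
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    attributes: dict[str, str] = {}
    for field in columns[8].split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        key, _, value = field.partition("=")
        attributes[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return attributes


# Example line rebuilt with tabs; only some attributes of the entry are kept.
example_line = "\t".join([
    "LacMBR1_d26_Ctrl_pb1", "Prodigal", "CDS", "585203", "585481", ".", "-", "0",
    "ID=ACE6IH_02570;Name=hypothetical_protein;locus_tag=ACE6IH_02570;"
    "product=hypothetical_protein;Parent=ACE6IH_02570_gene",
])

attrs = parse_gff3_attributes(example_line)
print(attrs["locus_tag"], attrs["product"])
```

Note that lines pasted into a comment usually lose their tabs, so this only works on the original file.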
Hmm, very odd/interesting. There are indeed entries in UniRef solely annotated with TmRNA. Bakta now discards these annotations, since they're not very informative anyway.
Again, thanks a lot for reporting! These changes are now public in the main branch, and will be released with v1.10.0 soon. Just in case you face more of these often-occurring fatal errors, please do not hesitate to keep posting them (in new issues).
I submitted Bakta annotations to NCBI this week and over half had some fatal errors. They weren't hard to fix, but I wanted to let you know in case there's something that can be done with a future update to avoid them. I am using Bakta version 1.9.1 installed using conda and ran it with the --compliant flag.
FATAL: SUSPECT_PRODUCT_NAMES: 1 feature equals 'tmRNA'. Is this a tmRNA or is it a protein? (Looking at the product it appears to be a hypothetical protein, so I changed it to that)
FATAL: SUSPECT_PRODUCT_NAMES: 1 feature starts with '-' (Product name: putative-PNPOx domain-containing protein)
FATAL: SUSPECT_PRODUCT_NAMES: 2 features start with ''' (Product name: 'chromo' domain containing protein) (Product name: 'Cold-shock' DNA-binding domain)
FATAL: 1 feature contains 'remnant' (Product name: Remnant of transposase, IS3 family)
FATAL: SUSPECT_PRODUCT_NAMES: 2 features contain '#' (Product name: ATPase/5###-3### helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V)) (Product name: 3###-5### helicase subunit RecB of the DNA repair enzyme RecBCD (exonuclease V)) (Product name: putative DNA-binding protein with ###double-wing### structural motif, MmcQ/YjbR family) (Product name: Anthranilate synthase, amidotransferase component Para-aminobenzoate synthase, amidotransferase component # TrpAbPabAb) (Product name: Chorismate mutase I # AroHI)
FATAL: RRNA_NAME_CONFLICTS: 3 rRNA product names are not standard. Correct the names to the standard format, eg "16S ribosomal RNA" (Product name: (partial) 23S ribosomal RNA) (Product name: (5' truncated) 16S ribosomal RNA)
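In case it helps others pre-screen their annotations before submission, the checks named in the messages above can be sketched roughly as follows. This is only an illustration in Python: the pattern list is taken from the report messages in this post, not from NCBI's full rule set, and it is not how Bakta implements its sanitizing. The names `SUSPECT_PATTERNS`, `find_suspect_reasons` and `is_standard_rrna_name` are hypothetical, and the rRNA pattern is an assumption based on the "16S ribosomal RNA" example in the report.

```python
import re

# (reason, pattern) pairs mirroring the SUSPECT_PRODUCT_NAMES messages above.
# Assumption: this is a small subset of what NCBI actually checks.
SUSPECT_PATTERNS = [
    ("equals 'tmRNA'", re.compile(r"^tmRNA$", re.IGNORECASE)),
    ("starts with '-'", re.compile(r"^-")),
    ("starts with a single quote", re.compile(r"^'")),
    ("contains 'remnant'", re.compile(r"remnant", re.IGNORECASE)),
    ("contains '#'", re.compile(r"#")),
]

# Assumed "standard" shape, based on the example given in the report.
STANDARD_RRNA = re.compile(r"^\d+(\.\d+)?S ribosomal RNA$")


def find_suspect_reasons(product: str) -> list[str]:
    """Return the reasons for which a product name would be flagged here."""
    name = product.strip()
    return [reason for reason, pattern in SUSPECT_PATTERNS if pattern.search(name)]


def is_standard_rrna_name(product: str) -> bool:
    """True if the name looks like '16S ribosomal RNA' with no extra text."""
    return STANDARD_RRNA.match(product.strip()) is not None


if __name__ == "__main__":
    for product in [
        "tmRNA",
        "'chromo' domain containing protein",
        "Remnant of transposase, IS3 family",
        "Chorismate mutase I # AroHI",
        "hypothetical protein",
    ]:
        print(product, "->", find_suspect_reasons(product))
    print(is_standard_rrna_name("(partial) 23S ribosomal RNA"))
```

An empty list means none of these particular checks matched; it does not guarantee that the name will pass NCBI validation.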