oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

Errors in submitting annotations to NCBI #330

kevinmyers closed this issue 1 month ago

kevinmyers commented 1 month ago

I submitted Bakta annotations to NCBI this week, and over half of them had fatal errors. They weren't hard to fix, but I wanted to let you know in case something can be done in a future update to avoid them. I am using Bakta version 1.9.1, installed with conda and run with the --compliant option.

FATAL: SUSPECT_PRODUCT_NAMES: 1 feature equals 'tmRNA'. Is this a tmRNA or is it a protein? (Looking at the product it appears to be a hypothetical protein, so I changed it to that)

FATAL: SUSPECT_PRODUCT_NAMES: 1 feature starts with '-' (Product name: putative-PNPOx domain-containing protein)

FATAL: SUSPECT_PRODUCT_NAMES: 2 features start with ''' (Product name: 'chromo' domain containing protein) (Product name: 'Cold-shock' DNA-binding domain)

FATAL: 1 feature contains 'remnant' (Product name: Remnant of transposase, IS3 family)

FATAL: SUSPECT_PRODUCT_NAMES: 2 features contain '#' (Product name: ATPase/5###-3### helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V)) (Product name: 3###-5### helicase subunit RecB of the DNA repair enzyme RecBCD (exonuclease V)) (Product name: putative DNA-binding protein with ###double-wing### structural motif, MmcQ/YjbR family) (Product name: Anthranilate synthase, amidotransferase component Para-aminobenzoate synthase, amidotransferase component # TrpAbPabAb) (Product name: Chorismate mutase I # AroHI)

FATAL: RRNA_NAME_CONFLICTS: 3 rRNA product names are not standard. Correct the names to the standard format, eg "16S ribosomal RNA" (Product name: (partial) 23S ribosomal RNA) (Product name: (5' truncated) 16S ribosomal RNA)
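For anyone who wants to screen product names before submitting, the messages above can be approximated with a small pre-check. This is only a sketch under stated assumptions: the rule list is inferred from the FATAL lines quoted in this comment, not from NCBI's actual validator or from Bakta's code, and `find_product_issues`, `SUSPECT_CHECKS` and `RRNA_STANDARD` are hypothetical names. The checks are literal string tests, so they may flag fewer or more names than NCBI does.

```python
import re

# Hypothetical rules inferred from the FATAL messages quoted above.
# This is NOT NCBI's rule set and NOT Bakta's implementation.
SUSPECT_CHECKS = [
    ("equals 'tmRNA'", lambda p: p.strip().lower() == "tmrna"),
    ("starts with '-'", lambda p: p.startswith("-")),
    ("starts with a quote", lambda p: p.startswith("'")),
    ("contains 'remnant'", lambda p: "remnant" in p.lower()),
    ("contains '#'", lambda p: "#" in p),
]

# Assumption: only the three bacterial rRNA names are treated as standard here.
RRNA_STANDARD = re.compile(r"^(5S|16S|23S) ribosomal RNA$")


def find_product_issues(product, feature_type="CDS"):
    """Return a list of labels describing why a product name looks suspect."""
    if feature_type == "rRNA":
        # rRNA products must match the standard form exactly.
        if RRNA_STANDARD.match(product):
            return []
        return ["non-standard rRNA name"]
    return [label for label, check in SUSPECT_CHECKS if check(product)]


if __name__ == "__main__":
    examples = [
        ("tmRNA", "CDS"),
        ("'chromo' domain containing protein", "CDS"),
        ("Remnant of transposase, IS3 family", "CDS"),
        ("Chorismate mutase I # AroHI", "CDS"),
        ("(partial) 23S ribosomal RNA", "rRNA"),
    ]
    for name, ftype in examples:
        print(ftype, repr(name), find_product_issues(name, ftype))
```

Running such a check over the product column of the annotation output before submission would surface the affected features early, so they can be renamed by hand.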

oschwengers commented 1 month ago

Hi @kevinmyers , thanks a lot for reaching out and reporting these things. It's hard to catch up with all potential submission issues, especially with cluttered-up protein names, but we will do our best.

I will add a couple of sanitizing rules and steps so that we will be able to handle as many of them as possible, soon.

Thanks again. I'll keep you updated.

kevinmyers commented 1 month ago

No problem. Bakta is the best annotation tool I've used for annotating our metagenomics samples. I love it and am happy to do whatever I can to help improve it.

oschwengers commented 1 month ago

OK, I have added a few additional checks and product improvements fixing the following:

However, for the following I need an example or, better, the exact feature entry, e.g. from the TSV or JSON file. This would help to pinpoint these cases.

kevinmyers commented 1 month ago

Thanks @oschwengers!

I'm attaching one of the discrepancy reports for the tmRNA problem. Here are the associated lines in the GFF file:

LacMBR1_d26_Ctrl_pb1    Prodigal    gene    585203  585481  .   -   .   ID=ACE6IH_02570_gene;locus_tag=ACE6IH_02570

LacMBR1_d26_Ctrl_pb1    Prodigal    CDS 585203  585481  .   -   0   ID=ACE6IH_02570;Name=hypothetical_protein;locus_tag=ACE6IH_02570;product=hypothetical_protein;Parent=ACE6IH_02570_gene;inference=ab initio prediction:Prodigal:2.6;Note=RefSeq:WP_048373218.1,SO:0001217,UniParc:UPI0006533E88,UniRef:UniRef100_A0A0J6JD09,UniRef:UniRef50_A0A7Y1EVI1,UniRef:UniRef90_A0A6A7YFX6

Discrepancy_UW_FK_PSEUD1_1_out.txt
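To pull the exact feature entry out of a GFF file like the one quoted above, the ninth column can be split into key/value pairs. The following is a minimal sketch assuming standard tab-separated GFF3 with `key=value` attributes joined by `;`; `parse_gff3_line` is a hypothetical helper, and the example line is an abbreviated copy of the CDS line above (the Note attribute is omitted).

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into its nine columns and an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = cols
    attributes = {}
    for pair in attr_text.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        attributes[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Abbreviated version of the CDS line quoted above (Note attribute left out).
example = "\t".join([
    "LacMBR1_d26_Ctrl_pb1", "Prodigal", "CDS", "585203", "585481", ".", "-", "0",
    "ID=ACE6IH_02570;Name=hypothetical_protein;locus_tag=ACE6IH_02570;"
    "product=hypothetical_protein;Parent=ACE6IH_02570_gene",
])
record = parse_gff3_line(example)

if __name__ == "__main__":
    print(record["attributes"]["locus_tag"], record["attributes"]["product"])
```

Looking up a locus tag this way makes it easy to match a line in the discrepancy report to the feature it refers to.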

oschwengers commented 1 month ago

Hmm, very odd/interesting. There are indeed entries in UniRef solely annotated with TmRNA. Bakta now discards these annotations, since they're not very informative anyway.
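The idea of discarding such annotations can be illustrated roughly as follows. This is not Bakta's actual code, only a sketch of the behaviour described above: `resolve_cds_product` and `UNINFORMATIVE_CDS_PRODUCTS` are hypothetical names, and falling back to "hypothetical protein" is an assumption based on the manual fix mentioned earlier in the thread.

```python
HYPOTHETICAL = "hypothetical protein"

# Assumed set of database-derived product names that carry no information
# for a CDS; compared case-insensitively.
UNINFORMATIVE_CDS_PRODUCTS = {"tmrna"}


def resolve_cds_product(product):
    """Return a usable CDS product name, falling back to 'hypothetical protein'."""
    if product is None:
        return HYPOTHETICAL
    cleaned = product.strip()
    if not cleaned or cleaned.lower() in UNINFORMATIVE_CDS_PRODUCTS:
        # Discard the uninformative annotation instead of propagating it.
        return HYPOTHETICAL
    return cleaned


if __name__ == "__main__":
    for name in ["TmRNA", None, " DNA gyrase subunit A "]:
        print(repr(name), "->", resolve_cds_product(name))
```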

Again, thanks a lot for reporting! These changes are now public in the main branch, and will be released with v1.10.0 soon. Just in case you face more of these often-occurring fatal errors, please do not hesitate to keep posting them (in new issues).