oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

Compatible file for NCBI submission? #69

Closed michoug closed 3 years ago

michoug commented 3 years ago

Hi, I'm in the process of submitting genomes annotated with Bakta to NCBI. While checking the annotation for quality and errors (via table2asn and the GFF3 file, see https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/#run), I encountered several issues. I understand if this is not something the tool will be compatible with, as it can be quite tricky, but here is a list of some of the issues I had that might be addressed in the GFF file:

contig_1    Prodigal    CDS 3   179 .   -   0   ID=DOCECA_00005;locus_tag=DOCECA_00005;product=hypothetical protein
contig_1    Bakta   gene    3   179 .   -   0   ID=DOCECA_00005;locus_tag=DOCECA_00005

Commas that are intended to be part of a name should be encoded (%2C) according to the GFF3 specifications. However, literal commas should only be included when they are part of enzymatic names. Semi-colons generally should not be included in product names.

Best, Greg

oschwengers commented 3 years ago

Hi @michoug , thanks a lot for reporting this. So far we've tested the submission only for ENA. Of course, we're keen to make NCBI submissions as smooth as possible, too.

I'll encode the products as requested in the GFF3 specifications.
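For illustration, the encoding could look roughly like the sketch below. It is not Bakta's actual code: the function name `encode_gff3_value` is hypothetical, and the character set is taken from the GFF3 rule that `;`, `=`, `&` and `,` are reserved in column 9 and that `%` and control characters must always be escaped.

```python
import re

# Characters that must be percent-encoded inside a GFF3 attribute value:
# the column-9 reserved characters (; = & ,), the percent sign itself,
# and control characters such as tab and newline.
_GFF3_RESERVED = re.compile(r"[%;=&,\x00-\x1f\x7f]")


def encode_gff3_value(value: str) -> str:
    """Percent-encode reserved characters in one attribute value (hypothetical helper)."""
    return _GFF3_RESERVED.sub(lambda m: f"%{ord(m.group()):02X}", value)


if __name__ == "__main__":
    product = "Phosphotransferase system, HPr-related proteins"
    print(f"product={encode_gff3_value(product)}")
```

Only the value is encoded, never the whole attribute string, so the `;` and `=` that separate and assign attributes stay intact.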

For the 1st and 3rd point, I think it might be best to add a --compliant option in line with the Prokka option to explicitly activate this behavior that might not be desired in other situations.

Is this a complete list of all the issues you encountered? Also, could you provide an example of the commands you used to generate the submission files? This could help other users go through this process. Maybe I'll add a section to the README as well.

michoug commented 3 years ago

Hi, thanks for the super-fast response. The issues highlighted here are the main ones (e.g. FATAL); there are others that depend more on the names of the products (see the attached Issues.txt for a list for one genome).

Here is the command that I used to generate submission files:

Here is the link to the documentation: https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/#run

oschwengers commented 3 years ago

Thanks for the detailed information - that helps a lot. I've already addressed the lacking gene and product encoding issues.

However, fixing the Dbxrefs and fatal product descriptions might take somewhat longer. But I've put this on the list for the upcoming 1.1 version which will hopefully be released in the next weeks.

oschwengers commented 3 years ago

Interesting side effect: adhering to the GFF3 comma encoding convention (%2C) leads to FATAL: SUSPECT_PRODUCT_NAMES: 62 features contain '%'. Any idea how that could be bypassed? Or is this something that should maybe be reported upstream to be fixed in the table2asn_GFF tool?

oschwengers commented 3 years ago

Hi @michoug , I've added a couple of fixes and improvements for GFF3 based GenBank submissions via table2asn_GFF. All of the points you've raised above should be addressed and all issues should be solved. If this is not the case, please do not hesitate to reach out and re-open this issue.

I'll release v1.1.0 containing these improvements soon - most certainly next week.

Please let me know if there are any further issues - I'm looking forward to your feedback. Thanks again for reporting and best regards!

michoug commented 3 years ago

Hi, congrats on all the fast work. I have a few other "issues" that may eventually be addressed, even though I'm well aware that this process is sometimes a bottomless pit and quite tricky to automate...

SUSPECT_PRODUCT_NAMES: 8 features May contain plural
E141.sqn:CDS    Urea carboxylase without Allophanate hydrolase 2 domains    lcl|contig_1:c493999-492260 GKKCFE_02155
E141.sqn:CDS    Phosphotransferase system, HPr-related proteins lcl|contig_1:c658214-657810 GKKCFE_02965
E141.sqn:CDS    Hemolysins-related protein containing CBS domains   lcl|contig_1:c830356-829115 GKKCFE_03775
E141.sqn:CDS    Phage tail assembly chaperone proteins, E, or 41 or 14  lcl|contig_1:952781-953356  GKKCFE_04360
E141.sqn:CDS    Peptidoglycan/LPS O-acetylase OafA/YrhL, contains acyltransferase and SGNH-hydrolase domains    lcl|contig_1:c1007564-1006416   GKKCFE_04650
E141.sqn:CDS    Diguanylate cyclase with PAS/PAC and GAF sensors    lcl|contig_1:1171567-1172943    GKKCFE_05445

SUSPECT_PRODUCT_NAMES: 31 features contain 'unknown'
E141.sqn:CDS    Family of unknown function (DUF6124)    lcl|contig_1:c109230-108889 GKKCFE_00500
E141.sqn:CDS    Family of unknown function (DUF6124)    lcl|contig_1:254342-254698  GKKCFE_01120
E141.sqn:CDS    Family of unknown function (DUF6124)    lcl|contig_1:580095-580460  GKKCFE_02580

SUSPECT_PRODUCT_NAMES: 34 features contains three or more numbers together that may be identifiers more appropriate in note
E141.sqn:CDS    Uvs098  lcl|contig_1:252015-252467  GKKCFE_01095
E141.sqn:CDS    UPF0313 protein PSPTO_4928  lcl|contig_1:302226-304526  GKKCFE_01330
E141.sqn:CDS    L-pipecolate oxidase (1537) lcl|contig_1:320431-321714  GKKCFE_01405
E141.sqn:CDS    HI0933-like protein lcl|contig_1:c490707-489466 GKKCFE_02145
E141.sqn:CDS    Putative hydro-lyase B723_09185 lcl|contig_1:c496428-495622 GKKCFE_02165
E141.sqn:CDS    UPF0114 protein C7528_102400    lcl|contig_1:554275-554763  GKKCFE_02435
E141.sqn:CDS    UPF0225 protein CD58_06560  lcl|contig_1:c1018229-1017732   GKKCFE_04695
E141.sqn:CDS    UPF0276 protein SAMN03159293_01947  lcl|contig_1:c1039843-1038974   GKKCFE_04820

SUSPECT_PRODUCT_NAMES: 188 features contain underscore
E141.sqn:CDS    GBBH-like_N domain-containing protein   lcl|contig_1:c125879-125502 GKKCFE_00600
E141.sqn:CDS    FAD_binding_3 domain-containing protein lcl|contig_1:c168453-167206 GKKCFE_00760
E141.sqn:CDS    ABC_trans_aux domain-containing protein lcl|contig_1:261845-262549  GKKCFE_01150
E141.sqn:CDS    MotA_ExbB domain-containing protein lcl|contig_1:272991-273842  GKKCFE_01195
E141.sqn:CDS    UPF0313 protein PSPTO_4928  lcl|contig_1:302226-304526  GKKCFE_01330
E141.sqn:CDS    Peripla_BP_6 domain-containing protein  lcl|contig_1:322080-323216  GKKCFE_01410
E141.sqn:CDS    Znf/thioredoxin_put domain-containing protein   lcl|contig_1:c389672-388437 GKKCFE_01700
E141.sqn:CDS    Cupin_3 domain-containing protein   lcl|contig_1:c469729-469385 GKKCFE_02035
E141.sqn:CDS    ZT_dimer domain-containing protein  lcl|contig_1:c476803-475910 GKKCFE_02080

SUSPECT_PRODUCT_NAMES: 1 feature contains '(TC'
E141.sqn:CDS    Sodium/proton antiporter, CPA1 family (TC 2A36) lcl|contig_1:c3087019-3085778   GKKCFE_13940

SUSPECT_PRODUCT_NAMES: 1 feature contains 'FOG'
E141.sqn:CDS    FOG: TPR repeat, SEL1 subfamily lcl|contig_1:c4136402-4136001   GKKCFE_18665

FATAL: SUSPECT_PRODUCT_NAMES: 1 feature contains '?'
E141.sqn:CDS    ABC transporter, substrate-binding protein (Cluster 15, trp?)   lcl|contig_1:4026495-4027427    GKKCFE_18180

FATAL: SUSPECT_PRODUCT_NAMES: 2 features contain '@'
E141.sqn:CDS    Deblocking aminopeptidase @ Cyanophycinase 2    lcl|contig_1:c1423448-1422258   GKKCFE_06635
E141.sqn:CDS    Maleylacetoacetate isomerase @ Glutathione S-transferase, zeta  lcl|contig_1:c4755920-4755285   GKKCFE_21485

SUSPECT_PRODUCT_NAMES: Use short product name instead of descriptive phrase
SUSPECT_PRODUCT_NAMES: 1 feature ends with 'activity'
E141.sqn:CDS    HD-like signal output (HDOD) domain, no enzymatic activity  lcl|contig_1:5955114-5956325    GKKCFE_27025

SUSPECT_PRODUCT_NAMES: 4 features Is longer than 100 characters. Remove descriptive phrases or synonyms from product names. Keep valid long product names, eg long enzyme names
E141.sqn:CDS    Multicopper oxidase with three cupredoxin domains (Includes cell division protein FtsP and spore coat protein CotA) lcl|contig_1:819899-821275  GKKCFE_03735
E141.sqn:CDS    GTP pyrophosphokinase, (P)ppGpp synthetase I / Guanosine-3',5'-bis(Diphosphate) 3'-pyrophosphohydrolase lcl|contig_1:c4197461-4195215   GKKCFE_18975
E141.sqn:CDS    Glyoxylate reductase / Glyoxylate reductase / Hydroxypyruvate reductase 2-ketoaldonate reductase, broad specificity lcl|contig_1:4747621-4748592    GKKCFE_21440
E141.sqn:CDS    Glycine betaine/carnitine/choline ABC transporter, periplasmic glycine betaine/carnitine/choline-binding protein    lcl|contig_1:4859680-4860582    GKKCFE_21975

SUSPECT_PRODUCT_NAMES: 1 feature contains 'possibly'
E141.sqn:CDS    Membrane protein TerC, possibly involved in tellurium resistance    lcl|contig_1:c5854787-5854020   GKKCFE_26505

SUSPECT_PRODUCT_NAMES: 3 features contain 'gene'
E141.sqn:CDS    Yibq gene product, putative divergent polysaccharide deacetylase    lcl|contig_1:c43395-42619   GKKCFE_00250
E141.sqn:CDS    ABC transporter in pyoverdin gene cluster, ATP-binding component    lcl|contig_1:3868307-3869059    GKKCFE_17350
E141.sqn:CDS    YebG, DNA damage-inducible gene in SOS regulon, expressed in stationary phase   lcl|contig_1:4752788-4753048    GKKCFE_21470

BAD_GENE_NAME: 6 genes contain suspect phrase or characters
E141.sqn:Gene   5_ureB_sRNA lcl|contig_1:346126-346411  GKKCFE_01530
E141.sqn:Gene   epd,gap,gapA    lcl|contig_1:c1070158-1069157   GKKCFE_04950
E141.sqn:Gene   Bacteria_small_SRP  lcl|contig_1:1650897-1650993    GKKCFE_07575
E141.sqn:Gene   RNaseP_bact_a   lcl|contig_1:c4698601-4698249   GKKCFE_21195
E141.sqn:Gene   epd,gap,gapA    lcl|contig_1:c5325075-5324020   GKKCFE_24085
E141.sqn:Gene   Pseudomon-1 lcl|contig_1:5829418-5829534    GKKCFE_26390
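
Some of the categories quoted above can be screened for locally before running table2asn. The sketch below is only an approximation: the check list and the name `flag_product` are assumptions made for illustration and do not reproduce the rules the NCBI tool actually applies.

```python
import re

# Hypothetical checks that mirror a few of the messages quoted above.
# They approximate the wording of the report, not the NCBI implementation.
CHECKS = [
    ("contains underscore", lambda name: "_" in name),
    ("contains '?'", lambda name: "?" in name),
    ("contains '@'", lambda name: "@" in name),
    ("contains 'unknown'", lambda name: "unknown" in name.lower()),
    ("is longer than 100 characters", lambda name: len(name) > 100),
    ("contains three or more numbers together",
     lambda name: re.search(r"\d{3,}", name) is not None),
]


def flag_product(name: str) -> list[str]:
    """Return the labels of all checks that a product name triggers."""
    return [label for label, check in CHECKS if check(name)]


if __name__ == "__main__":
    for product in (
        "Family of unknown function (DUF6124)",
        "Deblocking aminopeptidase @ Cyanophycinase 2",
        "UPF0313 protein PSPTO_4928",
    ):
        print(product, "->", flag_product(product))
```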

oschwengers commented 3 years ago

Hi, I've tried to address as many SUSPECT_PRODUCT_NAMES as possible:

These are the low-hanging fruit. All the remaining issues are far more complex to fix, if they can be handled automatically at all. I'll try to add some more "fix&replace" rules from time to time, and I'm open to all sorts of ideas, suggestions and improvements from the community! Thanks for all the reports! I'll release a patch version soon.
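
A "fix&replace" rule table of this kind could be sketched as an ordered list of pattern and replacement pairs. Everything in the example is an illustrative assumption: the rules, the replacements and the name `clean_product` are hypothetical and are not the rules shipped with Bakta.

```python
import re

# Hypothetical (pattern, replacement) rules, applied in the order listed.
RULES = [
    (re.compile(r"\s*@\s*"), " / "),  # '@' joining two product names
    (re.compile(r"\?"), ""),          # question marks flagged as FATAL
    (re.compile(r"\s{2,}"), " "),     # collapse runs of whitespace left behind
]


def clean_product(name: str) -> str:
    """Apply every rule in order and trim surrounding whitespace."""
    for pattern, replacement in RULES:
        name = pattern.sub(replacement, name)
    return name.strip()


if __name__ == "__main__":
    print(clean_product("Deblocking aminopeptidase @ Cyanophycinase 2"))
    print(clean_product("ABC transporter, substrate-binding protein (Cluster 15, trp?)"))
```

Keeping the rules as data makes it cheap to add or reorder them, but each rule changes the reported annotation, so any real rule would need review against the affected product names.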