Closed michoug closed 3 years ago
Hi @michoug , thanks a lot for reporting this. So far we've tested the submission only for ENA. Of course, we're keen to make NCBI submissions as smooth as possible, too.
I'll encode the products as requested in the GFF3 specifications.
For the 1st and 3rd points, I think it might be best to add a --compliant option, in line with the Prokka option, to explicitly activate this behavior, which might not be desired in other situations.
Is this a complete list of all issues you encountered? Also, could you provide an example of the command line you used to generate the submission files? This could help other users go through this process. Maybe I'll add a section to the readme, as well.
Hi, thanks for the super-fast response. The issues highlighted here are the main ones (e.g. FATAL); there are others that depend more on the names of the products (see the attached Issues.txt for a list for one genome).
Here is the command that I used to generate submission files:
table2asn_GFF.Linux -M n -J -c w -t template.sbt -l paired-ends -j "[organism=Pseudomonas sp][strain=E102] [gcode=11]" -i E102_bakta/E102.fna -f E102_bakta/E102.gff3 -o E102_bakta/E102.sqn -Z
Here is the link to the documentation: https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/#run
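For anyone running this over several genomes, the command above can be assembled programmatically. The following is only a sketch: the function name and its parameters are hypothetical, and the flags are copied verbatim from the command line in this thread without restating what each one does (see the linked NCBI documentation for that).

```python
def build_table2asn_command(prefix, out_dir, organism, strain,
                            template="template.sbt", gcode=11):
    """Return the argument list for the table2asn_GFF call shown in this thread.

    Hypothetical helper: only the flags from the reported command are used,
    and the input/output layout (<out_dir>/<prefix>.fna etc.) is assumed to
    match the example.
    """
    # Source qualifiers, formatted exactly as in the reported command.
    source = f"[organism={organism}][strain={strain}] [gcode={gcode}]"
    return [
        "table2asn_GFF.Linux",
        "-M", "n", "-J", "-c", "w",
        "-t", template,
        "-l", "paired-ends",
        "-j", source,
        "-i", f"{out_dir}/{prefix}.fna",
        "-f", f"{out_dir}/{prefix}.gff3",
        "-o", f"{out_dir}/{prefix}.sqn",
        "-Z",
    ]


cmd = build_table2asn_command("E102", "E102_bakta", "Pseudomonas sp", "E102")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with the bracketed `-j` value.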
Thanks for the detailed information - that helps a lot.
I've already addressed the lacking gene and product encoding issues.
However, fixing the Dbxrefs and fatal product descriptions might take somewhat longer. But I've put this on the list for the upcoming 1.1 version, which will hopefully be released in the next weeks.
Interesting side effect: adhering to the GFF3 comma encoding convention (%2C) leads to FATAL: SUSPECT_PRODUCT_NAMES: 62 features contain '%'. Any idea how that could be bypassed? Or is this something that maybe should be reported upstream to be fixed in the table2asn_GFF tool?
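To make the encoding convention under discussion concrete, here is a minimal sketch of percent-encoding a GFF3 attribute value. It is not Bakta's actual code; the function names are hypothetical, and the escaped set (`;`, `=`, `&`, `,`, `%` and control characters) is my reading of the GFF3 specification for column 9.

```python
import re
from urllib.parse import unquote

# Characters assumed to require escaping inside a GFF3 attribute value:
# the reserved separators ; = & , plus % itself and ASCII control characters.
_RESERVED = re.compile(r"[%;=&,\x00-\x1f\x7f]")


def encode_attribute_value(value):
    """Percent-encode reserved characters in a single pass (no double encoding)."""
    return _RESERVED.sub(lambda m: "%{:02X}".format(ord(m.group())), value)


def decode_attribute_value(value):
    """Reverse the encoding, e.g. when reading the GFF3 file back in."""
    return unquote(value)


product = "Sodium/proton antiporter, CPA1 family (TC 2A36)"
encoded = encode_attribute_value(product)
```

Here `encoded` keeps the spaces and parentheses as they are and replaces only the comma with `%2C`, which is exactly the `%` that the validator then flags.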
Hi @michoug , I've added a couple of fixes and improvements for GFF3 based GenBank submissions via table2asn_GFF.
All of the points you've raised above should be addressed and all issues should be solved. If this is not the case, please do not hesitate to reach out and re-open this issue.
I'll release v1.1.0 containing these improvements soon - most certainly next week.
Please let me know if there are any further issues - I'm looking forward to your feedback. Thanks again for reporting and best regards!
Hi, congrats on all the fast work. I have a few other "issues" that may eventually be addressed, even though I'm well aware that this process is sometimes a bottomless pit and quite tricky to automate...
SUSPECT_PRODUCT_NAMES: 8 features May contain plural
E141.sqn:CDS Urea carboxylase without Allophanate hydrolase 2 domains lcl|contig_1:c493999-492260 GKKCFE_02155
E141.sqn:CDS Phosphotransferase system, HPr-related proteins lcl|contig_1:c658214-657810 GKKCFE_02965
E141.sqn:CDS Hemolysins-related protein containing CBS domains lcl|contig_1:c830356-829115 GKKCFE_03775
E141.sqn:CDS Phage tail assembly chaperone proteins, E, or 41 or 14 lcl|contig_1:952781-953356 GKKCFE_04360
E141.sqn:CDS Peptidoglycan/LPS O-acetylase OafA/YrhL, contains acyltransferase and SGNH-hydrolase domains lcl|contig_1:c1007564-1006416 GKKCFE_04650
E141.sqn:CDS Diguanylate cyclase with PAS/PAC and GAF sensors lcl|contig_1:1171567-1172943 GKKCFE_05445
SUSPECT_PRODUCT_NAMES: 31 features contain 'unknown'
E141.sqn:CDS Family of unknown function (DUF6124) lcl|contig_1:c109230-108889 GKKCFE_00500
E141.sqn:CDS Family of unknown function (DUF6124) lcl|contig_1:254342-254698 GKKCFE_01120
E141.sqn:CDS Family of unknown function (DUF6124) lcl|contig_1:580095-580460 GKKCFE_02580
SUSPECT_PRODUCT_NAMES: 34 features contains three or more numbers together that may be identifiers more appropriate in note
E141.sqn:CDS Uvs098 lcl|contig_1:252015-252467 GKKCFE_01095
E141.sqn:CDS UPF0313 protein PSPTO_4928 lcl|contig_1:302226-304526 GKKCFE_01330
E141.sqn:CDS L-pipecolate oxidase (1537) lcl|contig_1:320431-321714 GKKCFE_01405
E141.sqn:CDS HI0933-like protein lcl|contig_1:c490707-489466 GKKCFE_02145
E141.sqn:CDS Putative hydro-lyase B723_09185 lcl|contig_1:c496428-495622 GKKCFE_02165
E141.sqn:CDS UPF0114 protein C7528_102400 lcl|contig_1:554275-554763 GKKCFE_02435
E141.sqn:CDS UPF0225 protein CD58_06560 lcl|contig_1:c1018229-1017732 GKKCFE_04695
E141.sqn:CDS UPF0276 protein SAMN03159293_01947 lcl|contig_1:c1039843-1038974 GKKCFE_04820
SUSPECT_PRODUCT_NAMES: 188 features contain underscore
E141.sqn:CDS GBBH-like_N domain-containing protein lcl|contig_1:c125879-125502 GKKCFE_00600
E141.sqn:CDS FAD_binding_3 domain-containing protein lcl|contig_1:c168453-167206 GKKCFE_00760
E141.sqn:CDS ABC_trans_aux domain-containing protein lcl|contig_1:261845-262549 GKKCFE_01150
E141.sqn:CDS MotA_ExbB domain-containing protein lcl|contig_1:272991-273842 GKKCFE_01195
E141.sqn:CDS UPF0313 protein PSPTO_4928 lcl|contig_1:302226-304526 GKKCFE_01330
E141.sqn:CDS Peripla_BP_6 domain-containing protein lcl|contig_1:322080-323216 GKKCFE_01410
E141.sqn:CDS Znf/thioredoxin_put domain-containing protein lcl|contig_1:c389672-388437 GKKCFE_01700
E141.sqn:CDS Cupin_3 domain-containing protein lcl|contig_1:c469729-469385 GKKCFE_02035
E141.sqn:CDS ZT_dimer domain-containing protein lcl|contig_1:c476803-475910 GKKCFE_02080
SUSPECT_PRODUCT_NAMES: 1 feature contains '(TC'
E141.sqn:CDS Sodium/proton antiporter, CPA1 family (TC 2A36) lcl|contig_1:c3087019-3085778 GKKCFE_13940
SUSPECT_PRODUCT_NAMES: 1 feature contains 'FOG'
E141.sqn:CDS FOG: TPR repeat, SEL1 subfamily lcl|contig_1:c4136402-4136001 GKKCFE_18665
FATAL: SUSPECT_PRODUCT_NAMES: 1 feature contains '?'
E141.sqn:CDS ABC transporter, substrate-binding protein (Cluster 15, trp?) lcl|contig_1:4026495-4027427 GKKCFE_18180
FATAL: SUSPECT_PRODUCT_NAMES: 2 features contain '@'
E141.sqn:CDS Deblocking aminopeptidase @ Cyanophycinase 2 lcl|contig_1:c1423448-1422258 GKKCFE_06635
E141.sqn:CDS Maleylacetoacetate isomerase @ Glutathione S-transferase, zeta lcl|contig_1:c4755920-4755285 GKKCFE_21485
SUSPECT_PRODUCT_NAMES: Use short product name instead of descriptive phrase
SUSPECT_PRODUCT_NAMES: 1 feature ends with 'activity'
E141.sqn:CDS HD-like signal output (HDOD) domain, no enzymatic activity lcl|contig_1:5955114-5956325 GKKCFE_27025
SUSPECT_PRODUCT_NAMES: 4 features Is longer than 100 characters. Remove descriptive phrases or synonyms from product names. Keep valid long product names, eg long enzyme names
E141.sqn:CDS Multicopper oxidase with three cupredoxin domains (Includes cell division protein FtsP and spore coat protein CotA) lcl|contig_1:819899-821275 GKKCFE_03735
E141.sqn:CDS GTP pyrophosphokinase, (P)ppGpp synthetase I / Guanosine-3',5'-bis(Diphosphate) 3'-pyrophosphohydrolase lcl|contig_1:c4197461-4195215 GKKCFE_18975
E141.sqn:CDS Glyoxylate reductase / Glyoxylate reductase / Hydroxypyruvate reductase 2-ketoaldonate reductase, broad specificity lcl|contig_1:4747621-4748592 GKKCFE_21440
E141.sqn:CDS Glycine betaine/carnitine/choline ABC transporter, periplasmic glycine betaine/carnitine/choline-binding protein lcl|contig_1:4859680-4860582 GKKCFE_21975
SUSPECT_PRODUCT_NAMES: 1 feature contains 'possibly'
E141.sqn:CDS Membrane protein TerC, possibly involved in tellurium resistance lcl|contig_1:c5854787-5854020 GKKCFE_26505
SUSPECT_PRODUCT_NAMES: 3 features contain 'gene'
E141.sqn:CDS Yibq gene product, putative divergent polysaccharide deacetylase lcl|contig_1:c43395-42619 GKKCFE_00250
E141.sqn:CDS ABC transporter in pyoverdin gene cluster, ATP-binding component lcl|contig_1:3868307-3869059 GKKCFE_17350
E141.sqn:CDS YebG, DNA damage-inducible gene in SOS regulon, expressed in stationary phase lcl|contig_1:4752788-4753048 GKKCFE_21470
BAD_GENE_NAME: 6 genes contain suspect phrase or characters
E141.sqn:Gene 5_ureB_sRNA lcl|contig_1:346126-346411 GKKCFE_01530
E141.sqn:Gene epd,gap,gapA lcl|contig_1:c1070158-1069157 GKKCFE_04950
E141.sqn:Gene Bacteria_small_SRP lcl|contig_1:1650897-1650993 GKKCFE_07575
E141.sqn:Gene RNaseP_bact_a lcl|contig_1:c4698601-4698249 GKKCFE_21195
E141.sqn:Gene epd,gap,gapA lcl|contig_1:c5325075-5324020 GKKCFE_24085
E141.sqn:Gene Pseudomon-1 lcl|contig_1:5829418-5829534 GKKCFE_26390
Hi,
I've tried to address as many SUSPECT_PRODUCT_NAMES as possible:
These are the low-hanging fruit. All the remaining issues are far more complex to fix, if they can be handled automatically at all. I'll try to add some more "fix&replace" rules from time to time and I'm open to all sorts of ideas, suggestions and improvements from the community! Thanks for all the reports! I'll release a patch version soon.
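As an illustration of what a rule-based pre-check could look like, here is a rough sketch that flags product names before they reach table2asn. It is not Bakta's implementation and not NCBI's actual rule set: the rule table and function name are hypothetical, and the checks only approximate a few of the categories from the report above.

```python
import re

# Hypothetical rule table: (label, predicate). Labels paraphrase the
# SUSPECT_PRODUCT_NAMES categories reported in this thread.
RULES = [
    ("contains underscore", lambda p: "_" in p),
    ("contains '?'", lambda p: "?" in p),
    ("contains '@'", lambda p: "@" in p),
    ("contains 'unknown'", lambda p: "unknown" in p.lower()),
    ("longer than 100 characters", lambda p: len(p) > 100),
    # Assumption: "three or more numbers together" means 3+ consecutive digits.
    ("three or more digits together",
     lambda p: re.search(r"\d{3,}", p) is not None),
]


def suspect_reasons(product):
    """Return the labels of all rules that a product name triggers."""
    return [label for label, check in RULES if check(product)]


flagged = suspect_reasons("UPF0313 protein PSPTO_4928")
clean = suspect_reasons("Urea carboxylase")
```

Keeping the rules in a flat table makes it cheap to add further checks over time; deciding how to rewrite a flagged name is the hard part and is deliberately left out here.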
Hi, I'm in the process of submitting genomes annotated with Bakta to the NCBI. While checking the annotation for quality and errors (via table2asn and the GFF3 file, https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/#run), I encountered several issues. I understand if this is not something the tool will be compatible with, as it can be quite tricky, but here is a list of some of the issues I had that may be addressed in the GFF file:
Commas that are intended to be part of a name should be encoded (%2C) according to the GFF3 specifications. However, literal commas should only be included when they are part of enzymatic names. Semi-colons generally should not be included in product names.
SO: in dbxref as they are not yet recognized (https://www.ncbi.nlm.nih.gov/genbank/collab/db_xref/)
Best, Greg