tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation

GenBank file: Contig name and length collision #153

Open VGalata opened 8 years ago

VGalata commented 8 years ago

Hey!

I have a problem with the GenBank files created by Prokka (v. 1.11): the contig name and its length are not separated by whitespace. I assume the reason is that I increased $MAXCONTIGIDLEN to 40 because I wanted to use the sample IDs (e.g. C3830-198) as prefix and locus tag. The longer contig names seem to be the problem: if the string (contig ID + its length) has 29 characters or more, no whitespace is added between the ID and the length. Example:

LOCUS C3830-198_contig000001443776 bp DNA linear 08-FEB-2016
LOCUS C3830-199_contig000001 65910 bp DNA linear 11-FEB-2016
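
To make the pattern in the two lines above concrete, here is a minimal Python sketch. It assumes the classic fixed-column LOCUS layout, in which the locus name and the sequence length share one 28-character field (columns 13-40). The constant and the function name are hypothetical, chosen only for illustration, and the threshold is inferred from the example lines rather than taken from tbl2asn itself.

```python
# Assumption: the locus name and the sequence length share a single
# 28-character field (columns 13-40 of the classic LOCUS layout).
NAME_PLUS_LENGTH_WIDTH = 28


def locus_fields_collide(contig_id: str, seq_length: int) -> bool:
    """Return True if no separating space fits between the ID and the length."""
    return len(contig_id) + len(str(seq_length)) >= NAME_PLUS_LENGTH_WIDTH


# The two contigs from the example above.
print(locus_fields_collide("C3830-198_contig000001", 443776))  # True
print(locus_fields_collide("C3830-199_contig000001", 65910))   # False
```

A check like this could be run over an assembly before annotation to see which contig IDs are likely to produce a fused LOCUS line.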

Used version: v. 1.11
Additional settings: Set $MAXCONTIGIDLEN to 40 in the Prokka executable.
CMD:

prokka --force --outdir <odir> --prefix <ID> --locustag <ID> --centre <centre> --gram neg --mincontiglen 200 --cpus 10 <fasta file>

peterjc commented 8 years ago

Duplicate of, or closely related to, #135?

See also past issues like #32, #76, #113

VGalata commented 8 years ago

Yes, it is related to issue #32. Sorry for the duplicate post; somehow I did not find it when I searched for questions about the same issue.

So there is no workaround if I want to use my IDs as prefix and locus tags, right? Setting $MAXCONTIGIDLEN to an appropriate number solves the issue described in #135 by making Prokka use the supplied IDs, but the GenBank files may then contain wrongly formatted LOCUS lines.

aleimba commented 8 years ago

If you don't want Prokka to rename your contig SeqIDs, don't set --compliant or --centre; see issue #141. However, if your SeqIDs are too long, you cannot get "correct" GenBank files, as you described. Prokka relies on tbl2asn to create the GenBank flat files, and this tool is very strict. See especially #76, mentioned by @peterjc.

nikolay12 commented 8 years ago

I'm getting "Contig ID must <= 20 chars lon" for contig names generated by SPAdes. The advice in various blog posts was to use --centre. This changed the contig names, but I'm still getting the same error. As it stands, Prokka is not directly usable with SPAdes output. Any advice, or shall I just move to RAST?

peterjc commented 8 years ago

@nikolay12 You could try renaming the contigs in your SPAdes assembly to something short like c00001, c00002, ... with the original name in the FASTA description, and give that to Prokka?
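
A minimal Python sketch of that renaming step, under stated assumptions: the function name `shorten_fasta_ids` and the second contig name in the usage example are hypothetical, and the input is treated as plain FASTA lines. Besides moving the original name into the description, it records a new-ID to original-ID mapping so the names can be restored later.

```python
from typing import Dict, Iterable, Iterator


def shorten_fasta_ids(lines: Iterable[str], mapping: Dict[str, str]) -> Iterator[str]:
    """Yield FASTA lines with headers renamed to c00001, c00002, ...

    The original header text is kept as the FASTA description, and each
    new ID -> original ID pair is stored in `mapping`.
    """
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            original = line[1:].strip()
            new_id = f"c{count:05d}"
            # The original ID is the first whitespace-separated token.
            mapping[new_id] = original.split()[0] if original else ""
            yield f">{new_id} {original}".rstrip() + "\n"
        else:
            yield line


# Usage with in-memory lines; the second contig name is made up.
mapping: Dict[str, str] = {}
fasta = [
    ">NODE_1_length_283141_cov_27.6228\n", "ACGT\n",
    ">NODE_2_length_1000_cov_3.5\n", "GGCC\n",
]
renamed = list(shorten_fasta_ids(fasta, mapping))
print("".join(renamed), end="")
```

Writing `mapping` out as a two-column table next to the renamed FASTA keeps the original names recoverable after annotation.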

tseemann commented 8 years ago

If you use the latest GitHub HEAD, I have changed some code to generate shorter contig names. It might help when you use --compliant mode.

peterjc commented 8 years ago

Yes, https://github.com/tseemann/prokka/commit/92940bcd299dea710a17f2954045ea0eada9121c ought to help - thanks!

bsglicker commented 5 years ago

This is a serious issue in my opinion. Users of SnapGene and SnapGene Viewer are accustomed to opening GenBank files, but when the LOCUS line looks like this:

LOCUS NODE_1_length_283141_cov_27.6228283141 bp DNA linear

the importer doesn't work. The GenBank standard stipulates that “users parse the LOCUS line based on whitespace-separated tokens”. Prokka is not compliant.

Is there a way to force a whitespace before the sequence length?
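
One possible post-processing workaround, sketched in Python and not a feature of Prokka or tbl2asn: if the contig IDs and their lengths are known (for example from the input FASTA), a fused LOCUS token can be split again by matching it against ID + length. The function name `repair_locus_line` is hypothetical, and the rebuilt line is only whitespace-separated; it does not restore the original column alignment.

```python
from typing import Dict


def repair_locus_line(line: str, contig_lengths: Dict[str, int]) -> str:
    """Insert the missing space between a fused contig ID and its length.

    `contig_lengths` maps each contig ID to its sequence length. Lines that
    are not LOCUS lines, or that already have a separate length token, are
    returned unchanged.
    """
    tokens = line.split()
    if len(tokens) < 3 or tokens[0] != "LOCUS" or tokens[2] != "bp":
        return line
    fused = tokens[1]
    for contig_id, length in contig_lengths.items():
        if fused == f"{contig_id}{length}":
            rest = " ".join(tokens[2:])  # "bp DNA linear ..."
            return f"{'LOCUS':<12}{contig_id} {length} {rest}\n"
    return line  # no known ID/length pair matched; leave untouched


lengths = {"NODE_1_length_283141_cov_27.6228": 283141}
broken = "LOCUS NODE_1_length_283141_cov_27.6228283141 bp DNA linear\n"
print(repair_locus_line(broken, lengths), end="")
```

Matching on the exact ID and length avoids guessing where the name ends and the number begins, which a plain regular expression cannot decide for IDs that themselves end in digits.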

peterjc commented 5 years ago

The GenBank changes to move away from the strict column based LOCUS line to white space separation are quite recent.

I wonder if the NCBI have updated tbl2asn to handle this now, in which case Prokka just needs to ensure that that tool is up to date?

valery-shap commented 4 years ago

@bsglicker thank you very much, your message helped me.

nds commented 2 years ago

@tseemann Hi! We are running Prokka on some prokaryote assemblies from INSDC. The contig names (submitted by the users) are fairly long, and we have had to use --compliant --centre X to overcome the failures due to name lengths. I have a couple of questions:

1) Does Prokka maintain a mapping of the contig names that it renamed? I can't seem to find such a file, but it's possible I've missed it. We really do need to revert to the original contig names in the GFF files. We do not need the GenBank files at all.
2) I've read above that switching to the latest version of tbl2asn may get rid of this problem. Are there plans for Prokka to do that?