tseemann / prokka

Rapid prokaryotic genome annotation

tbl2asn (25.8) error based on prokka command line flags #460

Open MGNute opened 4 years ago

MGNute commented 4 years ago

I have been dealing with this tbl2asn bug for the last few days, and I have found that it works for some input files and not others. At first I found that prokka ran fine on a set of contigs created by Megahit, but not on one created by Metaspades. While testing this, I accidentally entered the prokka command with only the first few arguments for the same Metaspades contigs, and to my surprise it ran successfully. So the same command with different input files can work or fail, and the same input file with different command flags can work or fail. I am using the singularity image 1.14.5-1 and running on Ubuntu 18.04.3. I have included the commands below, and I have posted the input and output for each of them on Dropbox so you can reproduce this.

To be clear, the singularity image has what appears to be the latest version of tbl2asn (25.8). But just to be extra certain, I downloaded the latest version from the website, tried it on both of the metaspades results and got the same outcome. One ran in an hour or two and the other just ran indefinitely. For the first little while it was printing warning messages to the console (mostly about unrecognized feature CRISPR), but at some point it stopped doing that and ran for somewhere between 12 and 24 hours without producing any further output before I killed it.

I didn't save the actual console outputs in every case, but if it would help I can re-run them with screen logging. They seem to be similar to the error messages everybody else has seen though. If there's any additional information I can give that would help with this, I would be happy to. Thanks!

Here are the commands:

# worked:
export LC_ALL=C
prokka ./SRR606249_metaspades_contigs.fa --metagenome

# failed:
export LC_ALL=C
prokka ./SRR606249_metaspades_contigs.fa --metagenome --eval 1e-06 --proteins /NGStools/prokka/db/kingdom/Bacteria/IS --notrna --rnammer --rawproduct --compliant --outdir ./metaspades_prokka_annotation --prefix SRR606249_metaspades_contigs --cpus 8

# worked:
export LC_ALL=C
prokka ./SRR606249_megahit_contigs.fa --metagenome --eval 1e-06 --proteins /NGStools/prokka/db/kingdom/Bacteria/IS --notrna --rnammer --rawproduct --compliant --outdir ./megahit_prokka_annotation --prefix SRR606249_megahit_contigs --cpus 8

Here are the files. I've included a readme that describes the metaspades files in a bit of detail (mostly the same as above though). Links: readme, metaspades input (44 MB), metaspades output (both versions) (530 MB), megahit input & output (34 MB).

ealdraed commented 4 years ago

Hello @MGNute and thanks for taking the time to report your problems.

I suspected this was related to #441, and it looks like it is. I ran the command that was failing for you and it did complete, although the tbl2asn step alone took around 10 hours:

[...]
[18:37:48] Running: tbl2asn -V b -a r10k -l paired-ends -M b -N 1 -y 'Annotated using prokka 1.14.5 from https://github.com/tseemann/prokka' -Z metaspades_prokka_annotation\/SRR606249_metaspades_contigs\.err -i metaspades_prokka_annotation\/SRR606249_metaspades_contigs\.fsa 2> /dev/null
[04:36:17] Deleting unwanted file: metaspades_prokka_annotation/errorsummary.val
[...]

The problem is that the resulting GenBank file also has /translation qualifiers containing DNA sequences for locus tags starting at 10000. To be clearer: when you instruct Prokka to use --centre or --compliant, it renames the contigs, and the new contig names look like the /locus_tag or /protein_id values.

Compare

>gnl|Prokka|CKLIBFJE_10000 [gcode=11] [organism=Genus species] [strain=strain]

with

/locus_tag="CKLIBFJE_10000"
/protein_id="Prokka:CKLIBFJE_10000"

This seems to confuse tbl2asn.
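
To check whether a given run is affected, one could look for names that appear both as a contig ID and as a locus tag. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the contig ID is the last `|`-separated field of the first token in each FASTA header, and that locus tags appear as `/locus_tag="..."` qualifiers, as in the two snippets above.

```python
import re


def contig_ids(fasta_text):
    """Collect contig names from FASTA headers.

    Assumption: the name is the last '|'-separated field of the first
    whitespace-delimited token, e.g. 'gnl|Prokka|CKLIBFJE_10000'.
    """
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip headers that carry no ID at all
                ids.add(fields[0].split("|")[-1])
    return ids


def locus_tags(genbank_text):
    """Collect the value of every /locus_tag="..." qualifier."""
    return set(re.findall(r'/locus_tag="([^"]+)"', genbank_text))


def colliding_names(fasta_text, genbank_text):
    """Return names used both as a contig ID and as a locus tag, sorted."""
    return sorted(contig_ids(fasta_text) & locus_tags(genbank_text))
```

Called with the contents of the .fsa and .gbk files from one output directory, a non-empty result would point at the name clash described here.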

The annotation of the Megahit-assembled contigs suffers from the same problem in its GenBank file.

To sum this up: (1) All commands complete successfully (although the tbl2asn step might take considerable time). (2) Both the MetaSPAdes and the Megahit annotations suffer from erroneous translations for locus tag 10000 and upward if --centre/--compliant is used. (3) The run time difference might be related to contig size differences between the two assemblies and the aforementioned bug (tbl2asn needing more time to process longer contigs).
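
Point (3) is a guess, and one way to test it would be to compare basic contig statistics for the two assemblies. This is a rough sketch with hypothetical helper names; it only reads plain FASTA text and knows nothing about Prokka or tbl2asn.

```python
def contig_lengths(fasta_text):
    """Return the sequence length of each FASTA record, in file order."""
    lengths = []
    current = None  # None until the first header line is seen
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line.strip())
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths):
    """Contig count, total bases, longest contig and N50 for a length list."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if running * 2 >= total:  # first contig reaching half the total
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "total_bp": total,
        "longest": ordered[0] if ordered else 0,
        "n50": n50,
    }
```

Running `summarize(contig_lengths(open(path).read()))` on both input files would show whether the MetaSPAdes assembly really has longer contigs, and whether either assembly has 10000 or more contigs, which is where the translation errors start.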

@tseemann My suggested solution, as mentioned in #441, is to alter the contig renaming (maybe just add the letter c) or to revert completely to the original format (contigXXXXXX; X = digit), so that contig names can be distinguished from locus_tag/protein_id values: https://github.com/tseemann/prokka/blob/290466d7d3f198c556884c0e72ee7474479301d4/bin/prokka#L452
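
To illustrate the two options, here is a small sketch in Python rather than Prokka's own Perl. It is not the actual Prokka code. It assumes that contigs are currently numbered without zero padding and that locus tags are zero-padded to five digits, which would be consistent with the clash only appearing from 10000 upward. All function names are hypothetical.

```python
def current_contig_id(centre, locustag, n):
    # Assumed shape of the present renaming: unpadded contig number.
    return f"gnl|{centre}|{locustag}_{n}"


def locus_tag(locustag, n):
    # Assumed shape of a locus tag: five-digit zero-padded number.
    return f"{locustag}_{n:05d}"


def prefixed_contig_id(centre, locustag, n):
    # Option 1: add the letter "c" so the name can never equal a locus tag.
    return f"gnl|{centre}|{locustag}_c{n}"


def plain_contig_id(n):
    # Option 2: the older contigXXXXXX format (six digits). Whether the
    # gnl|centre| prefix would be kept in front of it is left open here.
    return f"contig{n:06d}"
```

Under these assumptions `current_contig_id("Prokka", "CKLIBFJE", 10000)` ends in `CKLIBFJE_10000`, the same string as `locus_tag("CKLIBFJE", 10000)`, while both alternative forms stay distinct from any locus tag.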