Large .gbk with wrong /translation sequence

regelka commented 4 years ago

Hi,

over the last years I've annotated several metagenomes using prokka and never had any (serious) problems with the output. A while ago during annotation of a new dataset with prokka 1.13.3 I've encountered a strange error in the .gbk files, as the translation qualifier did not always contain the respective amino acid sequence.

This week I found the time for a deeper look into the problem and could finally identify the source of the error. For testing I've used an old metagenomic dataset (72,417 contigs) originally annotated with prokka 1.11.0 and reannotated it with prokka 1.14.5. Here is an example of the differences:

 prokka 1.11.0
/gene="pcaF_1" /locus_tag="N1_10020" /EC_number="2.3.1.174" /inference="ab initio prediction:Prodigal:2.6" /inference="similar to AA sequence:UniProtKB:Q43974" /codon_start=1 /transl_table=11 /product="Beta-ketoadipyl-CoA thiolase" /protein_id="goettingen:N1_10020" /translation="MNPAYLCDAVRTPFGRLNGSLATVRADDLAALPLKALQVRHPQL DWAAVDDVLLGCANQAGEDNRNVAHMALLLAGLPLQIPGCTLNRLCGSSLDAVAMAAR AIKTGESELMIAGGVESMSRAPFVMGKAESAFSRAMKLEDTTMGWRFINPQMKTLYGV DSMPQTAENVALKFNISRQDQDAFALRSQLRTAAAQESGFFADQLIDVSIAQKKGEPQ LFTQDEHPRATTMEALAKLKPVVDPQGTVTAGNASGLNDGACALLLAGESGLSRHGLQ PMARIVASAVTGIEPSIMGFAPAQAVRKVLKISGLSLDQMDVIELNEAFAAQALAVTR ELGLPDDAAQVNPNGGAIALGHPLGASGGRLVMNAAWQLQNTRGRYGLCTMCIGVGQG IALIIERM"
 prokka 1.14.5
/gene="pcaF_1" /locus_tag="N1_10020" /inference="ab initio prediction:Prodigal:002006" /inference="similar to AA sequence:UniProtKB:Q43974" /note="Annotated using prokka 1.14.5 from https://github.com/tseemann/prokka" /codon_start=1 /transl_table=11 /protein_id="goettingen:N1_10020" /db_xref="COG:COG0183" /translation="AATGGATTGTGGATCATTTTGAACAAGAAGGCAAAACCGTTAAA GTAAATCCCCTGGCATTCATTGATCATCAGCAGGAGTTGGACTTGATTACTCGTGAAG CCAAGTTTGAATATGCAGTGTCCCGGATTTCATGTTAGCGGTTTTAAATAGGAGAACG ATACATATGCTATTTCTACCAAACAAAATCAAAAAAGCTTCAGTATACAATGGCAAAG GGACATTTCAAGAAACAATCAGATTATTTGCCAAATATTCTGAGCAGGATTGGCTGTC ATTTTCATCCCGACAAAGCTGGGAGCAGCATAAGCAGGCGCTGTTTTTATATATTGAA GCCCTTCTCTACACCGAATGTGTTCAAGATGTCGAGACGAAATATCTGGATTTGTTCA AGAAAGTTCATGATGAAGGAGCTCAGCGTTTTGAATATGAGAGAAACTGGTTTATTCT GCTTAACCTGATCGAAGGGTGTTTGCCACTGTTCCAGCCTCATGGCGATACCAAGCAA GTGATAGAAAAGTTGGGACCATTGCTTGAACGGGAACCGTT"

Of 35,510 genes, 2405 contain a nucleotide sequence instead of an amino acid sequence. Looking for the origin of the sequence, it could find it as contig N1_10020 in the .fsa output. When checking other misannotated genes, the nucleotide sequence always came from a contig with an identical name, e.g. for the gene with locus_tag N1_10030 the nucleotide sequence was the whole sequence of contig N1_10030 etc. Contig numbers start with N1_1, whereas locus_tag numbers start with N1_00010 (increasing by 10 for every gene). This explains why this error can only be found from locus_tag N1_10000 to locus_tag N1_72410 as it is the only range with matching contig names. I don't know why not every locus_tag in this region is affected by this, but apparently a missing product qualifier is responsible for this behavior.

I know that this error may affect only a small number of people as you need a high number of contigs to run into this problem, but it is still annoying. It is also most likely a bug in tbl2asn, but I think it can be easily prevented with a workaround in prokka. When manually changing the contig names back to the old contig naming scheme (instead of N1_1, it was N1_contig000001 in prokka 1.11.0) in the .fsa and .tbl file before running tbl2asn, the problem is gone. Is there a specific reason for the modified contig naming scheme or can it be reverted back to the old pattern to prevent this problem?

tseemann commented 4 years ago

As the start you said prokka 1.13.3 but your example is from 1.14.5 ?

That definitely looks like a bug in tbl2asn ! Do you really need a .gbk file?
You can delete the tbl2asn line from the prokka script and just use the GFF3

regelka commented 4 years ago

As the start you said prokka 1.13.3 but your example is from 1.14.5 ?

I'm sorry, that was written in a hurry and might be confusing. The first time I've seen this bug was with a new metagenome and prokka 1.13.3. Due to the lack of an assembly with an older prokka version and this new metagenome, I've now used an old metagenome originally assembled with prokka 1.11.0 as reference and reannotated it with prokka 1.13.3 (a while ago) and now for the bug report again with prokka 1.14.5. The erroneous output is identical in prokka 1.13.3 and 1.14.5 and I've just mentioned 1.13.3 to indicate the possibly first version containing this problem.

Do you really need a .gbk file?

Unfortunately, I need the .gbk file for subsequent pipelines. There is no (simple) way to modify these pipelines, leaving me at the moment only with the option to change the contig names by hand and run tbl2asn manually.

tseemann commented 4 years ago

There are various GFF to GBK scripts out there that you could use, assuming Prokka produces the GFF first, or you delete the tbl2asn line.

I don't think it is prokka causing the bug, I think tbl2asn is not coping with 72,000 contigs. I have no control over that, and in fact it will soon be gone from prokka.

Can you just break your metagenome contigs into batches of 5000, then run Prokka on each of those, and then just concatenate them? That's a valid multi Genbank file. Would be some issues with Gene IDs but it might get you most of where you need to go.

What metagenomic downstream tools needs GBK format?

regelka commented 4 years ago

I don't think it is prokka causing the bug, I think tbl2asn is not coping with 72,000 contigs. I have no control over that, and in fact it will soon be gone from prokka.

As said before: In previous prokka versions (at least up to 1.11.0), contigs were named N1_contig000001 and not N1_1 as done by recent versions. Just out of curiosity: What was the reason for changing the naming scheme? It triggers the tbl2asn bug (as it only happens in case of identical contig and gene IDs) and could probably be reverted back to its old pattern.

Can you just break your metagenome contigs into batches of 5000, then run Prokka on each of those, and then just concatenate them?

That could work in theory, but is problematic due to our limited resources. While I can get the necessary walltime on one of the more powerful systems to finish a large job like a full metagenome assembly/annotation, I would most likely have to wait longer when splitting the data to several jobs as the number of smaller servers is limited and they are heavily used. An upgrade and overhaul of the whole structure is planned, but will probably take another one or two years until it is done. It is not in my responsibility and I have no deeper access to the systems, so I have to stick to the rules and make the best of it.

What metagenomic downstream tools needs GBK format?

Some pipelines written for metatranscriptome analysis use the .gbk file as backbone for the transcriptome. Other tools like TraV2 (a transcriptome viewer) also depend on GBK data (again as backbone). Most of these tools are widely used in my department, but were written by a bioinformatician who left us a while ago. Due to limited programming skills (he was the only bioinformatician we had), we are currently stuck with the tools as they are and cannot easily modify them.

There are various GFF to GBK scripts out there that you could use, assuming Prokka produces the GFF first, or you delete the tbl2asn line.

I will have a look at them, is there any script you can recommend? The other option (currently used in my department) is to change the contig names in the .fsa and .tbl output to the old naming scheme and subsequently run tbl2asn, but a GFF to GBK converter might also be a solution.

tseemann commented 4 years ago

That could work in theory, but is problematic due to our limited resources. While I can get the necessary walltime on one of the more powerful systems to finish a large job like a full metagenome assembly/annotation, I would most likely have to wait longer when splitting the data to several jobs

I mean, in that one big job, break it up and run them one after the other - no need to submit separate jobs. Your "job" is just a plain shell script that runs prokka 100 times on the 100 chunks, serially.

In fact, most HPC queues give priority to jobs < 1 hour walltime, and low CPU count, as they can slot them into small holes everywhere.

You can use older versions of Prokka. The databases haven't changed that much really. Our increasing knowledge of protein functions is very slow... people fixated on "omics" now.

ealdraed commented 4 years ago

Hi @regelkamp !

Just out of curiosity: What was the reason for changing the naming scheme?

As can be seen here (https://github.com/tseemann/prokka/commit/92940bcd299dea710a17f2954045ea0eada9121c) the reason was to save space in the ID line.

You have several options here: (1) You could change back this line to 6-digit format with leading "contig". (2) You could just add a single letter "c" (for example) to the current format. (3) If your contigs are already named like "contigXXXXX" etc., you could remove the -centre (or -compliant) option and prokka would not touch the IDs.

tseemann / prokka

Large .gbk with wrong /translation sequence #441