tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation

Error: "Not using a valid terminator codon!" when annotating small contigs #112

Open symPiotr opened 9 years ago

symPiotr commented 9 years ago

Hi, I am working with very small bacterial genomes, and when trying to annotate the smallest of them using the latest Prokka I came upon a curious error.

Contigs above a certain length (120kb is OK) annotate correctly. But when the total length of contig/contigs in the input fasta file is below some threshold (75kb is too small), Prokka produces a series of warning messages:

--------------------- WARNING ---------------------

MSG: Seq [Contig_ID]: Not using a valid terminator codon!

...and seems to change the genetic code, leading to incorrect annotation. However, increasing the size of the input fasta file - either by duplicating all contigs, or by attaching a long polyA tail to one of the sequences - fixes the problem: annotation is then correct. Likewise, if the 120kb contig which annotates correctly is cropped so that only half remains, the problem described above appears.

This is the command that I have been using:

prokka1 --force --outdir /my_path/TEST_annotation --prefix XXX --gcode 4 --kingdom Bacteria --rfam --addgenes --locustag XXX TEST.fasta

Let me know if you would like to see the input/output files.

And thanks for maintaining this great software!

Piotr

tseemann commented 9 years ago
The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (transl_table=4)

Bacteria: The code is used in Entomoplasmatales and Mycoplasmatales (Bove et al. 1989). The situation in the Acholeplasmatales is unclear. Based on a study of ribosomal protein genes, it had been concluded that UGA does not code for tryptophan in plant-pathogenic mycoplasma-like organisms (MLO) and the Acholeplasmataceae (Lim and Sears, 1992) and there seems to be only a single tRNA-CCA for tryptophan in Acholeplasma laidlawii (Tanaka et al. 1989). In contrast, in a study of codon usage in Phytoplasmas, it was found that 30 out of 78 ORFs analyzed translated better with code 4 (UGA for tryptophan) than with code 11 while the remainder showed no differences between the two codes (Melamed et al. 2003). In addition, the coding reassignment of UGA Stop --> Trp can be found in an alpha-proteobacterial symbiont of cicadas: Candidatus Hodgkinia cicadicola (McCutcheon et al. 2009).
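To make the difference between the two codes concrete, here is a minimal Python sketch that translates the same reading frame under translation table 11 and table 4. The amino-acid strings follow the NCBI compact layout (bases ordered T, C, A, G); the example ORF and the helper names `codon_table` and `translate` are made up for illustration.

```python
from itertools import product

BASES = "TCAG"  # NCBI ordering: first base varies slowest, third base fastest

# Amino acids for all 64 codons; the two tables differ only at TGA (index 14).
AA_TABLE_11 = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
AA_TABLE_4 = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_table(amino_acids):
    """Map each codon (TTT, TTC, ..., GGG) to its one-letter amino acid."""
    codons = ("".join(triplet) for triplet in product(BASES, repeat=3))
    return dict(zip(codons, amino_acids))


def translate(dna, table):
    """Translate complete codons in frame 0; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical ORF with an in-frame TGA: ATG AAA TGA GGT TAA
orf = "ATGAAATGAGGTTAA"
print(translate(orf, codon_table(AA_TABLE_11)))  # MK*G*
print(translate(orf, codon_table(AA_TABLE_4)))   # MKWG*
```

Under table 11 the internal TGA reads as a stop, so a gene caller using the wrong table would truncate or split such a gene.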

tseemann commented 9 years ago

@piotrlukasik Piotr: I think the error message is from within BioPerl, but I need to dig deeper. Can you please send the TEST.fasta file to torsten.seemann@gmail.com or just drag it into the comment here on github?

tseemann commented 8 years ago

The problem is that when the total input is under 100,000 bp, Prokka runs Prodigal in "meta" mode.

This seems to have issues with different genetic codes.
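As an illustration only, the size-based switch described above might look like the following Python sketch. Prokka itself is written in Perl, so this is not its actual code; the function names and the tiny FASTA parser are hypothetical, and the 100,000 bp cutoff is taken from the comment above.

```python
META_CUTOFF_BP = 100_000  # cutoff quoted in this thread; treated as an assumption


def total_bp(fasta_text):
    """Sum the sequence length over all records in a FASTA string."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")  # skip headers and blanks
    )


def choose_prodigal_mode(n_bp, cutoff=META_CUTOFF_BP):
    """Return 'meta' for small inputs and 'single' otherwise."""
    return "single" if n_bp >= cutoff else "meta"


print(choose_prodigal_mode(75_000))   # meta
print(choose_prodigal_mode(120_000))  # single
```

This matches the behaviour reported in the original post: the 75kb input falls into "meta" mode, while the 120kb contig does not. It also suggests why padding the file with duplicated contigs or a polyA tail appeared to fix the annotation.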

hyattpd commented 8 years ago

Prodigal shouldn't be run with meta mode if you know the genome (you can't specify a genetic code with meta mode). The correct way to run a draft genome is not to submit a contig at a time, but to submit a multiple FASTA containing all the contigs (which Prodigal will train on and produce good results). This is covered in detail here:

https://github.com/hyattpd/prodigal/wiki/Advice-by-Input-Type

hyattpd commented 8 years ago

I suppose in a future version I could allow the specification of a genetic code for meta/anonymous mode (and tell it to skip the canned training files that don't match the genetic code). You would still get worse results running individual small contigs through metagenomic/anonymous mode compared to just putting all the contigs of a single genome in one file and using the default mode.

symPiotr commented 8 years ago

Hi, thank you for this! I guess that my problem was due to the exceptional nature of my study organism: it does use an alternative genetic code, and the complete genome of some strains is less than 100kb. I don't think that many other people face this issue... Before running prokka on another set of similar genomes, I will make sure to read through the advice about Prokka modes carefully!

hyattpd commented 8 years ago

You can still run Prodigal with default settings on an input of less than 100KB and specify the genetic code. Less than 100KB isn't ideal since there's not as much sequence to train on, but it will still work. 20KB is the threshold below which the program refuses to run without a training file or metagenomic mode.
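A rough Python sketch of that workaround: build a Prodigal command line for the default ("single") mode with an explicit genetic code, and refuse inputs under the 20KB floor mentioned above. The helper name, the output file name, and the guard are hypothetical; `-i`, `-p`, `-g`, `-f` and `-o` are the Prodigal 2.6 options as far as I know, so check `prodigal -h` for your version.

```python
MIN_TRAINING_BP = 20_000  # floor quoted in this thread; an assumption here


def prodigal_single_command(fasta_path, n_bp, gcode, out_path="genes.gff"):
    """Return an argument list for running Prodigal in single-genome mode."""
    if n_bp < MIN_TRAINING_BP:
        # Mirrors the reported behaviour: too little sequence to self-train.
        raise ValueError(
            f"{n_bp} bp is below {MIN_TRAINING_BP} bp; "
            "a training file or metagenomic mode would be needed"
        )
    return [
        "prodigal",
        "-i", fasta_path,   # multi-FASTA with all contigs of the genome
        "-p", "single",     # default mode: train on the input itself
        "-g", str(gcode),   # translation table, e.g. 4 for UGA = Trp
        "-f", "gff",
        "-o", out_path,
    ]


cmd = prodigal_single_command("TEST.fasta", 75_000, 4)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` if Prodigal is on the `PATH`.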