tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation

Adding support for partial genes #88

Open aleimba opened 9 years ago

aleimba commented 9 years ago

I would like to bring up again the issue of including partial genes. See the pull request (#37) from @lguy, where he implemented it. I think it would be good for draft genomes and even more so for highly fragmented metagenomes.

Also, as @sjackman observed, the modes of Prodigal are now called differently (#16). It might be useful to change 'single' to 'normal' and 'meta' to 'anon' in line 664 to not confuse users who look up the Prodigal docs.

EDIT: Just realized the newest Prodigal version still states 'single' and 'meta' in its command-line help text ... The Wiki however has only the new terms (https://github.com/hyattpd/prodigal/wiki/cheat-sheet and https://github.com/hyattpd/prodigal/wiki/Gene-Prediction-Modes)

aleimba commented 9 years ago

@hyattpd just replied about the mode name changes. They will be implemented from Prodigal v3.0.0 onward; the Wiki already uses the new names in preparation for v3.x (hyattpd/Prodigal#11). My bad.
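To make the naming question concrete, here is a minimal sketch of how the two sets of mode names could be mapped. The function name and the version check are hypothetical and not taken from Prokka or Prodigal; the only facts used are the ones in this thread ('single' becomes 'normal', 'meta' becomes 'anon', starting with v3.x).

```python
# Legacy -> planned Prodigal mode names, as described in this thread.
LEGACY_TO_NEW = {"single": "normal", "meta": "anon"}


def prodigal_mode_name(legacy_mode: str, major_version: int) -> str:
    """Return the mode name that matches the docs of a given Prodigal major version.

    Hypothetical helper: assumes the rename applies from major version 3 onward.
    """
    if legacy_mode not in LEGACY_TO_NEW:
        raise ValueError(f"unknown Prodigal mode: {legacy_mode!r}")
    if major_version >= 3:
        return LEGACY_TO_NEW[legacy_mode]
    # Versions before 3.x still use the old names in their help text.
    return legacy_mode
```

With this, `prodigal_mode_name("meta", 2)` keeps the old name, while `prodigal_mode_name("meta", 3)` gives the new one.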

hyattpd commented 9 years ago

I haven't really followed this discussion, but I would not recommend the -c option for anything except finished chromosomes. With prokaryotic genomes being 85% coding, the likelihood of a partial gene running off either edge is extremely high (there is an 85% chance that the edge bases are inside a gene, and a somewhat lower chance that you have at least 60bp of coding sequence). Using -c, you're going to miss more than half the genes in some data sets (those with only small contigs).

Just as an example: I have a data set with E. coli randomly sampled in thousands of 1200bp contigs, along with the coordinates of the Genbank-annotated genes in those contigs.

With the -c option on, you'd miss at least half those genes (the ones missing stop codons). The ones missing start codons would be truncated and you'd be reporting less of the protein than you actually could be.
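The distinction between genes missing a stop codon and genes missing a start codon can be sketched as a small classifier over annotated coordinates. This is only an illustration: the `Gene` type, the function name, and the coordinate convention (1-based, inclusive, on the source genome) are assumptions, not part of the data set described above.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    start: int   # leftmost coordinate on the genome (1-based, inclusive)
    end: int     # rightmost coordinate on the genome (1-based, inclusive)
    strand: str  # "+" or "-"


def classify(gene: Gene, contig_start: int, contig_end: int) -> str:
    """Classify a gene relative to a contig sampled from the same genome."""
    if gene.end < contig_start or gene.start > contig_end:
        return "outside"
    left_cut = gene.start < contig_start   # gene runs off the left edge
    right_cut = gene.end > contig_end      # gene runs off the right edge
    if not left_cut and not right_cut:
        return "complete"
    if left_cut and right_cut:
        return "missing_both"
    # On the + strand the start codon sits at the left end of the gene,
    # on the - strand it sits at the right end.
    start_is_cut = left_cut if gene.strand == "+" else right_cut
    return "missing_start" if start_is_cut else "missing_stop"
```

For a 1200bp contig covering positions 1000 to 2199, a plus-strand gene at 2000 to 2500 is classified as `missing_stop`, and a plus-strand gene at 900 to 1500 as `missing_start`.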

aleimba commented 9 years ago

Thanks for your insight, @hyattpd. Although I think it's very unlikely that you'll get a bacterial draft assembly with a maximum contig length of 1200bp, you'll definitely miss out on genes in draft genomes and especially metagenomes. Of course, as you said, how many depends on the number of small contigs.

hyattpd commented 9 years ago

Even in long contigs, there will be a partial gene at each edge ~75% of the time, so with two edges per contig you wind up missing ~1.5 genes per contig. I guess it's a question of how much one cares about partial proteins. I think it is better to always call Prodigal without -c, and to give Prokka an option to not report partial genes below some length (rather than passing this option on to Prodigal).
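A post-filter along these lines could look roughly like the sketch below. It assumes Prodigal was run without -c and with GFF output (`-f gff`), and it relies on the `partial=` attribute Prodigal writes for each CDS ("00" means both ends are complete). The function name and the 300bp default are made up for illustration; this is not how Prokka does it.

```python
def keep_feature(gff_line: str, min_partial_len: int = 300) -> bool:
    """Return True if a Prodigal GFF line should be reported.

    Complete genes are always kept; partial genes are kept only if they
    span at least min_partial_len bases (hypothetical threshold).
    """
    cols = gff_line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "CDS":
        return True  # comments, headers and non-CDS lines pass through
    start, end = int(cols[3]), int(cols[4])
    # Parse the key=value pairs in the attributes column.
    attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
    if attrs.get("partial", "00") == "00":
        return True  # true start and stop are both present
    return end - start + 1 >= min_partial_len


def filter_gff(lines, min_partial_len: int = 300):
    """Yield only the GFF lines that pass keep_feature."""
    for line in lines:
        if keep_feature(line, min_partial_len):
            yield line
```

The threshold stays on the reporting side, so Prodigal still sees the contig edges as open and the long partial genes are not lost.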

tseemann commented 9 years ago

@hyattpd I had considered the -c option originally when implementing Prokka and agreed with @aleimba, but I am having second thoughts now.

In general, genome assemblies break at repeats that are longer than the read length or span. In bacteria these are nearly always duplicate / paralogous genes, such as rRNA islands and insertion sequences. The break often occurs in the intergenic region, and the repeated gene gets its own contig.
An N50 of 1200bp is rare in modern bacterial genomics, and such an assembly would simply be discarded.

But I think the idea of post-filtering is a good one. I will think more about it.

aleimba commented 9 years ago

I agree with @tseemann on gaps in repeats. I've also found (through cumbersome manual gap finishing) that many short-read assemblers don't cope very well with contig ends: overlaps with other contigs, incorrect resolution of repeats, etc.

Apart from that, including partial genes in Prokka would be a big plus, and thanks to @tseemann for taking it up! As you mentioned, a test suite would be sweet for future code integration, but sadly I have no experience with that. I'm guessing it's quite some overhead.