tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation
Partial v Complete Genes for Metagenomic Analysis #283

Open JChristopherEllis opened 6 years ago

JChristopherEllis commented 6 years ago

Hi,

I would like to see the partial genes and the complete genes when performing metagenomic analysis. Is there a way to identify both?

Thanks, micromania

tseemann commented 6 years ago

What do you mean by partial genes? True pseudo/broken genes? Genes artificially broken by contig ends? Or by mis-assemblies?

JChristopherEllis commented 6 years ago

Sorry for the confusion. I am referring to genes that are artificially broken by contig ends.

JChristopherEllis commented 6 years ago

Or really anything that would yield a partial protein product. I would like to be able to tell the difference between partial protein sequences and complete protein sequences in my metagenomic data.

jvollme commented 6 years ago

Hi micromania2,

I had the same interest (specifically for metagenomic bins and single-cell genomes). Since I had the impression that nobody else wanted that feature, I simply modified the "prodigal" call of my locally installed prokka version slightly for this.

The way prokka originally calls the ORF-caller prodigal is with the "-c" argument (for "closed ends"), which won't let ORFs run over the contig ends. In order to remove this, you simply have to edit line 961 in the prokka script from:

my $cmd = "prodigal -i \Q$outdir/$prefix.fna\E -c -m -g $gcode -p $prodigal_mode -f sco -q";

to

my $cmd = "prodigal -i \Q$outdir/$prefix.fna\E -m -g $gcode -p $prodigal_mode -f sco -q"; #removed "-c" argument

Now you will also get all genes that are artificially broken by contig ends (however, they will not be specifically marked as such).
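Prodigal itself does record which ORFs run off a contig edge: its GFF output carries a `partial=` attribute, a two-character code in which `0` marks a real boundary and `1` an unfinished edge (first character for the left edge, second for the right). A minimal Python sketch for reading that code, assuming Prodigal was run separately with `-f gff`; the function name `partial_edges` and the sample line are illustrative and not part of Prokka:

```python
import re

# Prodigal writes "partial=XY" into the GFF attribute column:
# X describes the left (lower-coordinate) edge, Y the right edge; "1" = unfinished.
PARTIAL_RE = re.compile(r"(?:^|;)partial=([01])([01])(?:;|$)")


def partial_edges(gff_line):
    """Return (left_partial, right_partial) for one Prodigal GFF feature line."""
    fields = gff_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GFF columns")
    match = PARTIAL_RE.search(fields[8])
    if match is None:
        raise ValueError("no partial= attribute found")
    return match.group(1) == "1", match.group(2) == "1"


# Illustrative feature line (made-up contig name and coordinates).
example = "contig_1\tProdigal_v2.6.3\tCDS\t2\t181\t5.1\t+\t0\tID=1_1;partial=10;start_type=Edge"
left, right = partial_edges(example)
print(left, right)  # True False
```

A gene is complete only when both values are `False`, so this can be used to tag edge-truncated genes after the `-c` change.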

(Edit: this is related to #88 btw)

tseemann commented 6 years ago

I agree that Prokka should be allowing partial genes AND annotating them as such.

I am rethinking the whole design of Prokka, esp in terms of metagenomes.

novigit commented 6 years ago

Hi! I would just like to mention that @jennahd, @lguy and I submitted pull request #219 a while ago, which deals exactly with the problem of partial genes at contig edges.

Simply changing the prodigal line to remove the -c flag is not enough! For example, in the resulting GenBank files, gene coordinates should be annotated with '<1' (if partial at the start of the contig) or '>5234' (if partial at the end of a contig with length 5234). We recommend using this version with the flag (--partialgenes), which should deal with the problem automatically!

Hope the pull request will be implemented in the main software at some point.
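To illustrate the coordinate convention only (this is not the code from the pull request), here is a minimal Python sketch that builds a GenBank-style location string with the `<` and `>` partial markers. The function name `genbank_location` and its parameters are hypothetical:

```python
def genbank_location(start, end, strand, partial_left=False, partial_right=False):
    """Build a GenBank/INSDC location string with '<' / '>' partial markers.

    start and end are 1-based inclusive with start <= end. The markers attach
    to the lower and higher coordinate respectively, regardless of strand.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    span = "{}{}..{}{}".format(
        "<" if partial_left else "", start,
        ">" if partial_right else "", end,
    )
    return span if strand == "+" else "complement({})".format(span)


print(genbank_location(1, 300, "+", partial_left=True))       # <1..300
print(genbank_location(4800, 5234, "-", partial_right=True))  # complement(4800..>5234)
```

On the reverse strand, the marker still sits on the coordinate that touches the contig edge, inside the `complement(...)` wrapper.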

JChristopherEllis commented 6 years ago

I went back and used prodigal to differentiate the full-length genes from fragmented genes. I then separated them into two files, one with full-length genes and the other with only fragmented putative genes. I passed these two files back through prokka for functional annotation.
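A split like this could be sketched in Python as follows, assuming the input is Prodigal's protein FASTA (written with `-a`), whose headers carry a `partial=` code where `00` means complete. The function name `split_by_completeness` and the demo records are made up for illustration:

```python
import re

PARTIAL_RE = re.compile(r"partial=([01]{2})")


def split_by_completeness(fasta_text):
    """Split Prodigal FASTA text into (complete, partial) lists of records."""
    complete, partial = [], []
    # Split at every '>' that starts a line; element 0 is whatever precedes the first record.
    for chunk in re.split(r"^>", fasta_text, flags=re.MULTILINE)[1:]:
        record = ">" + chunk.strip()
        header = record.split("\n", 1)[0]
        match = PARTIAL_RE.search(header)
        # Assumption: records without a partial= tag are treated as partial, to stay conservative.
        if match and match.group(1) == "00":
            complete.append(record)
        else:
            partial.append(record)
    return complete, partial


# Made-up demo records in Prodigal's header layout.
demo = (
    ">contig_1_1 # 2 # 181 # 1 # ID=1_1;partial=10;start_type=Edge\n"
    "KLAV\n"
    ">contig_1_2 # 200 # 500 # -1 # ID=1_2;partial=00;start_type=ATG\n"
    "MKTV*\n"
)
complete, partial = split_by_completeness(demo)
print(len(complete), len(partial))  # 1 1
```

Each list can then be written to its own file with `"\n".join(records)`.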

The full length putative genes worked well with almost all of them functionally annotated when passed back through prokka.

However, for the fragmented genes, only about 1/3 of them were identified with Prokka. I think this may be an issue with the options I am using. Is there something I could be doing differently to restore the functional annotation calls to what they were before I split the fragmented and full-length sequences into separate files?

Here is the command line...

prokka --outdir --prefix --notrna --metagenome --cpus <#CPUs> --addgenes

ankeetkumar commented 1 year ago

> Hi! I would just like to mention that @jennahd, @lguy and I submitted pull request #219 a while ago, which deals exactly with the problem of partial genes at contig edges.
>
> Simply changing the prodigal line to remove the -c flag is not enough! For example, in the resulting GenBank files, gene coordinates should be annotated with '<1' (if partial at the start of the contig) or '>5234' (if partial at the end of a contig with length 5234). We recommend using this version with the flag (--partialgenes), which should deal with the problem automatically!
>
> Hope the pull request will be implemented in the main software at some point.

Dear Sir,

I am trying to annotate a viral genome, and due to the lack of coverage, most of my genes are partial.

When I submit to BankIt, I get an error saying that the gene starts with a downstream methionine and that I haven't labelled the partial genes.

How do I add the partial flag to the genes that are partial? Also, I see that Prokka sometimes breaks a gene into two and labels the parts Gene1_1 and Gene1_2. How can I solve that?

Thank you in advance.

Regards, Ankeet