tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation
850 stars, 226 forks

Annotate previously called genes #66

Closed: donovan-h-parks closed this issue 5 years ago

donovan-h-parks commented 9 years ago

It would be useful to be able to process previously called genes. This would allow Prokka to be used for annotating an arbitrary set of genes which could be from multiple organisms and/or simply called with a program other than Prodigal.

tseemann commented 9 years ago

This is similar to #63 and I think it is a useful idea.

How do you see your previously called genes existing? In a GFF file? A GBK file?

donovan-h-parks commented 9 years ago

For myself, I often have called genes in a flat FASTA file that I would like to have annotated with Prokka. In the simplest case, these would just be genes called with a program other than Prodigal. For example, RAST is often used for gene calling and annotation, but I would personally like to be able to also annotate these same genes with Prokka.

tseemann commented 9 years ago

The --proteins option will do most of what you want. Of course if Prodigal misses one of the genes, it won't appear. And it might predict a different start codon.

There are two main ways to find genes: (1) ab initio, like using Prodigal, or (2) by alignment of existing genes. Both have advantages. Prokka currently only does (1). I would like to do both (1) and (2).

What would you like to happen if your gene isn't fully intact or has a frame-shift in the contigs?
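To make the two gene-finding approaches above concrete, here is a toy Python sketch. It is only an illustration under strong assumptions: the function names are hypothetical, the ab initio part is a naive forward-strand ORF scan (not how Prodigal works), and the "alignment" part is an exact substring search standing in for BLAST-style alignment.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs_ab_initio(seq, min_codons=3):
    """Approach (1), toy version: scan the three forward frames for ATG...stop.

    Returns sorted (start, end) pairs, 0-based and end-exclusive,
    with the stop codon included in the ORF.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def find_gene_by_alignment(seq, known_gene):
    """Approach (2), toy version: report every exact occurrence of a known gene."""
    seq, known_gene = seq.upper(), known_gene.upper()
    hits = []
    pos = seq.find(known_gene)
    while pos != -1:
        hits.append((pos, pos + len(known_gene)))
        pos = seq.find(known_gene, pos + 1)
    return hits


# Hypothetical contig carrying two identical copies (paralogs) of one gene.
gene = "ATGAAATTTTAG"
contig = "CC" + gene + "GG" + gene + "CC"

print(find_orfs_ab_initio(contig))           # [(2, 14), (16, 28)]
print(find_gene_by_alignment(contig, gene))  # [(2, 14), (16, 28)]
```

Note that the alignment search returns two hits for a single query gene. Without coordinates there is no way to tell which copy the original prediction referred to.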

donovan-h-parks commented 9 years ago

My understanding is that the --proteins option is for providing additional proteins to annotate from (i.e., an additional annotation database). What I would like is a way to use Prokka to annotate a set of existing genes using the databases and methodology of Prokka. For example, these could be genes called with GLIMMER. I'd like to feed a FASTA file of genes in amino acid space produced by GLIMMER into Prokka to have them annotated. In essence, I'd simply like to skip gene calling with Prodigal in favour of GLIMMER.

tseemann commented 9 years ago

@dparks1134 If you want to supply an existing gene prediction, it can't really be as FASTA. It needs to be coordinates (GFF) on the exact same contigs. Otherwise we would have to rely on BLAST or sequence alignment to find them - and that brings all the problems of multiple hits with paralogs. What output does GLIMMER produce?
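As a rough illustration of why coordinates tie a prediction unambiguously to the contigs, the sketch below pulls CDS sequences out of contigs using GFF-style lines. This is not a Prokka feature. The helper names, the contig, and the `GLIMMER` source label are all made up for the example, and the parser reads only the standard nine tab-separated GFF columns.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def parse_gff_cds(gff_text):
    """Collect (seqid, start, end, strand) for every CDS line; GFF is 1-based, inclusive."""
    features = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        features.append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return features


def extract_cds(contigs, feature):
    """Slice the CDS out of its contig, reverse-complementing minus-strand genes."""
    seqid, start, end, strand = feature
    if seqid not in contigs:
        raise KeyError(f"GFF refers to unknown contig: {seqid}")
    sub = contigs[seqid][start - 1:end]
    return reverse_complement(sub) if strand == "-" else sub


# Hypothetical data: one gene on the plus strand, an identical copy on the minus strand.
contigs = {"contig1": "CCATGAAATTTTAGGGCTAAAATTTCATCC"}
gff = (
    "##gff-version 3\n"
    "contig1\tGLIMMER\tCDS\t3\t14\t.\t+\t0\tID=gene1\n"
    "contig1\tGLIMMER\tCDS\t17\t28\t.\t-\t0\tID=gene2\n"
)

features = parse_gff_cds(gff)
sequences = [extract_cds(contigs, f) for f in features]
print(sequences)  # ['ATGAAATTTTAG', 'ATGAAATTTTAG']
```

Both genes have the same sequence, yet each feature maps to exactly one location because the GFF records the contig, the coordinates, and the strand. A FASTA file of the same two genes would lose that information.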

standage commented 5 years ago

Sorry to pick up this ancient thread, but is it possible to feed GFF3 of previously computed gene predictions to Prokka? I can't tell for certain from the current documentation, but it looks like this isn't an option.

dturaev commented 5 years ago

I would also appreciate this functionality. I'm also using other tools like InterProScan, and it would be nice to directly match the results with Prokka's annotation. (Of course it's possible to use Prokka's *.faa file as input for other tools, but it seems more efficient to do it the other way around.)

tseemann commented 5 years ago

No, it's not possible, and won't be in the 1.x series. It's possible 2.x will support this feature. Prokka annotations are nowhere near as good as those InterProScan would give; Prokka is designed for speed.