'--proteins' reference introduces too many "paralogs"

aleimba commented 10 years ago

I noticed that if you use a reference genome for annotation with option '--proteins', lots of false paralogs will be annotated (gene names which are numerated with an underscore '_'; see hash %collide in lines https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka#L985-1013).

For annotation Prokka just takes the best BLASTP hit (lowest evalue). Might it be useful to have some more options to include the possibility of subject/query coverage/identity cutoffs, in addition to option '--evalue'?

For this purpose BioPerl includes blast HSP tiling via the 'frac*' methods, which can be used to skip BLASTP results which don't satisfy these restraints, e.g.: $hit->frac_identical('query') $hit->frac_aligned_hit These could be included in the BLASTP parsing routine (line https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka#L921 onwards).

aleimba commented 9 years ago

see #67 on a related issue.

tseemann commented 9 years ago

@aleimba what you ask for is quite reasonable.

I originally did have something like this using those BioPerl functions but for highly fragmented assemblies and metagenomes it would not annotate partial genes.

I think it only happens when there is >1 equally good scoring local alignments (ie exact matches) so that BLAST always presents them in the same fixed order, so it gets the same hit ID for each paralog?

aleimba commented 9 years ago

@tseemann, thanks a lot for going through all the issues and implementing them in the new Prokka version. Great work! Actually what I did was annotate genomes, basically an annotation transfer, with one reference genome (it's E. coli so I used MG1655 as reference) and then curated this annotation extensively (unfortunately lots of manual work). Actually I have some numbers for you, gene names with underscore "_" (I included Proteinortho paralogue predictions, with options -synteny/-selfblast/-singles):

Strain	Prokka with MG1655 ref	After manual curation	Proteinortho (c70/i70)	Proteinortho (c90/i90)
Genome 1	445	88	226	125
Genome 2	406	63	150	66

So the problem is actually not that there's ">1 equally good scoring local alignments (ie exact matches)", but the inherent data structure in Prokka. Prokka first will use the --proteins reference annotations and all CDS which have a hit < e-value will be annotated. If you use only a small user supplied database for annotation there will be lots spurious hits which still are < e-value. Unfortunately these CDS will then be skipped in the subsequent databases, Uniprot, HAMAP, etc. even if they have a much better hit there.

So, I would suggest to include the coverage/identity methods only for user supplied (--proteins) reference annotations. This fix would then not touch annotation of partial genes in highly fragmented assemblies/metagenomes.

aleimba commented 9 years ago

I forgot to mention, that a synteny-aware annotation transfer would of course be another possibility to handle the paralog issues (as implemented in RATT, http://ratt.sourceforge.net/, or Proteinortho, http://www.bioinf.uni-leipzig.de/Software/proteinortho/). Especially, since Prokka can now work with GBK files directly as input for --proteins. But that's a whole other story, I guess ...

tseemann commented 9 years ago

@aleimba Whole genome alignment, followed by annotation "liftover", followed by ab initio annotation would be the best way to do related genomes.

Geoff Winsor from pseudomonas.org is having similar problems to what you describe.

Giving the tiny first-pass database (and --proteins) is where Prokka gets most of its speed gain. I probably should have designed it better to easily "tune the dial" between fast and accurate. I think the --min-cov option would make a difference.

You may notice this line in Prokka which has been commented out for a long time ;-)

#    {OPT=>"coverage=f",  VAR=>\$coverage, DEFAULT=>0, DESC=>"Similarity \%id cut-off (otherwise /pseudo)"},

aleimba commented 9 years ago

Hi @tseemann, yes your suggestion sounds good. However, I'm not aware of an annotation pipeline that works with whole genome alignment, well except RATT with NUCmer. I tried it, but I didn't like it that much, E. coli is just too genomically flexible (i.e. very similar core genome, lots of flexible genome). I do like Prokka much better though, especially since its fully featured and runs locally.

In my case, I don't care so much about speed and wouldn't mind if Prokka ran for several hours per genome. But you're right, there will always be a tradeoff between speed and accuracy and right now Prokka does have amazing speed. You can't make everybody happy.

I probably worry unreasonably much about annotation. But, some E. coli annotations in published genomes are just awful. I'm sure you've had lots of experience with the propagation of wrong annotations. Emily Richardson and Mick Watson had a nice review touching on that subject PMID: 22408191.

Anyway, the --min-cov option sounds great and thanks for your time!

tseemann commented 5 years ago

Prokka now has the option:

  --coverage [n.n]   Minimum coverage on query protein (default '80')

tseemann / prokka

'--proteins' reference introduces too many "paralogs" #31