tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation

Question: Reducing hypothetical proteins #439

Closed YiJessePi closed 5 years ago

YiJessePi commented 5 years ago

Hi, I get a high fraction of hypothetical proteins (~70%) when annotating my genome with prokka. Tools like eggnog-mapper annotate a higher fraction of proteins, but their long run time makes them impractical for me.
Is there a way to reduce the hypothetical protein rate at a similar run time (same order of magnitude)? Maybe by adding an additional DB? Do you have a recommended one?

[This makes me wonder how prokka runs so fast when it BLASTs proteins against 3 DBs and also uses HMMs...]
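
For anyone wanting to reproduce a figure like the ~70% quoted above, here is a minimal sketch that computes the hypothetical fraction from Prokka's tab-separated summary. It assumes the `.tsv` output has `ftype` and `product` columns and that unannotated CDS carry the product string "hypothetical protein"; check this against your Prokka version. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def hypothetical_fraction(tsv_text):
    """Fraction of CDS rows whose product is 'hypothetical protein'.

    Assumes a header row containing 'ftype' and 'product' columns.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    cds = [row for row in reader if row["ftype"] == "CDS"]
    if not cds:
        return 0.0
    hypo = sum(
        1 for row in cds
        if row["product"].strip().lower() == "hypothetical protein"
    )
    return hypo / len(cds)


# Illustrative rows only, not real Prokka output.
header = ["locus_tag", "ftype", "length_bp", "gene", "EC_number", "COG", "product"]
rows = [
    ["X_00001", "CDS", "900", "", "", "", "hypothetical protein"],
    ["X_00002", "CDS", "1200", "dnaA", "", "", "Chromosomal replication initiator protein DnaA"],
    ["X_00003", "tRNA", "76", "", "", "", "tRNA-Ala(tgc)"],
    ["X_00004", "CDS", "450", "", "", "", "hypothetical protein"],
    ["X_00005", "CDS", "300", "", "", "", "hypothetical protein"],
]
sample = "\n".join("\t".join(r) for r in [header] + rows) + "\n"
fraction = hypothetical_fraction(sample)
```

With a real run you would pass the contents of the `.tsv` file in the output directory instead of `sample`.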

tseemann commented 5 years ago

@YiJessePi

  1. Do you know what species/genus your genome is? If so, just download some complete GenBank genomes and use the --proteins option: https://github.com/tseemann/prokka/blob/master/README.md#option---proteins

  2. Having 70% hypothetical is not unusual in less-studied genomes. Even classic E. coli K-12 has many unknown genes with 4-letter gene codes of the form yXXX.

  3. You could use PGAP instead and get NCBI-quality annotations. Google 'ncbi pgap'.
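
Suggestion 1 can be sketched as a small wrapper that assembles the Prokka command line. The flags `--proteins`, `--outdir`, `--prefix` and `--cpus` are Prokka options; the function name and all file names below are placeholders, so substitute your own reference GenBank file and contigs.

```python
import shlex


def build_prokka_command(contigs, trusted_proteins, outdir, prefix, cpus=None):
    """Return the argv list for a Prokka run that uses a trusted protein set."""
    cmd = [
        "prokka",
        "--proteins", trusted_proteins,  # GenBank or protein FASTA of a close relative
        "--outdir", outdir,
        "--prefix", prefix,
    ]
    if cpus is not None:
        cmd += ["--cpus", str(cpus)]
    cmd.append(contigs)  # the input contigs file goes last
    return cmd


# Placeholder file names, for illustration only.
cmd = build_prokka_command("contigs.fa", "reference.gbk", "annot", "mygenome")
print(shlex.join(cmd))
```

To actually run it, pass `cmd` to `subprocess.run(cmd, check=True)` on a machine where Prokka is installed.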