tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation

Missing short genes #14

Open sjackman opened 10 years ago

sjackman commented 10 years ago

Prodigal penalizes predicting genes that are shorter than 250 bp (83 aa). As a result, Prokka is missing a number of short proteins that do exist in a closely related species. Any thoughts on how to deal with this? By checking with prodigal -s I've learned that Prodigal is in fact predicting the genes, but they have a score < 0 and so are discarded.
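To see which genes are at risk, one option is to list the CDS features in the related species' annotation that fall under the 250 bp threshold. A minimal sketch, assuming the reference annotation is available as a GFF3 file; the function name `short_cds` and the sample lines are hypothetical and not part of Prokka or Prodigal:

```python
def short_cds(gff_lines, max_bp=250):
    """Yield (seqid, start, end, length_bp) for CDS features shorter than max_bp.

    Assumes standard tab-separated GFF3: column 3 is the feature type,
    columns 4 and 5 are 1-based inclusive start and end coordinates.
    """
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        start, end = int(cols[3]), int(cols[4])
        length = end - start + 1
        if length < max_bp:
            yield cols[0], start, end, length


# Hypothetical example records, not real annotation data.
example = [
    "##gff-version 3\n",
    "contig1\tref\tCDS\t1\t90\t.\t+\t0\tID=cds1\n",
    "contig1\tref\tCDS\t100\t1000\t.\t+\t0\tID=cds2\n",
]
for seqid, start, end, length in short_cds(example):
    print(seqid, start, end, length)
```

Running the same function over a Prokka GFF and the reference GFF would give two lists to compare by hand.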

sjackman commented 10 years ago

Changing the Prodigal short CDS penalty from 250 bp to 100 bp rescues 2 of the missing 9 short genes.

sjackman commented 10 years ago

Using --meta mode (-m anon in Prodigal 2.7 from GitHub) saves 2 genes, and combining -m anon with reducing the short CDS penalty to 100 bp saves 6 of the 9 short genes. Progress.

sjackman commented 10 years ago

See

tseemann commented 10 years ago

@sjackman I am thinking of adding a special database of well-known small proteins. In Staph, for example, there is a 6 aa "toxin" gene (!) which never gets found. Using a stricter glocal alignment (e.g. glsearch36_t) might make sense here.

I've heard that there may exist databases of these things. This might be a start: http://compbio.cs.toronto.edu/psmdb/desc.html

If not, maybe we could infer one from records in Genbank?

tseemann commented 9 years ago

@sjackman I just looked at the non-fragment, confirmed bacterial proteins in Swiss-Prot: about 4500 of them are under 200 aa long, of which about 1000 are under 100 aa. I'm guessing Prodigal misses a lot of these. I may have to do something about this within Prokka.
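A count like this can be roughly reproduced from a protein FASTA file. A minimal sketch, assuming UniProt-style headers in which fragments are marked with "(Fragment)" or "(Fragments)"; the function name `count_short_proteins` and the example records are hypothetical, and the sketch does not filter by taxonomy or evidence level:

```python
def count_short_proteins(fasta_lines, thresholds=(200, 100)):
    """Count non-fragment proteins shorter than each threshold (in aa)."""
    counts = {t: 0 for t in thresholds}
    header, length = None, 0

    def tally(header, length):
        # Skip the state before the first record, and any fragment entries.
        if header is None or "(Fragment" in header:
            return
        for t in thresholds:
            if length < t:
                counts[t] += 1

    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            tally(header, length)  # close out the previous record
            header, length = line, 0
        elif line:
            length += len(line)  # sequences may wrap over several lines
    tally(header, length)  # close out the final record
    return counts


# Hypothetical example records, not real Swiss-Prot entries.
example = [
    ">sp|X00001|TOXA_EXAMP Small toxin OS=Example",
    "MKLAVG",
    ">sp|X00002|FRAG_EXAMP Some protein (Fragment) OS=Example",
    "MKLAVGTT",
]
print(count_short_proteins(example))
```

To use it on a real download, pass an open file handle instead of the list.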

ryu1013 commented 3 months ago

Dear @sjackman and @tseemann,

Has there been any update on handling issues related to missing short genes? I’m particularly interested in any recent changes or plans to address this.