Open sjackman opened 10 years ago
Changing the Prodigal short CDS penalty from 250 bp to 100 bp rescues 2 of the missing 9 short genes.
Using `--meta` mode (`-m anon` in Prokka 2.7 from GitHub) saves 2 genes, and the combination of `-m anon` and reducing the short CDS penalty to 100 bp saves 6 of the 9 short genes. Progress.
@sjackman I am thinking of adding a special database of well-known small proteins. In Staph, for example, there is a 6 aa "toxin" gene (!) which never gets found. Searching against such a database with a stricter glocal alignment (e.g. glsearch36_t) might make sense.
I've heard that there may exist databases of these things. This might be a start: http://compbio.cs.toronto.edu/psmdb/desc.html
If not, maybe we could infer one from records in Genbank?
@sjackman I just went and looked at the bacterial, non-fragment, confirmed proteins in Swiss-Prot, and there are about 4500 of them under 200aa long, of which about 1000 are under 100bp. I'm guessing Prodigal misses a lot of these. I may have to do something about this within Prokka.
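For anyone who wants to repeat that kind of tally on their own FASTA export, here is a minimal sketch. It is not how the numbers above were produced. It assumes UniProt-style headers where fragments carry `(Fragment)` in the protein name, and it treats both cutoffs as residue counts. The function names `read_fasta` and `count_short` are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_short(
    records: Iterable[Tuple[str, str]],
    thresholds: Tuple[int, ...] = (200, 100),
) -> Dict[int, int]:
    """Count non-fragment proteins shorter than each threshold (in residues)."""
    counts = {t: 0 for t in thresholds}
    for header, seq in records:
        # Assumption: fragments are flagged in the header text.
        if "(Fragment)" in header:
            continue
        for t in thresholds:
            if len(seq) < t:
                counts[t] += 1
    return counts
```

Usage would be something like `count_short(read_fasta(open("proteins.fasta")))`, where `proteins.fasta` is a placeholder for whatever file you downloaded.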
Dear @sjackman and @tseemann,
Has there been any update on handling issues related to missing short genes? I’m particularly interested in any recent changes or plans to address this.
Prodigal penalizes predicting genes that are shorter than 250 bp (83 aa). As a result, Prokka is missing a number of short proteins that do exist in a closely related species. Any thoughts on how to deal with this? By checking with `prodigal -s` I've learned that Prodigal is in fact predicting the genes, but they have a score < 0 and so are discarded.
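A rough sketch of how one might pull those discarded short candidates out of the file written by `prodigal -s`. It assumes the file is whitespace-delimited with `#` comment lines, a column header row starting with `Beg`, and begin, end, strand and total score in the first four columns. Please check that against your own output, since the layout may differ between versions. The function name and the 250 bp default are only illustrative.

```python
from typing import Iterable, List, Tuple


def short_negative_candidates(
    lines: Iterable[str], max_len_bp: int = 250
) -> List[Tuple[int, int, str, int, float]]:
    """Return (beg, end, strand, length_bp, total_score) for short, negative-scoring candidates."""
    hits = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, comment lines, and the column header row.
        if len(fields) < 4 or line.startswith("#") or fields[0] == "Beg":
            continue
        try:
            beg, end = int(fields[0]), int(fields[1])
            total = float(fields[3])  # assumed position of the total score
        except ValueError:
            continue  # not a data row in the assumed layout
        length = abs(end - beg) + 1
        if length < max_len_bp and total < 0:
            hits.append((beg, end, fields[2], length, total))
    return hits
```

The coordinates it returns could then be compared against the short genes annotated in the related species to see which ones Prodigal found but scored below zero.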