sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

Missing proteins in roary output #288

Closed: thackl closed this issue 7 years ago

thackl commented 7 years ago

Hi,

I am running Roary (3.6.2) on a set of ~200 Prochlorococcus genomes. I annotated the genomes with Prokka, which gave 326414 proteins. However, when I run Roary on the GFFs, clustered_proteins contains only 320796 protein IDs, i.e. 5618 proteins are missing from the final clusters.

The missing proteins come from different genomes; most of them are short, but not all. So far, I couldn't find a clear pattern. Any idea where or why the proteins might get lost?
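
For reference, a minimal sketch of how the missing IDs can be listed. It assumes each line of clustered_proteins has the form `cluster_name: id1<TAB>id2...`. The function names `ids_in_clusters` and `missing_ids` are hypothetical helpers, not part of Roary or Prokka.

```python
def ids_in_clusters(cluster_lines):
    """Collect every protein ID that appears in any cluster line.

    Assumed line format: "cluster_name: id1<TAB>id2<TAB>...".
    """
    clustered = set()
    for line in cluster_lines:
        line = line.strip()
        if not line:
            continue
        # Everything after the first colon is treated as the member list.
        _, _, members = line.partition(":")
        clustered.update(members.split())
    return clustered


def missing_ids(all_ids, cluster_lines):
    """Return the sorted IDs present in the annotation but in no cluster."""
    return sorted(set(all_ids) - ids_in_clusters(cluster_lines))


if __name__ == "__main__":
    # Toy data only; real input would be read from the annotation and
    # from the clustered_proteins file.
    example_clusters = ["dnaA: g1_00001\tg2_00001", "group_2: g1_00002"]
    example_ids = ["g1_00001", "g1_00002", "g2_00001", "g2_00003"]
    print(missing_ids(example_ids, example_clusters))
```

Sorting the result by genome prefix or by sequence length afterwards may make it easier to spot a pattern.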

andrewjpage commented 7 years ago

Hi,

A few filters are applied that may account for the difference. Very short genes are excluded, as are genes with stop codons in the middle (pseudogenes), among others. I would need to see some example data to tell you exactly why.

Andrew
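
As a rough illustration of the two filters mentioned above, the sketch below flags a translated protein sequence that is very short or contains an internal stop codon. It is not Roary's implementation: the function name `filter_reasons` and the `min_length_aa=40` cutoff are assumptions chosen for the example, and `*` is assumed to mark a stop codon in the translation.

```python
def filter_reasons(protein, min_length_aa=40):
    """Return the reasons a protein might be dropped before clustering.

    `min_length_aa` is a hypothetical cutoff, not a documented Roary value.
    A trailing "*" (terminal stop) is ignored; any other "*" is internal.
    """
    reasons = []
    body = protein.rstrip("*")  # drop the terminal stop, if present
    if len(body) < min_length_aa:
        reasons.append("short")
    if "*" in body:
        reasons.append("internal_stop")
    return reasons


if __name__ == "__main__":
    # Toy sequences only.
    print(filter_reasons("MKT*"))
    print(filter_reasons("M" * 30 + "*" + "K" * 30 + "*"))
```

Running a check like this over the proteins that are absent from the clusters would show how many of them the two named filters could explain, and how many need another explanation.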