tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation

Use of wrong start codon for virus annotation #412

Closed jhayer closed 4 years ago

jhayer commented 5 years ago

Hi,

I have been running Prokka on a coronavirus genome, using a protein set. For one of the proteins, the predicted start codon is wrong: instead of the normal ATG, it uses a GTG located 2 codons upstream (GTGAAAATG). This is odd, as I set both --kingdom Viruses and --gcode 1. I have also tried using Prodigal alone, and it seems that it is actually Prodigal that ignores the translation table and uses this GTG as the start codon when predicting the ORF.

Do you have any idea how this could be fixed?

Thanks,
Juliette
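To make the reported behaviour concrete, here is a minimal sketch that lists which in-frame codons of the quoted motif count as start codons under NCBI translation tables 1 and 11. The start codon sets are the ones NCBI lists for those two tables. The function name `candidate_starts` is hypothetical, and this is not how Prodigal scores starts internally; it only illustrates why a GTG start is unexpected when table 1 is requested but plausible if the bacterial table 11 is effectively in use.

```python
# Start codons listed by NCBI for translation tables 1 (standard)
# and 11 (bacterial, archaeal and plant plastid).
START_CODONS = {
    1: {"TTG", "CTG", "ATG"},
    11: {"TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"},
}


def candidate_starts(seq, table):
    """Return (offset, codon) for every in-frame codon that `table` lists as a start.

    Hypothetical helper for illustration only; frame 0 is assumed to begin
    at the first base of `seq`.
    """
    seq = seq.upper()
    starts = START_CODONS[table]
    return [
        (i, seq[i:i + 3])
        for i in range(0, len(seq) - 2, 3)
        if seq[i:i + 3] in starts
    ]


region = "GTGAAAATG"  # motif quoted in the report above
print(candidate_starts(region, 1))   # only the ATG qualifies under table 1
print(candidate_starts(region, 11))  # the upstream GTG also qualifies under table 11
```

Under table 1 only the ATG at offset 6 is a listed start, while under table 11 the GTG at offset 0 is as well, which matches the start that was reported.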

tseemann commented 5 years ago

@jhayer you are right that Virus mode sets --gcode=1, but I think the problem is this line:

    my $prodigal_mode = ($totalbp >= 100000 && !$metagenome) ? 'single' : 'meta';

Your virus is short, so Prokka switches Prodigal to metagenome mode, which ignores the genetic code setting. I guess this is a bug; I never really intended Prokka for viruses. I will try to prevent this mode change, but I think Prodigal won't be happy.

You can try editing the 100000 in your prokka script and changing it to 1000?
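The effect of that threshold can be sketched in a few lines of Python that mirror the Perl ternary quoted above. This is an illustration, not Prokka's code: `prodigal_mode` and its `min_single_bp` parameter are hypothetical names (in Prokka the 100000 is hard-coded in the quoted line), and the 30000 bp genome size is just a roughly coronavirus-sized example value.

```python
def prodigal_mode(total_bp, metagenome=False, min_single_bp=100000):
    """Mirror of the quoted Perl expression.

    'single' is chosen only when the input is at least `min_single_bp`
    long and metagenome mode was not requested; otherwise 'meta'.
    `min_single_bp` is a hypothetical parameter standing in for the
    hard-coded 100000.
    """
    return "single" if total_bp >= min_single_bp and not metagenome else "meta"


genome_bp = 30000  # illustrative, roughly coronavirus-sized input

print(prodigal_mode(genome_bp))                      # default threshold
print(prodigal_mode(genome_bp, min_single_bp=1000))  # threshold lowered as suggested
```

With the default threshold the example genome falls into 'meta' mode; lowering the threshold to 1000 selects 'single' mode instead. Whether Prodigal then accepts such a short input for training is a separate question, as noted above.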

would VIGOR be better? https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2942859/

franciscozorrilla commented 4 years ago

While playing around with some covid19 genomes I ran into a similar issue to @jhayer's. Unfortunately, from the VIGOR4 GitHub page it seems that coronaviruses are not readily annotatable, although they may be taxonomically close enough to the supported viruses?

Annotatable Viruses

Vigor4 uses the VIGOR_DB project which currently has databases for the following viruses:

    Influenza (A & B for human, avian, and swine, and C for human)
    West Nile Virus
    Zika Virus
    Chikungunya Virus
    Eastern Equine Encephalitis Virus
    Respiratory Syncytial Virus
    Rotavirus
    Enterovirus
    Lassa Mammarenavirus

In any case, I edited the line suggested by @tseemann, changing 100000 to 100 (some of the covid19 genomes from NCBI are only a few hundred bp long), and this appears to have improved my prokka->roary results. Before editing the suggested line, roary identified only 8 genes in total and failed to identify any core genes. After the edit:

Core genes       (99% <= strains <= 100%)   8
Soft core genes  (95% <= strains < 99%)     2
Shell genes      (15% <= strains < 95%)     1
Cloud genes      (0% <= strains < 15%)      41
Total genes      (0% <= strains <= 100%)    52
tseemann commented 4 years ago

Prokka is not designed for Viruses whatsoever.

hCoV has a complicated genome with a polyprotein and many mat_peptides, some of which arise through ribosomal slippage etc., so it can't be trivially annotated anyway.

If you submit your genome to NCBI (like we did) they will annotate it for you. They have a rapid submission process for hCoV.

joanmarticarreras commented 3 years ago

As a follow-up question, what ab initio gene prediction tool would you recommend for viruses?