Closed jhayer closed 4 years ago
@jhayer
you are right that Virus mode sets --gcode=1
but i think the problem is this:
my $prodigal_mode = ($totalbp >= 100000 && !$metagenome) ? 'single' : 'meta';
your virus is short, so prokka changes prodigal to 'meta' (metagenome) mode, which ignores the genetic code setting.
i guess this is a bug - i never really intended prokka for viruses.
i will try and prevent this mode change but i think prodigal won't be happy.
you can try editing the 100000 in your prokka script and changing it to 1000 ?
would VIGOR be better? https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2942859/
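To make the mode switch above concrete, here is a minimal Python sketch of the quoted Perl ternary. It is not Prokka code: the function name `choose_prodigal_mode` and the parameter `min_single_bp` are hypothetical, introduced only to show how the 100000 cutoff decides between 'single' and 'meta', and the 30 kb and 5 Mb lengths are illustrative values.

```python
def choose_prodigal_mode(total_bp, metagenome=False, min_single_bp=100_000):
    """Python paraphrase of the quoted Perl line (hypothetical helper).

    Returns 'single' only when the input is at least `min_single_bp` long
    and metagenome mode was not requested; otherwise returns 'meta'.
    """
    return "single" if total_bp >= min_single_bp and not metagenome else "meta"


if __name__ == "__main__":
    # Illustrative lengths: a ~30 kb viral genome and a ~5 Mb bacterial genome.
    for total_bp in (30_000, 5_000_000):
        print(total_bp, choose_prodigal_mode(total_bp))
    # Lowering the cutoff, as suggested above, keeps the short genome in 'single'.
    print(30_000, choose_prodigal_mode(30_000, min_single_bp=1000))
```

With the default cutoff any genome shorter than 100000 bp falls through to 'meta', which is why editing the number in the script changes the behaviour for short viral genomes.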
While playing around with some covid19 genomes I ran into a similar issue to @jhayer. Unfortunately, from the VIGOR4 github page it seems that coronaviruses are not readily annotatable, although they may be taxonomically close enough to the supported viruses?
Annotatable Viruses
Vigor4 uses the VIGOR_DB project which currently has databases for the following viruses:
Influenza (A & B for human, avian, and swine, and C for human)
West Nile Virus
Zika Virus
Chikungunya Virus
Eastern Equine Encephalitis Virus
Respiratory Syncytial Virus
Rotavirus
Enterovirus
Lassa Mammarenavirus
In any case, I edited the line suggested by @tseemann from 100000 to 100 (some of the covid19 genomes from ncbi are only a few hundred bp) and this appears to have improved my prokka->roary results. Before editing the suggested line, roary identified only 8 genes in total and failed to identify any core genes. After the edit:
Core genes (99% <= strains <= 100%) 8
Soft core genes (95% <= strains < 99%) 2
Shell genes (15% <= strains < 95%) 1
Cloud genes (0% <= strains < 15%) 41
Total genes (0% <= strains <= 100%) 52
Prokka is not designed for viruses whatsoever.
hCoV is a complicated genome with a polyprotein and many mat_peptides, some of which arise during ribosomal slippage etc., so it can't be trivially annotated anyway.
If you submit your genome to NCBI (like we did) they will annotate it for you. They have a rapid submission process for hCoV.
As a follow-up question, what ab initio gene prediction tool would you recommend for viruses?
Hi,
I have been running prokka on a coronavirus genome, using a protein set. For one of the proteins, the predicted start codon is wrong. It is not using the normal ATG but a GTG located 2 codons upstream (GTGAAAATG). This is weird, as I set both --kingdom Viruses and --gcode 1. I have tried using Prodigal alone, and it seems that it is actually Prodigal that is ignoring the translation table and using this GTG as the start codon when predicting the ORF. Do you have any idea how this could be fixed?
Thanks, Juliette
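For reference, the start-codon situation described in this report can be reproduced on the nine bases quoted (GTGAAAATG) with a small Python sketch. The helper `in_frame_starts` is hypothetical and is not how Prodigal scores starts; it only lists which in-frame codons count as a start under a given set of allowed start codons, and the two sets passed below are illustrative assumptions.

```python
def in_frame_starts(seq, start_codons=("ATG",)):
    """Return 0-based offsets of codons in reading frame 0 that are allowed starts.

    `start_codons` is an assumption supplied by the caller, not something
    read from a real translation table.
    """
    seq = seq.upper()
    return [i for i in range(0, len(seq) - 2, 3) if seq[i:i + 3] in start_codons]


if __name__ == "__main__":
    fragment = "GTGAAAATG"  # the bases quoted in the report
    # Only ATG allowed: the single candidate start sits at the ATG.
    print(in_frame_starts(fragment))
    # GTG also allowed: the upstream GTG becomes a candidate too.
    print(in_frame_starts(fragment, ("ATG", "GTG")))
```

The GTG and the ATG are in the same frame, two codons apart, so a predictor that accepts GTG as a start can extend the ORF upstream exactly as described.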