Closed mgalardini closed 4 years ago
Hi, sorry for the slow response and thanks for flagging this.
Yes, at the moment we have favoured throwing an error when unusual annotations are included. We include a script to deal with this situation, which essentially removes them as you suggested. However, given the number of times this has come up, I am starting to think it may be worth including an option in the main pipeline.
In case it is useful, the script can be found in scripts/convert_refseq_to_prokka_gff.py.
Hi! Thanks, I did not notice that script, very useful!
In version 1.2.4, we have included the option --remove-invalid-genes, which filters out annotations that have an incorrect length or contain premature stop codons.
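For readers wondering what such a filter checks, here is a minimal sketch of the two criteria mentioned above (length not a multiple of 3, premature stop codon). It is not panaroo's actual implementation: the function names are hypothetical, the input is assumed to be a plain nucleotide string on the coding strand, and the stop codons are assumed to be TAA, TAG and TGA.

```python
# Illustrative only: not taken from panaroo's source code.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumed stop codons


def has_valid_length(seq):
    """True if the sequence is non-empty and its length is a multiple of 3."""
    return len(seq) > 0 and len(seq) % 3 == 0


def has_premature_stop(seq):
    """True if any codon before the final one is a stop codon."""
    seq = seq.upper()
    # Stop before the last codon, which is allowed to be a stop.
    internal = [seq[i:i + 3] for i in range(0, len(seq) - 3, 3)]
    return any(codon in STOP_CODONS for codon in internal)


def is_valid_cds(seq):
    """Combine both checks: correct length and no in-frame internal stop."""
    return has_valid_length(seq) and not has_premature_stop(seq)
```

A gene like prfB, which needs a programmed frameshift, would fail a check of this kind even though it is a real, functional gene, so dropping such entries is a pragmatic workaround rather than a biological judgement.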
Hi, thanks for this impressive piece of work. I'm running panaroo on a set of E. coli genomes, all of which have been annotated by prokka except E. coli K-12 (find the GFF here). Since that annotation contains many pseudogenes, preprocessing fails at the check that each gene's length is a multiple of 3. Furthermore, there seems to be a non-pseudo gene (locus_tag b2891) whose length is not a multiple of 3 but which has the following note: "an in-frame premature UGA termination codon is located within the prfB sequence, and a naturally occuring +1 frameshift is required for synthesis of RF-2". The translation is provided under the translation key of the GFF entry.
The quick solution is to remove those entries, or to reannotate the genome with prokka, but knowing about these corner cases might help if you are interested in covering them.
Thanks for your great work!
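As a rough illustration of the "remove those entries" workaround, the sketch below drops suspect CDS lines from a GFF3 file held as a list of lines. It is not the bundled conversion script. The function names are hypothetical, and it assumes the standard nine tab-separated GFF3 columns and that pseudogenes carry a pseudo=true attribute. It checks each CDS line on its own, so a CDS split over several lines is not handled, and parent gene features are left in place.

```python
# Illustrative only: a per-line filter, not the bundled conversion script.
def cds_is_suspect(line):
    """True for CDS lines flagged as pseudo or with a length not divisible by 3."""
    fields = line.rstrip("\n").split("\t")
    # Keep comments, directives and FASTA lines (anything without 9 columns).
    if line.startswith("#") or len(fields) != 9:
        return False
    if fields[2] != "CDS":
        return False
    start, end = int(fields[3]), int(fields[4])
    attributes = fields[8].split(";")
    is_pseudo = "pseudo=true" in attributes  # assumed attribute spelling
    bad_length = (end - start + 1) % 3 != 0  # GFF3 coordinates are inclusive
    return is_pseudo or bad_length


def filter_gff_lines(lines):
    """Return the lines with suspect CDS features removed."""
    return [line for line in lines if not cds_is_suspect(line)]
```

Because it works on a list of lines, it can be applied to the output of readlines() and the result written back with writelines().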