gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License

GFF preprocessing fails on E. coli K-12 #67

Closed mgalardini closed 4 years ago

mgalardini commented 4 years ago

Hi, thanks for this impressive piece of work. I'm running panaroo on a set of E. coli genomes, all annotated with prokka except for E. coli K-12 (find the GFF here). Since it contains many pseudogenes, preprocessing fails at the check that each gene's length is a multiple of 3. Furthermore, there seems to be a non-pseudo gene (locus_tag b2891) whose length is not a multiple of 3 but which carries the following note: "an in-frame premature UGA termination codon is located within the prfB sequence, and a naturally occurring +1 frameshift is required for synthesis of RF-2". The translation is provided under the translation key of the GFF entry.

The quick solution is to remove those entries, or reannotate the genome with prokka, but maybe knowing these corner cases might be of help in case you are interested in covering them.
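For anyone who wants to try the "remove those entries" route by hand, here is a minimal sketch of what that could look like. It is not panaroo's code and not the bundled conversion script. The function names `cds_length` and `filter_gff` are made up for illustration, and the sketch assumes a standard 9-column GFF3 with an optional `##FASTA` section and single-interval CDS features.

```python
def cds_length(fields):
    """Length of a feature from its 1-based, inclusive GFF start/end columns."""
    return int(fields[4]) - int(fields[3]) + 1


def filter_gff(lines):
    """Split GFF lines into (kept, dropped).

    A line is dropped only if it is a CDS feature whose length is not a
    multiple of 3. Comments, directives, blank lines and everything from
    the ##FASTA directive onwards are kept untouched.
    """
    kept, dropped = [], []
    in_fasta = False
    for line in lines:
        if line.startswith("##FASTA"):
            in_fasta = True
        if in_fasta or line.startswith("#") or not line.strip():
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        is_cds = len(fields) == 9 and fields[2] == "CDS"
        if is_cds and cds_length(fields) % 3 != 0:
            dropped.append(line)
        else:
            kept.append(line)
    return kept, dropped
```

Note that this only drops the CDS line itself. A parent `gene` feature, if present, would be left behind, so a real cleanup would probably need to handle those as well.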

Thanks for your great work!

gtonkinhill commented 4 years ago

Hi, sorry for the slow response and thanks for flagging this.

Yes, at the moment we have favoured throwing an error when unusual annotations are included. We use an included script to deal with this situation; it essentially removes them, as you suggested. However, given the number of times this has come up, I am starting to think it may be worth including an option in the main pipeline.

In case it is useful the script can be found in scripts/convert_refseq_to_prokka_gff.py

mgalardini commented 4 years ago

Hi! Thanks, I did not notice that script, very useful!

gtonkinhill commented 4 years ago

In version 1.2.4, we have included the option --remove-invalid-genes which filters out annotations of incorrect length or that contain premature stop codons.
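To illustrate the two criteria mentioned above (incorrect length, premature stop codons), here is a rough sketch of such a check on a single coding sequence. This is an assumption about what the test might look like, not panaroo's actual implementation. The name `is_valid_cds` is hypothetical, and the stop codon set assumes the standard bacterial genetic code.

```python
# Stop codons in the standard/bacterial genetic code (an assumption here).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_valid_cds(seq):
    """Return True if seq has a length that is a multiple of 3 and
    contains no in-frame stop codon before its final codon."""
    seq = seq.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # The last codon may legitimately be a stop, so only check the others.
    return not any(codon in STOP_CODONS for codon in codons[:-1])
```

Under this check, a gene like prfB (b2891), with its in-frame UGA and a length that is not a multiple of 3, would be reported as invalid.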