harry-thorpe / piggy

Pipeline for analysing intergenic regions in bacteria
GNU General Public License v3.0

GFF files not produced by prokka might not be handled properly #15

Closed by mgalardini 7 years ago

mgalardini commented 7 years ago

Hi,

As reported in the previous issue I opened (#14), I'm running piggy on ~700 E. coli genomes. All but one have been annotated with Prokka. The only exception is the reference strain, E. coli K-12, for which I'm using the GenBank file available from NCBI, converted to GFF3 format using this Python library.

It would appear that piggy is not picking up the gene names from the GFF file obtained this way (downloadable here). If I look at the IGR_sequences.fasta file generated by piggy, I see FASTA headers like the following:

genome_+_+__+_+__+_+_DP

Adding an ID attribute to each sequence feature, with a value equal to the locus_tag, seems to solve the issue, but I was wondering whether using the more common locus_tag attribute directly would make more sense.

Marco

harry-thorpe commented 7 years ago

Hi Marco,

Thanks again for the feedback. This is a bit complicated unfortunately. I use the ID as this is what Roary relies on (I think).

If Roary encounters duplicate IDs or locus_tags between isolates, it modifies the GFF files by adding a number to the end of the ID (but not the locus_tag). These files are found in 'fixed_input_files' in the Roary output directory. Because Piggy integrates the information from Roary, the gene names must be identical to the ones Roary uses; otherwise the integration doesn't work. So I have to use Roary's ID information.

If there are duplicates between genomes (and no ID), then Roary takes the locus_tag for the first genome, and then adds an ID (with an appended number to make it unique) to the other genomes (but not the first). I have just tested this on your genome, renamed to genome1, 2, 3, etc. (Roary needs a minimum of 3 genomes to run). So in this example genome1 has no ID, but the others have unique IDs.

I think this means that if a genome has no ID, it is safe to use the locus_tag (as long as the files in the fixed_input_files folder get priority; these should all have IDs).
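
The fallback described above could be sketched roughly as follows. This is only an illustration in Python, not Piggy's actual code: the function names `parse_attributes` and `gene_name` are hypothetical, and the sketch assumes the ninth GFF3 column is a plain `key=value;key=value` string.

```python
def parse_attributes(column9):
    """Split a GFF3 attributes column into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gene_name(column9):
    """Return the ID if present, otherwise the locus_tag, otherwise None.

    Hypothetical helper: prefers ID because that is the attribute Roary
    rewrites, and only falls back to locus_tag when no ID exists.
    """
    attrs = parse_attributes(column9)
    return attrs.get("ID") or attrs.get("locus_tag")
```

Giving files from fixed_input_files priority would be a separate step (choosing which file to read) and is not covered by this sketch.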

Does this make sense?

Thanks,

Harry

mgalardini commented 7 years ago

Hi Harry,

I understand; I have modified my GFF file to include an ID field equal to the locus_tag and it worked fine. Thanks a lot for your feedback.
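
For anyone hitting the same problem, the modification could look something like the sketch below. It is a minimal Python illustration, not the script actually used in this thread: the function name `add_id_from_locus_tag` is hypothetical, and it assumes tab-separated GFF3 lines with `key=value` attributes in the ninth column.

```python
def add_id_from_locus_tag(line):
    """Return a GFF3 line with ID=<locus_tag> prepended when ID is missing.

    Comment lines, blank lines, and lines without nine tab-separated
    columns (e.g. an embedded FASTA section) are returned unchanged.
    """
    if line.startswith("#") or not line.strip():
        return line
    body = line.rstrip("\r\n")
    newline = line[len(body):]  # preserve the original line ending
    columns = body.split("\t")
    if len(columns) != 9:
        return line
    attrs = dict(
        field.split("=", 1) for field in columns[8].split(";") if "=" in field
    )
    # Leave the line alone if it already has an ID or has no locus_tag.
    if "ID" in attrs or "locus_tag" not in attrs:
        return line
    columns[8] = "ID=" + attrs["locus_tag"] + ";" + columns[8]
    return "\t".join(columns) + newline
```

Applying this to every line of the file and writing the result out gives a GFF with an ID on each feature that has a locus_tag. Note that features sharing a locus_tag (for example a gene and its CDS) would end up with the same ID, which this sketch does not try to resolve.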

Best, Marco