harry-thorpe / piggy

Pipeline for analysing intergenic regions in bacteria
GNU General Public License v3.0

More documentation needed for IGR_presence_absence file #14

Closed by mgalardini 7 years ago

mgalardini commented 7 years ago

Hi,

first of all, thanks a lot for this great tool. I am testing it on a set of ~700 E. coli genomes. Producing the IGR_presence_absence.csv file took approximately 14 hours using 20 CPUs.

I know that the presence_absence file mimics Roary's format, but I was wondering whether you could shed light on what the information stored in each cell means.

Example: genome_+_+_gene1_+_+_gene2_+_+_CO_R

I figured that genome, gene1 and gene2 represent the target genome and the genes flanking the IGR of interest, but so far I could not figure out the meaning of the _+_+_ bit.

Also, I noticed that the string can end with any of the following: CO_F, CO_R, DP, DT, NA. Is any documentation available that explains what those notations mean?

Thanks a lot for your help.

harry-thorpe commented 7 years ago

Hi Marco,

Thanks for the interest. I have updated the documentation in the README. Hopefully this explains everything, but if not just let me know. Briefly, the trailing codes describe the gene orientation information, and the _+_+_ is just a delimiter (a bit crude).
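For anyone who wants to read these cells programmatically, here is a minimal sketch in Python that splits a cell on the `_+_+_` delimiter. It assumes every non-empty cell has exactly the four fields seen in the example above (genome, gene1, gene2, orientation code) and that an empty cell means the IGR is absent from that genome, as in Roary's format. The names `IgrCell` and `parse_igr_cell` are made up for illustration and are not part of piggy.

```python
from typing import NamedTuple, Optional

# Delimiter taken from the example cell in this issue.
DELIMITER = "_+_+_"


class IgrCell(NamedTuple):
    """Hypothetical container for one cell of IGR_presence_absence.csv."""
    genome: str
    gene_1: str
    gene_2: str
    orientation: str  # e.g. CO_F, CO_R, DP, DT or NA


def parse_igr_cell(cell: str) -> Optional[IgrCell]:
    """Split one cell into its fields; return None for an empty cell.

    Assumption: a non-empty cell always holds exactly four fields.
    """
    cell = cell.strip()
    if not cell:
        return None
    parts = cell.split(DELIMITER)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {cell!r}")
    return IgrCell(*parts)


if __name__ == "__main__":
    print(parse_igr_cell("genome_+_+_gene1_+_+_gene2_+_+_CO_R"))
```

Because the orientation codes CO_F and CO_R contain an underscore themselves, splitting on the full `_+_+_` string rather than on a single `_` keeps them intact.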

Thanks,

Harry

mgalardini commented 7 years ago

Great, thanks for the quick reply, it is much clearer now!