katholt / RedDog


Identical output files from phylogeny run #22

Closed prmcadam closed 9 years ago

prmcadam commented 9 years ago

For a phylogeny run the CoverMatrix.csv and GeneSummary.csv files are duplicates of each other (identical md5sums).

d-j-e commented 9 years ago

For a phylogeny run the CoverMatrix.csv and GeneSummary.csv files APPEAR to be duplicates of each other (certainly can have identical md5sums - i.e. no changes to output apart from file name).

Need to code a test to see whether this script is really doing what it is supposed to...
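A check like this could be scripted rather than run by hand. The following is only a minimal sketch, not part of RedDog: the helper names `md5sum` and `outputs_are_identical` are made up for illustration, and the file paths are whatever the phylogeny run happened to write.

```python
import hashlib


def md5sum(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_are_identical(path_a, path_b):
    """True if the two files have the same MD5 digest (hypothetical helper)."""
    return md5sum(path_a) == md5sum(path_b)
```

A regression test could then assert that `outputs_are_identical("CoverMatrix.csv", "GeneSummary.csv")` is false for a run where the summary is expected to differ from the raw coverage matrix.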

d-j-e commented 9 years ago

Just had a look. This script only does one thing: it counts the number of strains that have a coverage of greater than 95% (option -c cutoff) and outputs this to the genewise summary file - PresenceAbsence.csv. At the moment the coverage is just reported 'as is'... [so your observation about the md5sums is correct, Paul, and the files are indeed identical!]

However, this is not what this script is supposed to do. Description from file:

- read in coverage output file (% coverage for each gene in mapped pan genome)
- remove genes that are not covered (to set % level, default 95%) in any strains
- generate presence/absence table (1/0, based on set % level, default 95%)
- generate summary of genes, reporting total number of strains with the gene

FUTURE:

- could also provide strain IDs and groups, and summarize presence by group
- could also calculate pan and core rarefaction curve data
- could also accept depth file and use this along with % coverage to call presence/absence
- could also accept RAST annotation file and output product identifiers etc along with genes (esp in gene summary table)
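To make the intended behaviour concrete, here is a rough sketch of the first four steps in that description. It is not the actual parseGeneContent code: the function name `call_presence`, the input layout (one row per gene, one column per strain, header in the first row) and the use of `>=` for the cutoff are all assumptions made for illustration.

```python
import csv
import io


def call_presence(cover_rows, cutoff=95.0):
    """Build presence/absence and gene summary tables from coverage rows.

    cover_rows: list of rows, first row is the header ['gene', strain, ...],
    remaining rows hold % coverage per strain (assumed layout).
    """
    header, *rows = cover_rows
    presence = [header]
    summary = [["gene", "strains_with_gene"]]
    for row in rows:
        gene = row[0]
        # 1 if the gene is covered at or above the cutoff, else 0
        # (>= is an assumption; the real script may use a strict >)
        calls = [1 if float(value) >= cutoff else 0 for value in row[1:]]
        if sum(calls) == 0:
            continue  # drop genes not covered in any strain
        presence.append([gene] + calls)
        summary.append([gene, sum(calls)])
    return presence, summary


# Made-up demo data: three genes, two strains
demo = "gene,s1,s2\ng1,100,96.5\ng2,50,10\ng3,95,0\n"
presence, summary = call_presence(list(csv.reader(io.StringIO(demo))))
```

With this demo input, `g2` is dropped because neither strain reaches the cutoff, so the presence/absence table differs from the raw coverage matrix, which is the behaviour the original outputs were missing.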

Guess this is suddenly priority one for fixing/developing...

d-j-e commented 9 years ago

Sorry about the formatting of the last message - does not show up in edit mode at all! - worked it out (bleepin' markdown!)

d-j-e commented 9 years ago

The outputs are now correct (v0.5.1) - the next version of RedDog (v0.5.2) will include the expanded parseGeneContent script which includes plotting the core/accessory genome, and depth filtering.