sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

What clusters end up in accessory_binary_genes.fa? #225

Closed tseemann closed 8 years ago

tseemann commented 8 years ago

The manual says:

First of all we construct a FASTA file with the binary presence and absence of genes, where 'A' means a gene is present and 'C' means it is absent. Only the first 4000 genes in the accessory genome are considered

Based on some data we have tried, it seems that singleton clusters do NOT end up in the file. Is that expected?

E.g. with 10 samples that are mostly clonal, but one of which carries a plasmid, the tree comes out mostly flat and the .fa file contains no sites for the plasmid genes.
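To make the encoding quoted from the manual concrete, here is a minimal sketch of it, not Roary's actual code. The function name, sample names and cluster names are hypothetical; only the 'A' = present / 'C' = absent convention comes from the manual.

```python
def encode_presence_absence(samples, clusters):
    """Build one pseudo-sequence per sample, one character per cluster.

    samples:  list of sample names (hypothetical input shape).
    clusters: list of (cluster_name, set_of_samples_containing_it).
    'A' marks presence and 'C' marks absence, as described in the manual.
    """
    return {
        sample: "".join("A" if sample in members else "C" for _, members in clusters)
        for sample in samples
    }


def to_fasta(records):
    """Render a {name: sequence} dict as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())
```

With this encoding a plasmid gene found in a single sample would show up as one column containing a single 'A', so if singleton clusters are filtered out beforehand, that sample has nothing to distinguish it from its clonal neighbours.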

tseemann commented 8 years ago

I've found this code: https://github.com/sanger-pathogens/Roary/blob/056512409fcb0e817cf16ae554792816b80b9356/lib/Bio/Roary/AccessoryBinaryFasta.pm

And it seems that, besides the 4000-gene limit, there is a 5% upper and lower bound, which I assume trims clusters whose membership counts are too low or too high?

Is there a way to script or parameterize this from the command-line tools?
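For illustration, a filter of the kind described could look like the sketch below. This is a guess at the behaviour, not a transcription of the linked module: the function name, the rounding with `ceil`, and the choice of inclusive versus exclusive comparisons are all assumptions, picked so that singletons are dropped in a 10-sample run, which matches the observation above.

```python
import math


def filter_clusters(cluster_counts, n_samples,
                    lower_pct=5.0, upper_pct=5.0, max_clusters=4000):
    """Keep clusters whose membership is neither too rare nor too common.

    cluster_counts: list of (cluster_name, number_of_samples_containing_it).
    The boundary handling below is an assumption, not confirmed Roary behaviour.
    """
    lower = math.ceil(n_samples * lower_pct / 100)               # assumed rounding
    upper = n_samples - math.ceil(n_samples * upper_pct / 100)   # assumed rounding
    kept = []
    for name, count in cluster_counts:
        if len(kept) >= max_clusters:
            break  # cap on the number of accessory clusters considered
        if count <= lower or count > upper:
            continue  # too rare (e.g. singletons) or too close to core
        kept.append(name)
    return kept
```

Under these assumptions, with 10 samples the lower bound rounds up to 1, so any cluster present in only one sample is removed, while clusters present in 2 to 9 samples are kept.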

andrewjpage commented 8 years ago

Hi Torsten,

I wanted to cap the size of the file sent into FastTree since it can be memory hungry. Running a few tests, I think I may have been a bit too cautious here. My original thinking was to focus on getting the general high-level groupings in a reasonable order (hence getting rid of the top and bottom 5%). I'll remove this restriction and see how things go.

Andrew

tseemann commented 8 years ago

Thanks!
Nullarbor now produces pan-genome trees, and they seem to have more resolution.

jacorvar commented 5 years ago

Hi @andrewjpage,

Is it now possible to generate accessory_binary_genes.fa from the gene_presence_absence.csv file using a script from the command line?

Thanks
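The thread does not name an official script for this. As a rough starting point, the sketch below builds an 'A'/'C' FASTA directly from the CSV. It is not Roary's own implementation and its output will not necessarily match Roary's file: it assumes the first 14 columns of gene_presence_absence.csv are metadata and the remaining columns are samples, treats any non-empty cell as "present", skips genes present in every sample, and applies no percentage bounds or 4000-gene cap. The function name is hypothetical.

```python
import csv

N_METADATA_COLUMNS = 14  # assumption: sample columns start after these


def presence_absence_to_fasta(csv_handle, n_metadata=N_METADATA_COLUMNS):
    """Convert a gene presence/absence CSV into binary FASTA text.

    Each sample becomes one record; each accessory gene becomes one site,
    'A' if the sample's cell is non-empty and 'C' otherwise.
    """
    reader = csv.reader(csv_handle)
    header = next(reader)
    samples = header[n_metadata:]
    sites = {sample: [] for sample in samples}
    for row in reader:
        cells = row[n_metadata:]
        cells += [""] * (len(samples) - len(cells))  # pad short rows
        present = [bool(cell.strip()) for cell in cells]
        if all(present):
            continue  # present everywhere: treated as core, not accessory
        for sample, is_present in zip(samples, present):
            sites[sample].append("A" if is_present else "C")
    return "".join(f">{s}\n{''.join(chars)}\n" for s, chars in sites.items())
```

To use it, open the CSV with `open(path, newline="")`, pass the handle to `presence_absence_to_fasta`, and write the returned text to a .fa file.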