Closed alantsangmb closed 7 years ago
You're best to ignore the accessory_binary_genes.fa file. It is just for creating a quick and dirty tree with FastTree. The file is filtered to remove very common and very rare variation to speed up tree generation, hence the difference in numbers.
Thank you Andrew! I think the tree is still informative to me even though it has been filtered.
Might I know the thresholds for "very common" and "not common"?
Sorry for the message on a closed thread. But is there a way to generate an accurate accessory gene binary dataset?
The top 5% and bottom 5% are excluded. It is truncated at 4000 genes. The code is here: https://github.com/sanger-pathogens/Roary/blob/master/lib/Bio/Roary/AccessoryBinaryFasta.pm
There are no parameters to modify this, since it's a rough-and-ready tree meant to give you an idea of how things cluster. Even if you were to use all the accessory genes, it would still be quite inaccurate. If it really is something you want to do, you can filter the gene presence and absence Rtab file and build a tree from that (since it will have all the data).
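For anyone wanting to try the Rtab route, here is a minimal sketch in Python. It assumes the Rtab file is tab-separated with a header row (`Gene`, then one column per sample) and 0/1 values in each gene row. The function names, the A/C encoding for present/absent, and the definition of "accessory" as "missing from at least one sample" are illustrative choices for this sketch, not Roary's own code.

```python
import csv
import io


def read_rtab(handle):
    """Parse a tab-separated presence/absence table into (samples, genes)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first column is the gene name
    genes = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        genes[row[0]] = [int(value) for value in row[1:]]
    return samples, genes


def accessory_genes(genes, n_samples, min_frac=0.0, max_frac=1.0):
    """Keep genes that vary between samples, optionally bounded by frequency."""
    kept = {}
    for name, calls in genes.items():
        count = sum(calls)
        if count == 0 or count == n_samples:
            continue  # present in none or in all samples: no signal
        frac = count / n_samples
        if min_frac <= frac <= max_frac:
            kept[name] = calls
    return kept


def to_binary_fasta(samples, genes, present="A", absent="C"):
    """Write one pseudo-sequence per sample, one character per gene.

    The A/C encoding is an assumption made for this sketch.
    """
    records = []
    for i, sample in enumerate(samples):
        seq = "".join(present if calls[i] else absent for calls in genes.values())
        records.append(f">{sample}\n{seq}")
    return "\n".join(records) + "\n"


# Tiny made-up table standing in for a real Rtab file.
DEMO = (
    "Gene\ts1\ts2\ts3\ts4\n"
    "geneA\t1\t1\t1\t1\n"
    "geneB\t1\t0\t0\t1\n"
    "geneC\t0\t0\t1\t0\n"
)

samples, genes = read_rtab(io.StringIO(DEMO))
kept = accessory_genes(genes, len(samples))
fasta = to_binary_fasta(samples, kept)
print(fasta)
```

To use it on real output, replace `io.StringIO(DEMO)` with an open handle on gene_presence_absence.Rtab and write `fasta` to a file that a tree builder can read. Setting `min_frac=0.05, max_frac=0.95` would give a filter in the spirit of the 5% cut-offs mentioned above, though it is not guaranteed to match Roary's output exactly.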
Great thanks Andrew.
Thanks!
Hi, I am analyzing 28 bacterial genomes using Roary.
The summary statistics are as follows:

Ortholog class     Definition                   Count
Core genes         (99% <= strains <= 100%)     4930
Soft core genes    (95% <= strains < 99%)       135
Shell genes        (15% <= strains < 95%)       140
Cloud genes        (0% <= strains < 15%)        163
Total genes        (0% <= strains <= 100%)      5368
So there should be 438 accessory genes (135 + 140 + 163), but each sequence in accessory_binary_genes.fa is only 164 characters long, and there are 383 sets of "variation" listed in the accessory.tab file.