Closed by rndg20 8 years ago
Sorry, you cannot use the output of Roary as the input to Gubbins (e.g. to detect recombination in the pan genome). They are fundamentally incompatible, and we don't have a solution yet for this open problem.
Hi Andrew, the original question by rndg20 was about the core genome, but your reply is about the pan genome. Can a core gene superalignment as created by Roary be used to detect recombination events in this representation of the core genome (provided that Roary orders the genes)? If not, why not? What is the issue there?
Dealing with Gubbins specifically, it requires the full length of the genome, including intergenic regions, aligned back to a single reference to calculate the SNP density, which it then uses to detect recombinations. The core genome alignment from Roary, however, doesn't include intergenic regions (so lots of SNPs will be missing), its length is a lot shorter than the full length of the genome (messing up the stats Gubbins is based on), and the synteny isn't necessarily correct. If there has been a recombination with a different gene popped in, with aligners you would have missing data, but with Roary you would have 2 different clusters, both of which would be flagged as accessory genes (with a bubble in the graph). This makes Gubbins unsuitable for use with Roary.
It should be possible to detect recombinations in pan-genomes, but I don't know if it's been solved yet.
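To make the SNP density point concrete, here is a toy sketch in Python. It is not Gubbins' actual algorithm; the function names, the toy sequences, and the gene coordinates are all made up for illustration. It only shows how dropping intergenic columns changes both the SNP count and the alignment length that a density is computed from.

```python
def snp_columns(seqs):
    """Return indices of alignment columns with at least two different bases (A/C/G/T)."""
    cols = []
    for i in range(len(seqs[0])):
        bases = {s[i].upper() for s in seqs} & set("ACGT")
        if len(bases) > 1:
            cols.append(i)
    return cols


def windowed_snp_density(seqs, window):
    """SNPs per site in consecutive, non-overlapping windows (illustrative only)."""
    snps = snp_columns(seqs)
    length = len(seqs[0])
    densities = []
    for start in range(0, length, window):
        end = min(start + window, length)
        count = sum(1 for p in snps if start <= p < end)
        densities.append(count / (end - start))
    return densities


def mutate(seq, changes):
    """Return a copy of seq with the given {position: base} substitutions applied."""
    chars = list(seq)
    for pos, base in changes.items():
        chars[pos] = base
    return "".join(chars)


# Hypothetical whole-genome alignment: 3 samples, 20 sites.
ref = "A" * 20
genome = [ref, mutate(ref, {4: "T", 10: "C", 17: "G"}), ref]

# Hypothetical gene coordinates (half-open); sites 8..11 are intergenic.
genes = [(0, 8), (12, 20)]
genes_only = ["".join(s[a:b] for a, b in genes) for s in genome]

genome_snps = snp_columns(genome)      # includes the intergenic SNP
genes_snps = snp_columns(genes_only)   # intergenic SNP is gone, coordinates shift

print(len(genome[0]), genome_snps)
print(len(genes_only[0]), genes_snps)
print(windowed_snp_density(genome, 10))
print(windowed_snp_density(genes_only, 10))
```

In this toy case the gene-only alignment is shorter, has one SNP fewer, and the remaining SNPs sit at different coordinates, so any window-based density differs from the whole-genome one.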
I may have to disagree there. I don't think removal of the intergenic regions matters much. The molecular clock in the intergenic regions could even be different from that of genic regions, so removing them will have no effect; possibly there's even a positive effect on detecting recombination regions (as you will be comparing only genic SNP densities). The comment about messing up the stats is unclear, and I cannot see why that would happen.
Missing SNPs will not be much of an issue, as long as the order of the genes is as correct as possible. I can imagine, though, that if there's a huge intergenic region between 2 recombination regions, Gubbins would detect two recombination regions in a core genome alignment but only one in a core gene superalignment, as it sees them as one region. That could be problematic, but practically it does not really change the outcome. I do agree that for proper recombination detection, inversions etc. should be taken into account per branching event on the tree; however, no software does that as far as I know.
About the paralog splitting: it would result in missing data either way, as with aligners, with or without splitting paralogs (the -s flag). Either the gene is not part of the core gene superalignment because it's not core, or it's not part of the core gene superalignment because some genomes have duplicate entries.
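The merging effect described above can be sketched with a small coordinate mapping. This is only an illustration under stated assumptions: the gene intervals and recombination blocks are hypothetical, genes are assumed to be non-overlapping and listed in genome order, alignment gaps are ignored, and `genome_to_superalignment` is a made-up helper, not part of Roary or Gubbins.

```python
def genome_to_superalignment(pos, genes):
    """Map a genome position to its position in the concatenated gene alignment.

    genes: list of half-open (start, end) intervals in genome order.
    Returns None if the position is intergenic (absent from the superalignment).
    """
    offset = 0
    for start, end in genes:
        if start <= pos < end:
            return offset + (pos - start)
        offset += end - start
    return None


# Two hypothetical core genes separated by a 5 kb intergenic region.
genes = [(0, 1000), (6000, 7000)]

# Two hypothetical recombination blocks (half-open, genome coordinates).
block_a = (800, 1000)    # ends at the end of gene 1
block_b = (6000, 6200)   # starts at the start of gene 2

gap_in_genome = block_b[0] - block_a[1]

last_of_a = genome_to_superalignment(block_a[1] - 1, genes)
first_of_b = genome_to_superalignment(block_b[0], genes)
gap_in_superalignment = first_of_b - (last_of_a + 1)

print(gap_in_genome)           # sites separating the blocks in the genome
print(gap_in_superalignment)   # sites separating them after concatenation
print(genome_to_superalignment(3000, genes))  # intergenic site has no mapping
```

With these made-up coordinates the two blocks are far apart in the genome but directly adjacent in the superalignment, which is the situation where they could be reported as a single region.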
Hi
I was wondering whether Gubbins would be able to accurately detect recombination from a core gene alignment produced by Roary?