nickjcroucher / gubbins

Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins
http://nickjcroucher.github.io/gubbins/
GNU General Public License v2.0

Is it possible to use gubbins to remove recombination from a core gene alignment? #169

Closed rndg20 closed 8 years ago

rndg20 commented 8 years ago

Hi

I was wondering whether Gubbins would be able to accurately detect recombination in a core gene alignment produced by Roary?

andrewjpage commented 8 years ago

Sorry. You cannot use the output of Roary as the input to Gubbins (e.g. to detect recombination in the pan genome). They are fundamentally incompatible and we don't have a solution yet for this open problem.

aldertzomer commented 8 years ago

Hi Andrew. The original question by rndg20 was about the core genome. Your reply is about the pan genome. Can a core gene superalignment as created by Roary be used to detect recombination events in this representation of the core genome (provided that Roary orders the genes)? If not, why not? What is the issue there?

andrewjpage commented 8 years ago

Dealing with Gubbins specifically, it requires the full length of the genome, including intergenic regions, aligned back to a single reference to calculate the SNP density, which it then uses to detect recombinations. The core genome alignment from Roary, however, doesn't include intergenic regions (so lots of SNPs will be missing), its length is a lot shorter than the full length of the genome (messing up the stats the detection is based on), and the synteny isn't necessarily correct. If there has been a recombination with a different gene popped in, with aligners you would have missing data, but with Roary you would have 2 different clusters, both of which would be flagged as accessory genes (with a bubble in the graph). This makes Gubbins unsuitable for use with Roary.
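To make the SNP-density idea concrete, here is a minimal Python sketch that counts SNPs against a reference in fixed windows along an alignment. It is only an illustration of "density along the genome", not Gubbins' actual algorithm; the function names `snp_positions` and `window_density`, the window size, and the toy sequences are all assumptions made for this example.

```python
def snp_positions(reference, sample):
    """Return 0-based alignment columns where sample differs from reference.

    Columns with a gap or N in either sequence are skipped (treated as
    missing data rather than as SNPs).
    """
    ignore = set("-Nn")
    return [
        i
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s and r not in ignore and s not in ignore
    ]


def window_density(positions, length, window):
    """SNPs per site in consecutive, non-overlapping windows."""
    n_windows = (length + window - 1) // window
    counts = [0] * n_windows
    for p in positions:
        counts[p // window] += 1
    # The last window may be shorter than the others.
    return [
        c / min(window, length - i * window)
        for i, c in enumerate(counts)
    ]


# Toy data (hypothetical): one SNP, one gapped column that is ignored.
reference = "ACGTACGT"
sample = "ACCTAC-T"
positions = snp_positions(reference, sample)
densities = window_density(positions, len(reference), window=4)
```

Because the density is expressed per site along the alignment, dropping columns (for example intergenic regions) changes both the SNP counts and the coordinates the windows are laid over.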

It should be possible to detect recombinations in pan-genomes, but I don't know if it's been solved yet.

aldertzomer commented 8 years ago

I may have to disagree there. I don't think removal of the intergenic regions matters much. The molecular clock in the intergenic regions could even be different from that of genic regions, so removing them will have no effect; possibly there's even a positive effect on detecting recombination regions (as you will be comparing only genic SNP densities). The comment about messing up the stats is unclear, and I cannot see why that would happen.

Missing SNPs will not be much of an issue, as long as the order of the genes is as correct as possible. I can imagine, though, that if there's a huge intergenic region between 2 recombination regions, Gubbins would detect two recombination regions in a core genome alignment but only one in a core gene superalignment, as it sees them as one region. That could be problematic, but practically it does not really change the outcome. I do agree that for proper recombination detection, inversions etc. should be taken into account per branching event on the tree; however, no software does that as far as I know.
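The merging scenario can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how Gubbins or Roary work internally: the gene intervals, SNP positions, the `max_gap` threshold, and the helper names `to_core_coords` and `clusters` are all hypothetical.

```python
def to_core_coords(snps, genes):
    """Map genome positions onto a concatenation of gene intervals.

    genes is an ordered list of half-open (start, end) intervals.
    SNPs that fall outside every gene (intergenic) are dropped.
    """
    mapped = []
    offset = 0
    for start, end in genes:
        mapped.extend(offset + (p - start) for p in snps if start <= p < end)
        offset += end - start
    return mapped


def clusters(positions, max_gap):
    """Group positions into runs whose neighbours are at most max_gap apart."""
    groups = []
    for p in sorted(positions):
        if groups and p - groups[-1][-1] <= max_gap:
            groups[-1].append(p)
        else:
            groups.append([p])
    return groups


# Two genes separated by a 4900 bp intergenic region (hypothetical numbers).
genes = [(0, 100), (5000, 5100)]
# One dense SNP run at the end of gene 1, another at the start of gene 2.
snps = [90, 95, 99, 5001, 5004, 5008]

genome_clusters = clusters(snps, max_gap=50)  # two separate runs
core_snps = to_core_coords(snps, genes)       # intergenic gap removed
core_clusters = clusters(core_snps, max_gap=50)  # a single merged run
```

In genome coordinates the two runs sit about 4.9 kb apart and stay separate; once the genes are concatenated they become adjacent and collapse into one run, which is the case described above.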

About the paralog splitting: it would result in missing data either way, as with aligners, with or without splitting paralogs (the -s flag). Either the gene is not part of the core gene superalignment because it's not core, or it's not part of the core gene superalignment because some genomes have duplicate entries.