sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

Roary+Gubbins! #330

Closed maesaar closed 7 years ago

maesaar commented 7 years ago

I have created/read the Issues page (issue #267), where it is said that the core genome alignment from Roary is not suitable for detecting recombination with Gubbins. But from time to time I find publications that use the steps "Roary -> core alignment (PRANK) -> Gubbins".

For example, a PhD thesis (http://www.bib.fcien.edu.uy/files/etd/biol/uy24-18262.pdf), page 254, sections 6.3.4 and 6.3.5, says the following:

The core and accessory genomes of C. hypointestinalis were estimated using Roary [225] at 90% identity and 99% coverage. The concatenated core genes were aligned with PRANK [226] and Gubbins was used to remove recombinant blocks...

A second example is a publication (http://aem.asm.org/content/early/2016/04/04/AEM.00362-16.full.pdf+html) whose supplementary material (http://aem.asm.org/content/suppl/2016/05/19/AEM.00362-16.DCSupplemental/zam999117195so1.pdf), page 2, Figure S1, says the following:

Core gene SNP phylogenetic tree. A core gene alignment was performed with Roary (1). Potential recombination was removed from the core gene alignment using Gubbins (2) and a final maximum likelihood phylogeny SNP tree is shown.

Are these publications methodologically sound?

andrewjpage commented 7 years ago

It's certainly strongly discouraged! I would say they are missing SNPs/data from the alignment that they used to build their tree. The real question is whether these missing SNPs actually make any real difference to the final result.
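One rough way to look at the "missing SNPs" question is to count the variable columns in the core-gene alignment and in a whole-genome alignment of the same isolates. The sketch below is only an illustration and is not part of Roary or Gubbins: `read_fasta` and `count_variable_sites` are hypothetical helper names, and the counting rule (only unambiguous A/C/G/T bases are considered) is an assumption.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {header: sequence} (hypothetical helper)."""
    parts: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            parts[name] = []
        elif name is not None:
            parts[name].append(line.upper())
    return {n: "".join(chunks) for n, chunks in parts.items()}


def count_variable_sites(alignment: Dict[str, str]) -> int:
    """Count alignment columns that contain more than one distinct A/C/G/T base.

    Gaps and ambiguity codes are ignored (an assumption made for this sketch).
    """
    seqs = list(alignment.values())
    if not seqs:
        return 0
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences differ in length; input is not an alignment")
    variable = 0
    for column in zip(*seqs):
        bases = {b for b in column if b in "ACGT"}
        if len(bases) > 1:
            variable += 1
    return variable
```

Running `count_variable_sites(read_fasta(open(path)))` on both alignments (the paths are placeholders) and comparing the two counts gives a rough indication of how many SNPs the core-gene alignment leaves out. It does not say whether those SNPs would change the tree.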

maesaar commented 7 years ago

So it depends on the lost SNPs - thanks.

GuilhemRoyer commented 4 years ago

Hi Andrew,

I have a question about those comments. Do you think it would be suitable to take the following steps to overcome these issues with the Roary-based alignment:

1. Run Gubbins on each core-gene alignment independently (i.e. 2000 core genes ==> 2000 independent alignments ==> 2000 Gubbins runs).
2. Then concatenate the *.filtered_polymorphic_sites.fasta files for each strain.
3. Perform a phylogenetic analysis on these recombination-free concatenated sequences?

Thanks!

Guilhem

andrewjpage commented 4 years ago

Hi Guilhem, Sorry, I'm afraid that's not going to work. Gubbins needs to use the whole genome to detect regions of increased SNP density and doesn't work on a small scale (like the gene level). In a pan genome context, recombination will probably be represented as different clusters in the accessory genome rather than being in the core. Regards, Andrew
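The point about scale can be illustrated with a toy sliding-window calculation. This is not how Gubbins itself works internally; it is a simplified sketch of the idea of "increased SNP density relative to the background". The function names, the non-overlapping windows, and the 3-fold threshold are all assumptions chosen for illustration.

```python
from typing import List


def snp_density_windows(snp_positions: List[int], length: int, window: int) -> List[float]:
    """Return SNPs per site for consecutive non-overlapping windows."""
    if window <= 0 or length <= 0:
        raise ValueError("length and window must be positive")
    n_windows = (length + window - 1) // window
    counts = [0] * n_windows
    for pos in snp_positions:
        if not 0 <= pos < length:
            raise ValueError(f"position {pos} is outside the alignment")
        counts[pos // window] += 1
    densities = []
    for i, count in enumerate(counts):
        size = min(window, length - i * window)  # the last window may be shorter
        densities.append(count / size)
    return densities


def flag_dense_windows(densities: List[float], fold: float = 3.0) -> List[int]:
    """Return indices of windows whose density exceeds `fold` times the mean."""
    if not densities:
        return []
    mean = sum(densities) / len(densities)
    if mean == 0:
        return []
    return [i for i, d in enumerate(densities) if d > fold * mean]


# A 10 kb stretch: one background SNP per kb plus a dense cluster near 4 kb.
background = list(range(500, 10000, 1000))
cluster = list(range(4000, 4100, 10))
genome_scale = snp_density_windows(background + cluster, length=10000, window=1000)
print(flag_dense_windows(genome_scale))

# The same cluster seen in isolation within a single 1 kb gene alignment.
gene_scale = snp_density_windows([p - 4000 for p in cluster], length=1000, window=1000)
print(flag_dense_windows(gene_scale))
```

At the 10 kb scale the cluster stands out against the surrounding windows. Within a single gene there is no background to compare against, so the same cluster is not flagged.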

GuilhemRoyer commented 4 years ago

That's also what I was afraid of, but I was not sure. Thank you for your answer and the useful comment on core gene recombination!