tseemann / snippy

:scissors: :zap: Rapid haploid variant calling and core genome alignment

Core genome alignment on a large number of isolates? #379

Open apredeus opened 4 years ago

apredeus commented 4 years ago

Hello Torsten,

I wanted to see how much difference the alignment strategy makes to my phylogeny (I've used the core gene alignment from Roary before). However, running with default settings on a set of ~3000 Salmonella Enteritidis genomes produced a core alignment of exactly 2 positions (out of ~4.6M).

Is this normal? If you include non-ACGT characters (Ns and "-"s), you get about 50k positions, which is similar to what I've seen from Roary/mafft, but with far more per-sample "missing" positions than in Roary (I guess that's what happens with a reference-based strategy, right?). Otherwise, what would you say is the best way to generate a core genome alignment for thousands of sequences?
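To make the two counts above concrete, here is a minimal Python sketch of the distinction, not of how snippy-core itself computes anything. It assumes the alignment is already loaded as a dict of equal-length strings, and it treats anything other than uppercase A, C, G or T as missing; the function name `count_core_columns` and the toy sequences are made up for illustration.

```python
from typing import Dict

ACGT = frozenset("ACGT")  # assumption: only uppercase A/C/G/T count as called bases


def count_core_columns(alignment: Dict[str, str]) -> Dict[str, int]:
    """Count all columns, and columns where every sequence has A, C, G or T."""
    seqs = list(alignment.values())
    if not seqs:
        return {"total": 0, "strict_core": 0}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # A column is "strict core" only if no sequence has N, '-', or any other symbol there.
    strict = sum(1 for column in zip(*seqs) if all(base in ACGT for base in column))
    return {"total": length, "strict_core": strict}


# Toy alignment (hypothetical data): one N and one gap remove two columns from the core.
toy = {
    "ref":     "ACGTACGT",
    "sampleA": "ACGTANGT",
    "sampleB": "AC-TACGT",
}
print(count_core_columns(toy))  # {'total': 8, 'strict_core': 6}
```

Because a single missing character removes the whole column, the strict count can only shrink as more samples are added, which is one way a set of thousands of genomes can end up with almost nothing left.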

Thank you, as always.

tseemann commented 4 years ago

When you start working with a large number of isolates, quality control (QC) becomes very important. 1% of those samples will have bad data and ruin your analysis. You need to be ruthless and remove anything strange.
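One way to look for such samples, sketched here in Python under stated assumptions, is to rank them by the fraction of positions that are not a called base in the full alignment. The helper names `missing_fraction` and `flag_samples` and the 0.2 cutoff are hypothetical and chosen only for the example; they are not part of Snippy or Nullarbor.

```python
from typing import Dict, List


def missing_fraction(alignment: Dict[str, str]) -> Dict[str, float]:
    """Fraction of positions per sample that are not uppercase A, C, G or T."""
    return {
        name: sum(1 for base in seq if base not in "ACGT") / len(seq)
        for name, seq in alignment.items()
        if seq  # skip empty sequences to avoid dividing by zero
    }


def flag_samples(alignment: Dict[str, str], max_missing: float) -> List[str]:
    """Names of samples whose missing fraction exceeds max_missing, worst first."""
    fractions = missing_fraction(alignment)
    flagged = [name for name, frac in fractions.items() if frac > max_missing]
    return sorted(flagged, key=lambda name: fractions[name], reverse=True)


# Toy data (hypothetical): "bad" is missing 6 of 10 positions.
toy = {
    "ref":  "ACGTACGTAC",
    "good": "ACGTACGTNC",
    "bad":  "NNNN--GTAC",
}
print(flag_samples(toy, max_missing=0.2))  # ['bad']
```

Dropping the worst few samples and recomputing the core size shows quickly whether a handful of poor-quality genomes is responsible for a collapsed core alignment.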

Nullarbor has a `make preview` mode which generates a mash tree. Any outliers on that tree should normally be excluded immediately. Kraken results and sequencing yields are also worth checking.

Snippy + SnippyCore will give better results than using Roary concatenated alignments in general, because it works at read level, rather than assembly+annotation level.

That said, too much divergence in your 3000 samples will ruin most analyses. Can you break into MLST or Serovar?

apredeus commented 4 years ago

Thank you for all the suggestions - much appreciated!

> That said, too much divergence in your 3000 samples will ruin most analyses. Can you break into MLST or Serovar?

They are all Enteritidis (according to SISTR cgMLST), and the vast majority are MLST sequence type 11, as Enteritidis tend to be (I think). They are also all more than 90% Salmonella by Kraken2 assignment, and all of them pass the Enterobase criteria for assembly. I've also removed ~70 samples that had too many SNP differences according to snippy - most turned out to be other sequence types.
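For anyone wanting to repeat the "too many SNP differences" filter, the following is one possible way to do it in Python, not necessarily what was done here. It assumes per-sample SNP counts against the reference have already been collected into a dict, and flags samples above the median plus a multiple of the median absolute deviation (MAD). The function name `snp_outliers`, the multiplier of 5 and the counts are all hypothetical.

```python
from statistics import median
from typing import Dict, List


def snp_outliers(snp_counts: Dict[str, int], n_mads: float = 5.0) -> List[str]:
    """Samples whose SNP count exceeds median + n_mads * MAD, sorted by name."""
    values = list(snp_counts.values())
    if not values:
        return []
    med = median(values)
    mad = median(abs(v - med) for v in values)
    # If MAD is 0 (most counts identical), the cutoff collapses to the median itself.
    cutoff = med + n_mads * mad
    return sorted(name for name, count in snp_counts.items() if count > cutoff)


# Toy counts (hypothetical): s5 looks like a different sequence type.
counts = {"s1": 310, "s2": 295, "s3": 330, "s4": 305, "s5": 4200}
print(snp_outliers(counts))  # ['s5']
```

The median and MAD are used instead of the mean and standard deviation because a few very divergent samples would otherwise inflate the threshold they are being tested against.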

I had missed nullarbor completely; it looks very interesting and useful. I've put together my own pipeline to process our genomes. I'll try making a mash tree and see whether anything looks funky there.