Tips on generating pseudoalignment

vehuardo commented 1 year ago

Thanks for making this useful tool!

I successfully used the distance tree workflow, but I'm struggling to understand how to generate the pseudoalignment (the -a 'alignment.fasta' to be masked) in the ML tree workflow. Any tips?

For context, I have a dataset consisting of 439 Serratia marcescens genomes, and was able to execute the first (verticall pairwise) step in the ML workflow.

rrwick commented 1 year ago

If your S. marcescens genomes are diverse (e.g. spanning different lineages of the species), then I'd expect better results from the distance-tree workflow. And if your genomes are very closely related, you might want to try Gubbins instead of Verticall. I think the ML-tree workflow's niche is mainly for closely related datasets that are too big for Gubbins (e.g. thousands of genomes).

But you're of course welcome to try! I'd recommend using Snippy to make the whole-genome pseudoalignment. Briefly, you run it on each isolate like this:

snippy --outdir sample_abc --R1 reads_1.fastq.gz --R2 reads_2.fastq.gz --ref reference.fasta --cpus 8

Then run snippy-core:

snippy-core --ref reference.fasta sample_*

And finally use snippy-clean_full_aln to produce a file ready for Verticall:

snippy-clean_full_aln core.full.aln > clean.full.aln

That final file (clean.full.aln) should be appropriate for both Verticall and Gubbins. Good luck!
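
For a dataset with hundreds of isolates, the three steps above can be scripted. The following is only a minimal sketch, not part of Snippy or Verticall: it assumes a hypothetical layout where each isolate's paired reads sit in one directory as <sample>_1.fastq.gz and <sample>_2.fastq.gz, and the function name snippy_commands is made up for illustration. It prints the commands as a dry run rather than executing them, so they can be checked first.

```shell
#!/bin/sh
# Dry-run sketch: print the Snippy commands for every read pair in a directory.
# Assumed (hypothetical) layout: <reads_dir>/<sample>_1.fastq.gz and <sample>_2.fastq.gz

snippy_commands() {
    reads_dir=$1
    ref=$2
    for r1 in "$reads_dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue                 # glob matched nothing: skip
        sample=$(basename "$r1" _1.fastq.gz)     # strip directory and suffix
        r2="$reads_dir/${sample}_2.fastq.gz"
        # Step 1: one snippy run per isolate
        echo "snippy --outdir sample_$sample --R1 $r1 --R2 $r2 --ref $ref --cpus 8"
    done
    # Step 2: combine the per-isolate results into a core alignment
    echo "snippy-core --ref $ref sample_*"
    # Step 3: clean the full alignment for Verticall or Gubbins
    echo "snippy-clean_full_aln core.full.aln > clean.full.aln"
}

# Defaults here are placeholders; pass your own reads directory and reference.
snippy_commands "${1:-reads}" "${2:-reference.fasta}"
```

Once the printed commands look right, they can be run by piping the output to sh. If only assemblies are available rather than reads, Snippy can also take contigs through its --ctgs option in place of --R1 and --R2.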

vehuardo commented 1 year ago

Thanks Ryan, the Snippy approach you described worked well. You're probably right that this dataset would be better suited to Gubbins or ClonalFrameML, but it was still nice to try, and I'm eager to test it on larger datasets!
