thoughts on finding all strains of a bacterial species in multiple metagenomes

request sent privately:

I am interested in finding strains of an bacteria (say E. Coli) in ~100 metagenomes. What’s the best way to do this?

Misc thoughts:

suggest adopting a triage workflow: do the search with as many examples as you can to be as inclusive as possible, and then filtering results afterwards to weed out things that are bad.
specifically, start with k=21 search (to find all species level matches) and then use k=51 to nail down specific strains - the k=51 results should be highly strain specific!
as long as it's for a reasonably well known species with a genome > 100kb, you can do a search at a fairly high scaled, like 10k, without fear.
an initial search to find matching metagenomes could be done with a merged sketch and prefetch - that is, grab all relevant species, merge them into a single sketch with sig merge and then find them in metagenomes, and only then do your strain specific profiling on those metagenomes.
we have fast ways of searching lots of metagenomes that are ~10x lower memory and ~10-1000x faster than the sourmash command line, using branchwater. might or might not be worth it but wanted to mention it. ping me here if you want to try!
picklists are pretty useful for extracting lists of genomes from e.g. GTDB. something like sourmash sig grep <pattern> <database> --csv search-results.csv will give you a manifest that you can then use with commands like sourmash sig merge ... --picklist search-results.csv::manifest to select just the genomes that match a particular keyword.
(you can do something similar based on taxonomy spreadsheets with sourmash tax grep, again, hit me up if you need specific commands)
to find the best match to the actual strain, do sourmash gather, I think? note that many metagenomes seem to contain fragments of many different strains, probably because metagenomes actually contain one or more composites of multiple strains
on that front, the new abundhist plugin may be useful to dissect abundances of strains in metagenomes. happy to describe workflow but it's something like "use intersect to get abundances of JUST your k-mer species pangenome in metagenome, then use abundhist to look at histogram and look for multiple coverage peaks"
and of course spacegraphcats is really your friend if you're trying to get ahold of the actual genomes.

that's my first set of hot takes, happy to discuss more! and of course @bluegenes and @taylorreiter will have more thoughts!

sourmash-bio / sourmash

thoughts on finding all strains of a bacterial species in multiple metagenomes #2704