sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

using sourmash to select best genome for mapping #2334

Open ctb opened 1 year ago

ctb commented 1 year ago

This came up on microbial bioinformatics slack, thought I'd share - topic was selecting from many viral genomes.

it should be possible to use large-scale ANI-style analyses to select the closest genome for mapping. we’ve been doing this with sourmash and genome-grist for metagenomes, and I know tools like ganon and I think kmcp can do the same thing. with sourmash I would say the first thing to try is:

  • sketch all your genome references with sourmash sketch dna -p scaled=100 *.fna
  • do the same with your metagenome(s)/shotgun reads
  • run sourmash prefetch <metagenome>.sig genome*.fna.sig --threshold-bp=0 -o matches.csv
  • sort matches.csv on f_match_query and pick the highest value (this is "k-mer detection" per https://github.com/sourmash-bio/sourmash/issues/2170) and use that match_name as the reference genome.
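The last step above (sorting on f_match_query) can be sketched in Python as follows. This is a minimal illustration, not part of sourmash itself: it assumes matches.csv is the prefetch output from the previous step and that it has the f_match_query and match_name columns mentioned above. The function name pick_best_reference is made up for this example.

```python
import csv


def pick_best_reference(matches_csv):
    """Return (match_name, f_match_query) for the highest f_match_query.

    Hypothetical helper: assumes the CSV has 'match_name' and
    'f_match_query' columns, as in the prefetch output described above.
    Returns None if the file contains no matches.
    """
    with open(matches_csv, newline="") as fp:
        rows = list(csv.DictReader(fp))

    if not rows:
        return None

    # The highest f_match_query is the reference with the largest
    # fraction of its k-mers detected in the metagenome.
    best = max(rows, key=lambda row: float(row["f_match_query"]))
    return best["match_name"], float(best["f_match_query"])
```

Calling pick_best_reference("matches.csv") should give the name of the candidate reference genome along with its detection value. It is worth checking that value before mapping, since a very low number suggests none of the references is a close match.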

happy to help troubleshoot here or on sourmash issue tracker if there is interest.

A more fun and “sophisticated” approach that could go horribly awry is to use sourmash gather after the prefetch to develop a minimum metagenome cover, but that’s only for people that are ok with expending some of their time and energy on a potential wild goose chase (which I am happy to support, but, you know, still)

ctb commented 11 months ago

Added to #2184 in 69dfbc08:

Can I use sourmash to determine the best reference genome for mapping my reads?

Yes! (And see the FAQ above, "How do k-mer analyses compare with read mapping?")

If you're interested in picking a single best reference genome (from a large database) for read mapping, you can do the following:

If you want to map a metagenome to multiple references, consider using sourmash gather and/or the genome-grist workflow.

(This is also known as "read recruitment.")