sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

use cases #208

Open ctb opened 7 years ago

ctb commented 7 years ago

This issue can serve as a placeholder for use cases for sourmash/MinHash more generally.

Stuff we already have implemented:

Off-label and emerging use cases:

please add more here - we're in danger of forgetting all the great ideas we come up with ;)

ctb commented 7 years ago

tetramer nucleotide clustering

basic kmer searching (--scaled 1)
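
For anyone wondering why `--scaled 1` amounts to basic k-mer searching: with a scaled value of 1 the hash cutoff is the whole 64-bit range, so every k-mer hash is kept and containment becomes an exact k-mer lookup. Here is a rough pure-Python sketch of that idea. It is not sourmash's code: the function names are made up for illustration, SHA-1 stands in for the real hash function, and reverse-complement canonicalization is skipped.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (assumption: SHA-1 as a stand-in hash)."""
    digest = hashlib.sha1(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def scaled_sketch(seq, k, scaled):
    """Keep only hashes below 2**64 / scaled; scaled=1 keeps every k-mer hash."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def containment(query, subject):
    """Fraction of the query's hashes that also appear in the subject."""
    if not query:
        return 0.0
    return len(query & subject) / len(query)


# Toy sequences, invented for this example.
genome = "ACGTACGTTAGCCGATAGCTAGGCTA"
fragment = genome[5:15]  # a substring, so all of its k-mers occur in genome

full_genome = scaled_sketch(genome, k=4, scaled=1)
full_fragment = scaled_sketch(fragment, k=4, scaled=1)
print(containment(full_fragment, full_genome))
```

Larger scaled values keep a subset of the same hashes, which is what makes the sketches small at the cost of resolution for short queries.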

ctb commented 7 years ago

contamination detection

ctb commented 7 years ago

via Cameron Thrash, "when we have pure culture genomes and want to see in which datasets we can recruit large numbers of reads for ecological comparison"

ctb commented 7 years ago

I think "find NCBI accession of genome you're working with" could actually be expanded quite a bit - this could be a super convenient approach to getting full taxonomic information for something quickly, linking out to public databases, and cross-referencing across what NCBI/SRA/IMG/etc have made available. Actually a pretty exciting solution to a whole host of problems.

ctb commented 3 years ago

differential presence of sequences per https://github.com/dib-lab/sourmash/issues/1266 is a pretty good one

ctb commented 1 year ago

metagenome "pivot query" use cases: https://github.com/sourmash-bio/sourmash/issues/485

ctb commented 1 year ago

Dealing with ridiculous amounts of data:

All samples were sequenced using Illumina shotgun metagenomic sequencing on the NovaSeq 6000 platform with 150 bp PE reads. Some samples were sequenced to excessive depth and the total dataset is approx. 20 Terabases in size. In addition to the metagenomes, we grew approx. 5,000 microbial isolates from a subset of the samples and sequenced 3,000 of those genomes to build an in-house microbial genome database