sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

removing redundancy #295

Closed: sheephorse closed this issue 6 years ago

sheephorse commented 7 years ago

I would like to reduce my Illumina metagenome to something approaching unique sequences, i.e. remove sequences that are already represented (though not necessarily exact duplicates that could be removed by dedupe utilities). I want to maintain as much new sequence information as possible, while minimizing the size of the database. I could sort of do this with an assembly and mapping approach, but can I do this at the read level with sourmash?

ctb commented 7 years ago

Hi @sheephorse, what you may want is digital normalization; see

http://ivory.idyll.org/blog/what-is-diginorm.html

Let me know if you have any questions! Diginorm-specific questions, though, should be asked over at github.com/dib-lab/khmer/ rather than here in sourmash.
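For a rough idea of what digital normalization does at the read level, here is a toy sketch in plain Python. It is not khmer's implementation: the function names `kmers` and `diginorm` are made up for illustration, the exact `Counter` stands in for the probabilistic k-mer counting a real tool would need at metagenome scale, reverse complements are ignored, and the `k` and `cutoff` values are placeholders rather than recommendations. The idea is that a read is kept only if the median abundance of its k-mers, among reads kept so far, is still below the cutoff.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq (forward strand only)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diginorm(reads, k=20, cutoff=20):
    """Toy digital normalization.

    Keep a read only if the median count of its k-mers, among the
    reads kept so far, is below `cutoff`. Illustrative only: exact
    counting, no reverse complements, no error trimming.
    """
    counts = Counter()
    kept = []
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue  # read is shorter than k; nothing to estimate
        # Median k-mer abundance approximates the coverage of this read.
        if median(counts[km] for km in kms) < cutoff:
            kept.append(read)
            counts.update(kms)  # only kept reads contribute to counts
    return kept


if __name__ == "__main__":
    redundant = "ACGTTGCAAG"
    novel = "TTTTGGGGCC"
    reads = [redundant] * 5 + [novel]
    # With cutoff=2, copies of the redundant read stop being kept once
    # their k-mers have been seen twice; the novel read is still kept.
    print(diginorm(reads, k=4, cutoff=2))
```

Reads that only add coverage to sequence already seen are dropped, while reads carrying new k-mers get through, which is roughly the "keep new information, shrink the data" behaviour asked about above. For real data, use the khmer tooling rather than a sketch like this.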