sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

discussion for why modulo hash / scaled signatures are awesome #606

Closed ctb closed 2 years ago

ctb commented 5 years ago

From private conversations with @luizirber @bluegenes @halexand recently -- scaled signatures are different from MinHash because:

These properties need to be clearly laid out, discussed, evaluated empirically, and (ideally) described theoretically. cc @dkoslicki

ctb commented 5 years ago

(of course, this all needs to be balanced against the point that they can grow indefinitely :)

ctb commented 5 years ago

note Richard Durbin's modimizer, which uses similar concepts! https://github.com/richarddurbin/modimizer - the README is informative for this issue.

ctb commented 5 years ago

a more succinct way of putting the containment guarantees above is "Containment never decreases as you get more data" (which is especially nice for streaming)
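A minimal sketch of that streaming property, assuming a toy scaled sketch rather than sourmash's own implementation. The names `hash_kmer`, `scaled_sketch`, and `containment` are hypothetical, a SHA-256 prefix stands in for the real hash function, and `ksize=4` / `scaled=2` are illustrative values chosen only to keep the example small.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space used by the stand-in hash


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (SHA-256 prefix as a stand-in hash)."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def scaled_sketch(seq, ksize=4, scaled=2):
    """Keep every k-mer hash that falls below MAX_HASH / scaled."""
    threshold = MAX_HASH // scaled
    hashes = set()
    for i in range(len(seq) - ksize + 1):
        h = hash_kmer(seq[i:i + ksize])
        if h < threshold:
            hashes.add(h)
    return hashes


def containment(query, subject):
    """Fraction of the query sketch that is also present in the subject sketch."""
    return len(query & subject) / len(query) if query else 0.0


# Hypothetical query genome and a stream of reads arriving one at a time.
query = scaled_sketch("ACGTACGTTGCAAGGCTTACCGATAGC")
reads = ["ACGTACGTTG", "GCAAGGCTTA", "TTACCGATAGC", "GGGGGGGGGG"]

seen = set()   # running sketch of everything streamed so far
history = []   # containment of the query after each read
for read in reads:
    # A hash is kept or dropped based only on its own value, so the sketch
    # of "all reads so far" is just the union of the per-read sketches.
    seen |= scaled_sketch(read)
    history.append(containment(query, seen))

print(history)
```

Because hashes are only ever added to `seen`, the intersection with the query can only grow, so `history` is non-decreasing. A fixed-size (bottom-k) MinHash does not behave this way, since new data can evict hashes that were previously shared with the query.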

ctb commented 4 years ago

note also that you can subtract and add scaled signatures, and filter them on abundance, and other things, without fear.
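A rough illustration of why these operations are safe, again using a toy scaled sketch and not the sourmash API. The helper names (`hash_kmer`, `kmers`, `scaled_sketch`), the SHA-256 stand-in hash, and the parameter values are assumptions made for the example. The idea is that each hash is retained based only on its own value, so operating on sketches gives the same result as sketching the operated-on k-mers (barring hash collisions).

```python
import hashlib
from collections import Counter

MAX_HASH = 2**64  # size of the 64-bit hash space used by the stand-in hash


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (SHA-256 prefix as a stand-in hash)."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def kmers(seq, ksize=4):
    """List all overlapping k-mers of a sequence, duplicates included."""
    return [seq[i:i + ksize] for i in range(len(seq) - ksize + 1)]


def scaled_sketch(kmer_iter, scaled=2):
    """Count every k-mer hash below MAX_HASH / scaled (hash -> abundance)."""
    threshold = MAX_HASH // scaled
    sketch = Counter()
    for kmer in kmer_iter:
        h = hash_kmer(kmer)
        if h < threshold:
            sketch[h] += 1
    return sketch


# Two hypothetical samples that share some k-mers.
a = kmers("ACGTACGTTGCAAGGCTTAC")
b = kmers("TTGCAAGGCTTACCGATAGC")
sk_a, sk_b = scaled_sketch(a), scaled_sketch(b)

# Add: summing the sketches equals sketching the pooled k-mers.
added = sk_a + sk_b

# Subtract: dropping hashes seen in b equals sketching the k-mers unique to a.
subtracted = {h for h in sk_a if h not in sk_b}

# Abundance filter: keep hashes seen at least twice in the pooled data.
abundant = {h for h, count in added.items() if count >= 2}
```

None of these hold in general for a fixed-size MinHash, where removing or adding hashes changes which hashes would have made it into the bottom-k set.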

ctb commented 2 years ago

preprinted and available! see link in #823.