write up algorithmic ideas behind scaled minhash

ctb commented 4 years ago

Ref #606.

The reviewers for Pierce et al., 2019 were enthusiastic about some of the algorithmic ideas we hinted at in sourmash (and correctly pointed out that we did not adequately discuss them in the paper).

Brad Solomon wrote:

The ‘modulo approach’ for sketch construction, despite being one of the main innovations of the method, is particularly unclear in the manuscript. The cited literature (Broder 1997) describes an approach that sub-samples hash values based on a modulo factor to address the inherent weakness of a Minhash in a mixture of several distinct components. However the description of the sourmash implementation instead describes splitting the hash space into ‘equal bands’ and selecting only the minimum band. As the existing modulo approach has no guarantees on equal-sized (or even equal-fraction as the manuscript claims elsewhere) sub-sampling, this appears to be a novel and significant contribution to the field. However there are no details that explain (1) how the hash space is divided, (2) how the minimum band is selected, and (3) how downsampling is performed.

and

The inclusion of limited abundance information is a particularly interesting improvement over standard MinHash sketches. The manuscript suggests that the abundance tracking can play a significant role when ‘comparing many signatures’ but there is no concrete claim to assess. While outside the scope or focus of this work, a follow-up piece which explores the theoretical or practical impact of systematically sub-sampled counting information would be potentially high impact.

Rayan Chikhi wrote:

A quick recap of the state of the art in containment search would be helpful. Here you claim to use ‘a modulo approach’. Mash screen and containment minhash use different approaches (see e.g. the blog post of ‘Mash screen’). It would be nice if, in this paper, the usage of the modulo approach was put into perspective compared to those two aforementioned methods.

and

In fact, in the blog post cited as reference 8, Ondov writes that “the modulo approach is problematic for metagenomic applications (e.g. finding a virus in a metagenome).” The problem is indirectly mentioned in the manuscript (“can sacrifice some of the memory and storage benefits of standard MinHash techniques, as the signature size scales with the number of unique k-mers”). It would be neat to get the authors’ comparative perspective here as to why using modulo is the better approach.

and

The description of the modulo approach used is imprecise. How is the hash space divided into s equal ‘bands’ (undefined term), precisely? Also, I suppose this somewhat different from the modulo approach proposed by Broder, and clarified in Mash screen’s blog post, but how so?

Now that the power of this fully operational k-mer subsampling mechanism is becoming apparent, it would be good to address these questions and/or lay out the information in #606 for CS-y folk. More random thoughts, with some of #606 reorganized around recent progress and thinking:

Should we/can we name scaled signatures "DensityHash"? c.f. also Richard Durbin's modimizer. I haven't directly compared to that, and don't know to what extent our "bottom density hash" approach is similar in implementation.
we should compare directly to minimizers. The modimizer text does a nice job of this. Perhaps we can flesh those out.
the disadvantages of what I understand the mash screen approach to be are significant with respect to some of our use cases (and vice versa: DensityHash is terrible on viruses, for example). Can we write a fair comparison?
the comparative approach used in the recent GTDB posts seems like another powerful argument for the densityhash approach.

a rough outline

(this is a very non-CS-y outline, because that's how I think. how this is actually written will depend on who takes up the mantle of first author and @ctb wrangler :)

editable here: https://hackmd.io/rxlN-Nv6Q-qhtim7EXCofw?both

outline of a DensityHash paper

Introduction
- examination of current approaches: minhash, modulo hash, mash screen, hll sketches/dashing.
- separately minimizers and universal k-mer hitting sets
- Discussion of where they fall short for various use cases, in a fair and impartial way :)
- discussion of specific needs that we hope to address
- sergey koren references? (see durbin comment)

Perhaps we should focus on DNA specifically, so that we can highlight issues of high error rates, and their impact; and also value of containment.

Brief discussion of densityhash approach - algorithm.
- choose a density parameter D (--scaled)
- divide hash space into bands of size H / D, where H = hash space size (2**64)
- drawback: sketches increase in size arbitrarily
- features this permits
- subsampling to lower densities
- streaming guarantees (never lose hash value - containment never decreases as you get more data)
- add and subtract hash values (category, "operations directly on sketches"?)
- abundance filter hash values (operations directly on sketches)
- for compatibility with MinHash, choose bottom band
- this permits conversion back and forth
- also lets us use minhash error bounds in many situations...
- idea of combined densityhash and minhash
- how this compares with modulo sketch
- address brad's comment: the existing modulo approach has no guarantees on equal-sized (or even equal-fraction as the manuscript claims elsewhere) sub-sampling
- also permits "stable identifiers" that never change...
- addition of abundance information
- modimizer additions/durbin
- addition of fwd/rc "sign" in sketch info (see comment)
- poisson statistics / random sequence foo
Practical details and implementation.
- pseudocode and diagrams.
- implementation in sourmash, for DNA specifically
- some code demonstrations / sourmash API?
- conversion between densityhash and minhash.
Simulations, maybe? It'd be good to do some.
Discussion of applications and comparison
- dscuss applications & and drawbacks for DNA specifically

ctb commented 4 years ago

One interesting thought: we are in a weird territory of having nothing relatively novel, theoretically speaking; but we have something that, now that we've explored it in depth, has a lot of power. It seems futile to me to try split out what we've done from modulo hash, modimizer, etc. Also, we've gained a lot (or at least I have) from brad and rayan's reviews, in terms of understanding what is worth explicating. Do they get an acknowledgement, or an authorship? What about richard durbin, esp if we cover some of the ideas in his modimizer writeup in this paper?

ctb commented 4 years ago

hmm, you know what? it'd be fun to write this using manubot.

ctb commented 3 years ago

Under way here: https://github.com/dib-lab/2020-paper-sourmash-gather/

ctb commented 2 years ago

available on biorxiv as Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers.

sourmash-bio / sourmash

write up algorithmic ideas behind scaled minhash #823

a rough outline

outline of a DensityHash paper