Open ctb opened 4 years ago
One interesting thought: we are in the weird territory of having nothing especially novel, theoretically speaking; but we have something that, now that we've explored it in depth, has a lot of power. It seems futile to me to try to split out what we've done from modulo hash, modimizer, etc. Also, we've gained a lot (or at least I have) from Brad's and Rayan's reviews, in terms of understanding what is worth explicating. Do they get an acknowledgement, or an authorship? What about Richard Durbin, esp. if we cover some of the ideas in his modimizer writeup in this paper?
hmm, you know what? it'd be fun to write this using manubot.
Under way here: https://github.com/dib-lab/2020-paper-sourmash-gather/
Available on bioRxiv as "Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers".
Ref #606.
The reviewers for Pierce et al., 2019 were enthusiastic about some of the algorithmic ideas we hinted at in sourmash (and correctly pointed out that we did not adequately discuss them in the paper).
Brad Solomon wrote:
and
Rayan Chikhi wrote:
and
and
Now that the power of this fully operational k-mer subsampling mechanism is becoming apparent, it would be good to address these questions and/or lay out the information in #606 for CS-y folk. More random thoughts, with some of #606 reorganized around recent progress and thinking:
Should we/can we name scaled signatures "DensityHash"? cf. also Richard Durbin's modimizer. I haven't directly compared to that, and don't know to what extent our "bottom density hash" approach is similar in implementation.
we should compare directly to minimizers. The modimizer text does a nice job of this. Perhaps we can flesh that comparison out.
the disadvantages of what I understand the mash screen approach to be are significant with respect to some of our use cases (and vice versa: DensityHash is terrible on viruses, for example). Can we write a fair comparison?
the comparative approach used in the recent GTDB posts seems like another powerful argument for the densityhash approach.
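To make the comparison between the "bottom density hash" idea and a modulo-hash style selection concrete, here is a minimal sketch. It is an illustration under stated assumptions, not sourmash's or modimizer's actual code: the function names are hypothetical, BLAKE2b stands in for MurmurHash3 (which sourmash uses but which is not in the Python standard library), and canonical (reverse-complement aware) k-mers are ignored for brevity.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space


def hash_kmer(kmer: str) -> int:
    """Illustrative 64-bit hash (stand-in for MurmurHash3; an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def kmers(seq: str, k: int):
    """Yield every overlapping k-mer in seq (forward strand only)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def scaled_sketch(seq: str, k: int, scaled: int) -> set:
    """Keep hashes in the bottom 1/scaled of the hash space."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def modulo_sketch(seq: str, k: int, scaled: int) -> set:
    """Keep hashes divisible by scaled (modulo-hash style selection)."""
    return {h for h in map(hash_kmer, kmers(seq, k)) if h % scaled == 0}
```

Both selections keep roughly 1/scaled of the distinct k-mers, and both are context-free: whether a k-mer is kept depends only on its own hash, not on its neighbors, which is the main contrast with window-based minimizers.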
a rough outline
(this is a very non-CS-y outline, because that's how I think. how this is actually written will depend on who takes up the mantle of first author and @ctb wrangler :)
editable here: https://hackmd.io/rxlN-Nv6Q-qhtim7EXCofw?both
outline of a DensityHash paper
Perhaps we should focus on DNA specifically, so that we can highlight issues of high error rates, and their impact; and also value of containment.
Brief discussion of densityhash approach - algorithm.
--scaled
Practical details and implementation.
Simulations, maybe? It'd be good to do some.
Discussion of applications and comparison
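As a small illustration of why containment comes up in the outline: once two scaled sketches are brought to the same scaled value, containment and Jaccard similarity can be estimated directly from the retained hash sets. The sketch below works on plain sets of 64-bit integers; the function names are hypothetical and this is not the sourmash API.

```python
MAX_HASH = 2**64  # size of the 64-bit hash space


def downsample(hashes: set, scaled: int) -> set:
    """Restrict a sketch to the bottom 1/scaled of the hash space.

    Only valid when the input was built with a scaled value <= scaled.
    """
    threshold = MAX_HASH // scaled
    return {h for h in hashes if h < threshold}


def containment(query: set, reference: set) -> float:
    """Estimated fraction of the query's k-mers found in the reference."""
    if not query:
        return 0.0
    return len(query & reference) / len(query)


def jaccard(a: set, b: set) -> float:
    """Estimated Jaccard similarity of the underlying k-mer sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)
```

Containment is asymmetric, so a small genome can be fully contained in a large metagenome even when the Jaccard similarity between the two is close to zero.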