Open luizirber opened 5 years ago
agreed with all things :)
@luizirber and I recently had a conversation via slack about naming conventions for signatures. His explanation was helpful and I think it could be helpful to add it to the documentation. At the least, now it wont be lost to the ages in disappearing slack threads:
Taylor: "what’s the correct terminology for a signature that contains multiple minhashes in sourmash? like if I create a signature for each contig in a fasta, and put them all in the same signature…what is the baby sig called?"
Luiz:
sooooo... I like calling signature
a collection of sketch
, and a sketch
can be a MinHash (for now), or a draff sketch
, or an HLL, or a bloom filter...
that's how the Rust code is organized (part of the discussion above)
so, how I like to think about it (with examples):
Sketch
: a MinHash(k=21, dna)
for one genomic datasetSignature
: a collection of Sketch
es: [MinHash(k=31), MinHash(k=21)]
, but all related to the same original genomic datasetIndex
: a collection of Signature
s. Not all indices support all signatures at the same time, so they might be constrained to one parameter (like k=21) and Sketch
type that all Signature must haveit's a bit silly to have so many hierarchies when we only have one implementation for Sketch
, and Index
all operate over MinHash
es... but part of that is also being able to quickly add new sketches or indices, and still support the core operations (search
and gather
, or more generally similarity
and containment
). That's part of the reason for https://github.com/dib-lab/sourmash/pull/556, for example
We also have a big conceptual dichotomy, where the code computing signatures behaves as if each SourmashSignature
may contain multiple MinHash
objects, but the code operating on signatures behaves as if each SourmashSignature
has a single MinHash
. Since the signature loading code enforces that, it all works ... as long as we serialize them first 😂. I think I introduced this silliness Back When and @luizirber faithfully copied it over into Rust, probably while shaking his head in dismay.
We also have a big conceptual dichotomy, where the code computing signatures behaves as if each
SourmashSignature
may contain multipleMinHash
objects, but the code operating on signatures behaves as if eachSourmashSignature
has a singleMinHash
. Since the signature loading code enforces that, it all works ... as long as we serialize them first joy. I think I introduced this silliness Back When and @luizirber faithfully copied it over into Rust, probably while shaking his head in dismay.
I'm well aware... But the idea of a Signature as a collection of Sketches came later anyway (during the refactoring), so don't blame yourself too much =]
On the bright side, I tried to keep the Rust codebase doing operations over Signatures instead of over Sketches. For now the first thing these operations do is extract the matching MinHash in each sig and do the regular MinHash operations, but I think it is quite doable to "hide" most of the operations (especially Containment and Similarity) inside the Signature impl in the same way.
This can be lifted to Python too, with the awesome benefit of avoiding a new copy of the MinHash in sig.minhash
every single time it's called (because that's what happens nowadays :fearful:)
I like how you avoided the word "just" in that comment in favor of "quite doable."
as I dug in to SourmashSignature
a bit more in #1179, I noticed that md5sum
is calculated on the sig.minhash
not on the SourmashSignature
itself. So this is another place where we have a split between how we operate on signatures vs how we generate them.
I think this should be punted to 5.0, or later.
We also have a big conceptual dichotomy, where the code computing signatures behaves as if each
SourmashSignature
may contain multipleMinHash
objects, but the code operating on signatures behaves as if eachSourmashSignature
has a singleMinHash
. Since the signature loading code enforces that, it all works ... as long as we serialize them first joy. I think I introduced this silliness Back When and @luizirber faithfully copied it over into Rust, probably while shaking his head in dismay.I'm well aware... But the idea of a Signature as a collection of Sketches came later anyway (during the refactoring), so don't blame yourself too much =]
:)
Over in #1392, I'm starting to make the clear distinction that search and gather return the original SourmashSignature
inserted into the Index
, without any modifications done during the search. This opens the door to doing searches on more complex SourmashSignature
objects (e.g. multiple sketches, multiple types of sketches) where the search uses the "best" sketch type shared between the query signature and the contents of the database, but returns the entire signature. There are conceptual challenges to work out (e.g. what does threshold mean, in the case where you might not know exactly what kind of sketches are being compared!?) but I think this is a desirable kind of flexibility.
This also makes https://github.com/dib-lab/sourmash/issues/198 even more interesting, because then you could build an SBT that lets you find matches based on the kind of sketch that was indexed, but returns a richer set of sketches (as part of the returned signature object).
in re https://github.com/sourmash-bio/sourmash/issues/2039,
@luizirber reminded me:
Another one might be HLL, it already is implemented/exposed as a sketch in Rust. Need to figure out the Python side (more than one sketch per sig? A select method for choosing sketches, like select in indices chooses sigs?)
and now I'm curious, what operations does the HLL (as implemented) currently support? definitely cardinality counting, and merge. does it support containment as a byproduct of merge?
and now I'm curious, what operations does the HLL (as implemented) currently support? definitely cardinality counting, and merge. does it support containment as a byproduct of merge?
since we are using the estimators from https://arxiv.org/abs/1706.07290, we can do similarity and containment
Caveat: estimators don't work well with wildly different cardinalities, so genome containment in metagenome (or vice-versa) don't work well =(
since we are using the estimators from https://arxiv.org/abs/1706.07290, we can do similarity and containment
Caveat: estimators don't work well with wildly different cardinalities, so genome containment in metagenome (or vice-versa) don't work well =(
ok, thx - this makes me less enthusiastic about putting a lot of UX time into adding it, but I think it is an excellent "nth" method to add to expand internal support for more sketches, because a lot of sourmash power comes from the containment stuff.
I think we should separate
query
,signature
andminhash
better around the codebase. It's all pretty entangled, but they are different things!We can also lift up some functionality from MinHash into Signature:
add_sequence
can be aSignature
method, and will calladd_kmer
(or equivalent) in all the MinHash defined for that signature. At the moment we do all this parsing during compute (for example), where for each sequence we need to iterate with the appropriate k size and so on.Another note:
Signature
is a collection of MinHash at the moment, but would be pretty interesting to allow it to keep HLL/BF/CMS/HistoSketch representations of the data too.Some examples
feature/bf_query
I'm attaching a Nodegraph with the Sig/MH content to the query to make search faster (but it is weird, because Signature shouldn't have a Nodegraph attached to it!)(things to consider on #556)