sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
478 stars 79 forks source link

Better conceptual definition on minhash/signature/query #616

Open luizirber opened 5 years ago

luizirber commented 5 years ago

I think we should separate query, signature and minhash better around the codebase. It's all pretty entangled, but they are different things!

We can also lift up some functionality from MinHash into Signature: add_sequence can be a Signature method, and will call add_kmer (or equivalent) in all the MinHash defined for that signature. At the moment we do all this parsing during compute (for example), where for each sequence we need to iterate with the appropriate k size and so on.

Another note: Signature is a collection of MinHash at the moment, but would be pretty interesting to allow it to keep HLL/BF/CMS/HistoSketch representations of the data too.

Some examples

(things to consider on #556)

ctb commented 5 years ago

agreed with all things :)

taylorreiter commented 4 years ago

@luizirber and I recently had a conversation via slack about naming conventions for signatures. His explanation was helpful and I think it could be helpful to add it to the documentation. At the least, now it wont be lost to the ages in disappearing slack threads:

Taylor: "what’s the correct terminology for a signature that contains multiple minhashes in sourmash? like if I create a signature for each contig in a fasta, and put them all in the same signature…what is the baby sig called?"

Luiz: sooooo... I like calling signature a collection of sketch, and a sketch can be a MinHash (for now), or a draff sketch, or an HLL, or a bloom filter... that's how the Rust code is organized (part of the discussion above)

so, how I like to think about it (with examples):

it's a bit silly to have so many hierarchies when we only have one implementation for Sketch, and Index all operate over MinHash es... but part of that is also being able to quickly add new sketches or indices, and still support the core operations (search and gather, or more generally similarity and containment). That's part of the reason for https://github.com/dib-lab/sourmash/pull/556, for example

ctb commented 4 years ago

We also have a big conceptual dichotomy, where the code computing signatures behaves as if each SourmashSignature may contain multiple MinHash objects, but the code operating on signatures behaves as if each SourmashSignature has a single MinHash. Since the signature loading code enforces that, it all works ... as long as we serialize them first 😂. I think I introduced this silliness Back When and @luizirber faithfully copied it over into Rust, probably while shaking his head in dismay.

luizirber commented 4 years ago

We also have a big conceptual dichotomy, where the code computing signatures behaves as if each SourmashSignature may contain multiple MinHash objects, but the code operating on signatures behaves as if each SourmashSignature has a single MinHash. Since the signature loading code enforces that, it all works ... as long as we serialize them first joy. I think I introduced this silliness Back When and @luizirber faithfully copied it over into Rust, probably while shaking his head in dismay.

I'm well aware... But the idea of a Signature as a collection of Sketches came later anyway (during the refactoring), so don't blame yourself too much =]

On the bright side, I tried to keep the Rust codebase doing operations over Signatures instead of over Sketches. For now the first thing these operations do is extract the matching MinHash in each sig and do the regular MinHash operations, but I think it is quite doable to "hide" most of the operations (especially Containment and Similarity) inside the Signature impl in the same way.

This can be lifted to Python too, with the awesome benefit of avoiding a new copy of the MinHash in sig.minhash every single time it's called (because that's what happens nowadays :fearful:)

ctb commented 4 years ago

I like how you avoided the word "just" in that comment in favor of "quite doable."

ctb commented 4 years ago

as I dug in to SourmashSignature a bit more in #1179, I noticed that md5sum is calculated on the sig.minhash not on the SourmashSignature itself. So this is another place where we have a split between how we operate on signatures vs how we generate them.

ctb commented 3 years ago

I think this should be punted to 5.0, or later.

ctb commented 3 years ago

We also have a big conceptual dichotomy, where the code computing signatures behaves as if each SourmashSignature may contain multiple MinHash objects, but the code operating on signatures behaves as if each SourmashSignature has a single MinHash. Since the signature loading code enforces that, it all works ... as long as we serialize them first joy. I think I introduced this silliness Back When and @luizirber faithfully copied it over into Rust, probably while shaking his head in dismay.

I'm well aware... But the idea of a Signature as a collection of Sketches came later anyway (during the refactoring), so don't blame yourself too much =]

:)

Over in #1392, I'm starting to make the clear distinction that search and gather return the original SourmashSignature inserted into the Index, without any modifications done during the search. This opens the door to doing searches on more complex SourmashSignature objects (e.g. multiple sketches, multiple types of sketches) where the search uses the "best" sketch type shared between the query signature and the contents of the database, but returns the entire signature. There are conceptual challenges to work out (e.g. what does threshold mean, in the case where you might not know exactly what kind of sketches are being compared!?) but I think this is a desirable kind of flexibility.

This also makes https://github.com/dib-lab/sourmash/issues/198 even more interesting, because then you could build an SBT that lets you find matches based on the kind of sketch that was indexed, but returns a richer set of sketches (as part of the returned signature object).

ctb commented 2 years ago

in re https://github.com/sourmash-bio/sourmash/issues/2039,

@luizirber reminded me:

Another one might be HLL, it already is implemented/exposed as a sketch in Rust. Need to figure out the Python side (more than one sketch per sig? A select method for choosing sketches, like select in indices chooses sigs?)

and now I'm curious, what operations does the HLL (as implemented) currently support? definitely cardinality counting, and merge. does it support containment as a byproduct of merge?

luizirber commented 2 years ago

and now I'm curious, what operations does the HLL (as implemented) currently support? definitely cardinality counting, and merge. does it support containment as a byproduct of merge?

since we are using the estimators from https://arxiv.org/abs/1706.07290, we can do similarity and containment

Caveat: estimators don't work well with wildly different cardinalities, so genome containment in metagenome (or vice-versa) don't work well =(

ctb commented 2 years ago

since we are using the estimators from https://arxiv.org/abs/1706.07290, we can do similarity and containment

Caveat: estimators don't work well with wildly different cardinalities, so genome containment in metagenome (or vice-versa) don't work well =(

ok, thx - this makes me less enthusiastic about putting a lot of UX time into adding it, but I think it is an excellent "nth" method to add to expand internal support for more sketches, because a lot of sourmash power comes from the containment stuff.

ctb commented 2 years ago

ref https://github.com/sourmash-bio/sourmash/issues/1616