sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

investigate new sketch type, `ArgMinHash`-based FracMinHash #2039

Open ctb opened 2 years ago

ctb commented 2 years ago

per https://github.com/sourmash-bio/sourmash/issues/267#issuecomment-1117633898

@dkoslicki says:

In the FracMinHash sketches, does sourmash save the k-mers that led to the hash values, or just the hash values? If the former, then you could do something like this in order to compute multiple sketch sizes. Not trying to sound like a broken record, just curious :) Though this may not be an issue after the rust-ification of sourmash given the ridiculous speed.

and me:

  • compute a "founder" FracMinHash sketch for k=51, storing the k-mer sequences as well as the hash values.
  • then, descendant sketches (no longer FracMinHash bottom sketches, but still comparable as long as they share the same founder ksize) can be computed for any k < 51 - see the sketch below.
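
A rough illustration of the idea in plain Python - this is not sourmash's implementation; the hash function below is just a stand-in for the 64-bit hash sourmash actually uses, canonicalization of forward/reverse-complement k-mers is omitted, and all names are made up:

```python
import hashlib

MAX_HASH = 2**64


def kmer_hash(kmer):
    """Stand-in 64-bit hash; sourmash itself hashes canonical k-mers with MurmurHash."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def founder_fracminhash(seq, ksize=51, scaled=1000):
    """Build a FracMinHash sketch at `ksize`, keeping the k-mer sequences
    themselves (hash -> k-mer) rather than just the hash values."""
    threshold = MAX_HASH // scaled
    sketch = {}
    for i in range(len(seq) - ksize + 1):
        kmer = seq[i:i + ksize]
        h = kmer_hash(kmer)
        if h < threshold:  # FracMinHash keeps hashes below max_hash / scaled
            sketch[h] = kmer
    return sketch
```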

I've been looking for an alternative to MinHash and FracMinHash that would let us generalize the sourmash code base while offering similar time/space/operations.

The key thing this would offer over FracMinHash is the ability to sketch at one k-mer size K and then compare at any ksize < K. This would result in pretty significant space savings when distributing multiple databases.

You could imagine that something like this could also work for creating multiple molecule types from one original sketch - store "founder" DNA or protein k-mers and calculate derivative sketches for shorter k-mer sizes on demand.

Would we call this ArgFracHash or something? naming naming naming...

dkoslicki commented 2 years ago

ArgFracHash is a great name! ArgMinHash is the name I went for when applying it to plain ol' MinHash, so it would at least be consistent.

ctb commented 2 years ago

thinking a bit more -

Then, in order to downsample to any ksize, we would downsample the founding k-mers and recalculate the hashes at that ksize.
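
Continuing the hypothetical snippet above, downsampling might look something like this (again, not sourmash code; `founder` and `kmer_hash` are the made-up names from the earlier sketch):

```python
def derive_sketch(founder, new_ksize):
    """Derive a "descendant" sketch at a smaller ksize by truncating each
    stored founding k-mer and re-hashing it. The result is no longer a
    FracMinHash bottom sketch of the original sequence, but two descendants
    are comparable if they come from founders with the same ksize and scaled."""
    derived = set()
    for kmer in founder.values():
        prefix = kmer[:new_ksize]        # truncate the founding k-mer
        derived.add(kmer_hash(prefix))   # recalculate the hash at new_ksize
    return derived
```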

ctb commented 2 years ago

ref https://github.com/sourmash-bio/sourmash/issues/616

mr-eyes commented 2 years ago

Hi @dkoslicki,

I have read the manuscript, and it's very nice to be able to compress and query multiple k-mer sizes in that way. I have a question - please correct me if I'm misunderstanding something.

Given that CMash uses a KTST, which operates in O(log n) to O(n) time for search and insertion, how is CMash expected to perform (in time and space) on large-scale data? In other words, I want to query many large SRA samples and construct a database of large genomes/metagenomes with a small scaled value.

dkoslicki commented 2 years ago

The ternary search tree was just a convenient way for us to efficiently do prefix lookups, so it's not vital to the idea of truncating larger k-mers to get a sketch with smaller k-mer sizes. Nonetheless, since we are aiming to classify bacterial genomes in a metagenomic sample (so: hash the database of genomes, stream the metagenome through) and the (Frac)MinHash sketches of bacterial genomes are relatively small, we've found in practice that it's relatively performant. If I recall correctly, I built a KTST database with something like 100K bacterial genomes and the database ended up being on the order of 10GB. The query performance is ok (given that we did zero optimization and it's all in Python), but highly parallelizable, so we took the approach of just throwing more cores at it.

So it really depends on your use case: if you are looking to do all pairwise containment estimations between a collection of metagenomic samples for a range of k-mer sizes, the TST wouldn't be of much use. If you want to query against a (relatively static) collection of sketches, then a TST is a handy way to compress it in a way that still allows prefix searches.
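
To make the prefix-lookup idea concrete, here's a toy stand-in for the role the TST plays - a sorted list of stored k-mers plus binary search. This is not CMash's actual code; any prefix-searchable structure (TST, trie, etc.) would serve the same purpose:

```python
from bisect import bisect_left, bisect_right


def has_prefix(sorted_kmers, prefix):
    """True if any stored k-mer starts with `prefix` (a truncated query k-mer)."""
    lo = bisect_left(sorted_kmers, prefix)
    hi = bisect_right(sorted_kmers, prefix + "~")  # '~' sorts after A/C/G/T
    return hi > lo


kmers = sorted(["ACGTACGTACGT", "ACGTTTTTCCCC", "GGGGCCCCAAAA"])
print(has_prefix(kmers, "ACGTT"))   # True
print(has_prefix(kmers, "TTTT"))    # False
```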

Do you mind saying a few more words about your envisioned use case? I.e. will your database have both genomes and metagenomes in it? Will your goal be to find entries in your database that make up your sample, or are you doing lookups (I have an unlabeled genome, find me its label by comparing to database entries)? etc.

ctb commented 2 years ago

> Do you mind saying a few more words about your envisioned use case? I.e. will your database have both genomes and metagenomes in it? Will your goal be to find entries in your database that make up your sample, or are you doing lookups (I have an unlabeled genome, find me its label by comparing to database entries)? etc.

The single most obvious use case I see is distributing/storing just one zip file for all ksizes, for GTDB or Genbank. It wouldn't help with our indexed search collections and many of our other use cases, but the fast database search that we're working towards (tl;dr more cores) would be able to make use of it, and the single-core version is already pretty fast per https://github.com/sourmash-bio/sourmash/issues/1958.

mr-eyes commented 2 years ago

> The ternary search tree was just a convenient way for us to efficiently do prefix lookups, so it's not vital to the idea of truncating larger k-mers to get a sketch with smaller k-mer sizes. Nonetheless, since we are aiming to classify bacterial genomes in a metagenomic sample (so: hash the database of genomes, stream the metagenome through) and the (Frac)MinHash sketches of bacterial genomes are relatively small, we've found in practice that it's relatively performant. If I recall correctly, I built a KTST database with something like 100K bacterial genomes and the database ended up being on the order of 10GB. The query performance is ok (given that we did zero optimization and it's all in Python), but highly parallelizable, so we took the approach of just throwing more cores at it.

Thanks @dkoslicki for the detailed explanation.

> The single most obvious use case I see is distributing/storing just one zip file for all ksizes, for GTDB or Genbank. It wouldn't help with our indexed search collections and many of our other use cases, but the fast database search that we're working towards (tl;dr more cores) would be able to make use of it, and the single-core version is already pretty fast per #1958.

This is specifically the use case I have in mind: having Genbank/GTDB as ArgFracHash sketches and being able to query them at multiple k-mer sizes (as in sourmash prefetch/gather).

dkoslicki commented 2 years ago

> The single most obvious use case I see is distributing/storing just one zip file for all ksizes, for GTDB or Genbank. It wouldn't help with our indexed search collections and many of our other use cases, but the fast database search that we're working towards (tl;dr more cores) would be able to make use of it, and the single-core version is already pretty fast per https://github.com/sourmash-bio/sourmash/issues/1958.

Agreed. And side note: I chose a TST merely since it was convenient at the time (benchmarked some prefix-query-capable data structures that were available on GitHub with Python bindings, and just picked one). I'm sure there is a more efficient way to both compress and allow for prefix searches, if someone had motivation to think about it for a bit (or code it up efficiently).

> This is specifically the use case I have in mind: having Genbank/GTDB as ArgFracHash sketches and being able to query them at multiple k-mer sizes (as in sourmash prefetch/gather).

Perfect, then a TST will be your friend!

ctb commented 2 years ago

I am probably missing something, but I was thinking of dynamically generating ksize-specific sketches for each search, akin to our current downsampling approach. I don't see a role for TSTs in that.

(probably at this point I should just code up the thing I'm thinking of in Python.)

Another use case that is blindingly obvious (assuming we can get good space savings) is SRA-scale metagenome sketching and storage. Right now we store 3x sketches for k=21, k=31, and k=51, for every SRA metagenome; and to search at a different ksize, we need to revisit the raw data. If we could sketch at k=51 and get not only DNA but translated amino acid sketches out of it, that would be very enabling.
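
As a very rough sketch of what that could look like - hypothetical, reusing the made-up `founder`/`kmer_hash` names from above, translating only the forward frame (a real implementation would have to handle reading frames and stop codons properly), and using Biopython just for the codon translation:

```python
from Bio.Seq import Seq  # Biopython, used here only for codon translation


def derive_protein_sketch(founder):
    """Derive a protein sketch from the stored DNA founder k-mers by
    translating each 51-mer (forward frame only) and re-hashing the peptide."""
    protein_sketch = set()
    for kmer in founder.values():             # 51 nt k-mers from the founder
        peptide = str(Seq(kmer).translate())  # 51 nt -> 17 aa, forward frame
        protein_sketch.add(kmer_hash(peptide))
    return protein_sketch
```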

dkoslicki commented 2 years ago

@ctb the TST is just a convenient way to store a collection of sketches (or a sketch with a low scaled value/tons of k-mers), so it would indeed let you calculate different k sizes on the fly. If you have a single sketch, TSTs aren't of much use, but they do give a space/speed improvement when you have a bunch of them that you want to search against. Likely we're thinking along the same lines, just talking about it in different ways :)