sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
471 stars 80 forks source link

What about using different hash functions (reversible, rolling, etc.) with sourmash? #1659

Open mr-eyes opened 3 years ago

mr-eyes commented 3 years ago

I have a simple/naive thought re reverse-hashing the kmers. We can provide an option to Minhash reversible = True to use a reversible hashing function like the Integer Hashing to retrieve back canonical kmers from sigs/hashes.

Some Pros & cons from my perspective:

ctb commented 3 years ago

thanks for posting this, @mr-eyes :)

So there are various bits that are easier and harder to do here, but it's a great idea to have it be in the code base. As a side note we did explore this a fair bit in khmer, and it is mentione in various issues (but not cleanly in one single issue - so you'd have to go through the whole issue tracker).

sourmash would also benefit from a looser tie between SourmashSignature and the MinHash classes; see https://github.com/sourmash-bio/sourmash/issues/1647 for some discussions here. Using a different hash function would fit entirely within the current structure, tho, so this is a bit tangential. Still it would be cool to expand our support for having multiple sketch types per input sequence!

But the biggest blocker is that simply that we'd have to reprocess all our databases, including the entire SRA... which is a lot...

luizirber commented 3 years ago

The cheat here is that we don't care how the hash is generated, as long as it is an u64. For example, ansible uses xxHash, and supporting nthash is easy too (since I wrote an nthash crate for draff). So a reversible hash function would just work (as long as it generates u64 hashes).

As @ctb mentioned, the discussion in https://github.com/sourmash-bio/sourmash/issues/751 is the best entry point for supporting comparisons based only on the hash_function. In core it is somewhat hardcoded, but an alternative would be to support custom hash_function by string-typing it (sort of like what happens on the Python API):

pub enum HashFunctions {
    murmur64_DNA,
    murmur64_protein,
    murmur64_dayhoff,
    murmur64_hp,
    Custom(String),
}

Main benefit here is allowing a escape hatch for more experimentation (because you can define a new hash function and use it across all the sourmash operations)

mr-eyes commented 3 years ago

@ctb @luizirber thanks for the thoughtful comments. I will summarize some points here that I will need to research.


Thoughts about design


P.S. I don't have a motivation or a specific usecase for implementing the reverse-hashing or re-generating the databases that will require full exposure of this functionality into the main sourmash version, but, it seems to be a demanded feature though 😄

ctb commented 3 years ago

Performance improvement ended up being decent, 40-50% speedup per https://github.com/dib-lab/khmer/pull/1792#pullrequestreview-74255466, no?

(I'm usually more concerned about API additions/extensions than implementation, since adding complexity to a new API can dramatically increase the maintenance burden. In comparison, I don't care if it takes 2x as long to analyze data. Strategies for performance improvement, e.g. "use Rust", are good, but worrying too much about the performance of a specific implementation usually provides short-term gains that inhibit long term growth.)

it seems to be a demanded feature though 😄

If the main motivation the ability to get k-mers back from hashes, that certainly is a frequently requested feature. I'm not convinced that reversible hashing is the best solution, given the other limitations, tho. It seems like #1653 is a pretty good interim solution and I would like to see that sooner rather than later ;).

If you just want to explore the concept generally, good on ya! But I suggest we wait until we have use cases that demand this functionality before distracting ourselves (or, you intentionally want to distract yourself with it for funsies :).

mr-eyes commented 8 months ago

Another really random thought here (maybe as a plugin idea): with a limitation to k-mer size of 32, we can store the 2-bit representation (integer) in signatures, but all decisions will be kept the same using murmur and s=42. This type of signature can be used with sourmash (but will need to hash the 2-bit repr to compare with other murmur-based sigs, and still be reversible to k-mer string.

ctb commented 8 months ago

hah, khmer use(s/d) two-bit representations!

I don't like the k-mer size limitation of 32 tho.

ctb commented 8 months ago

FYI this paper: The K-mer File Format: a standardized and compact disk representation of sets of k-mers