onecodex / finch-rs

A genomic minhashing implementation in Rust
https://www.onecodex.com
MIT License
93 stars 8 forks source link

Other Hash Functions #26

Open bovee opened 6 years ago

bovee commented 6 years ago

Idea here.

Murmurhash is fast, but it would potentially be faster to use a hashing function (like ntHash ) that doesn't require a full recomputation on each new k-mer.

Unfortunately, to use ntHash itself we'd either need C++ bindings or a translation into Rust.

bovee commented 6 years ago

RepHash is another rolling hashing function worth taking a look at.

bovee commented 6 years ago

See https://github.com/bcgsc/ntHash/issues/7 for a discussion of a Rust implemention of ntHash.

luizirber commented 6 years ago

Now that I figure out what version of ntHash I'm comparing to, I'll finish the implementation (with a proper Iterator trait). I can't really build a Hasher because it is more for traditional checksum hash functions (where you accumulate changes with write and then finish the hash)

mohamadi commented 6 years ago

@bovee Have you seen ntHashIterator class in ntHash lib? This is a C++ wrapper over ntHash to iterate on a sequence.

mohamadi commented 6 years ago

@luizirber and @bovee please let me what a C++ binding for ntHash should look like. Do you mean C++ version of ntHash instead of C version? or something like ntHashIterator would work for you? I can also help with Rust translation.

luizirber commented 6 years ago

I think this is ready for testing: https://github.com/luizirber/nthash

mohamadi commented 6 years ago

@luizirber nice work! just a quick note here in the Rust implementation. Could it be possible to replace both match with a lookup array. I think match is similar to C/C++ switch which is about 10-20% slower than a lookup array according to our past experience in ABySS and other tools. It could be 256 entry-table similar to what we have as seedTab in nthash.hpp, or even we can use a compressed version of 16-entry seedTab with few extra operations.

mohamadi commented 6 years ago

@bovee @luizirber How do you handle non-ACGT? Just wanted to mention there are specific functions in nthash.hpp for this purpose.

bovee commented 6 years ago

@mohamadi The current behavior is to panic on non-ACGTN, but in the (very preliminary) pull request I put together (https://github.com/luizirber/nthash/pull/2) using a lookup array, I treat every non-ACGT as an N (i.e. zero). I think panicing is probably a better default for most use cases though?

(Also, it'd be appreciated if we could move the conversation over to @luizirber 's ntHash repo)