Open bovee opened 6 years ago
See https://github.com/bcgsc/ntHash/issues/7 for a discussion of a Rust implementation of ntHash.
Now that I've figured out which version of ntHash I'm comparing against, I'll finish the implementation (with a proper `Iterator` trait). I can't really build a `Hasher`, because that trait is meant for traditional checksum-style hash functions (where you accumulate changes with `write` and then `finish` the hash).
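To make the `Iterator` vs. `Hasher` point concrete, here is a minimal sketch of what an iterator over k-mer hashes could look like. `std::hash::Hasher` only offers `write(&[u8])` and `finish() -> u64`, so it has no natural way to slide a window; an `Iterator` can carry the previous hash along. The struct name `KmerHashes`, the helper `seed`, and the forward-strand-only design are assumptions for illustration, not the API of any existing crate, and the seed constants are as I recall them from nthash.hpp, so check them against upstream.

```rust
/// Per-base seed values (assumed to match nthash.hpp; verify upstream).
/// Anything that is not A/C/G/T contributes 0 in this sketch.
fn seed(base: u8) -> u64 {
    match base {
        b'A' => 0x3c8b_fbb3_95c6_0474,
        b'C' => 0x3193_c185_62a0_2b4c,
        b'G' => 0x2032_3ed0_8257_2324,
        b'T' => 0x2955_49f5_4be2_4456,
        _ => 0,
    }
}

/// Hypothetical iterator yielding the forward-strand hash of every k-mer.
pub struct KmerHashes<'a> {
    seq: &'a [u8],
    k: usize,
    pos: usize, // start index of the next k-mer to emit
    hash: u64,  // hash of the most recently emitted k-mer
}

impl<'a> KmerHashes<'a> {
    pub fn new(seq: &'a [u8], k: usize) -> Self {
        KmerHashes { seq, k, pos: 0, hash: 0 }
    }
}

impl<'a> Iterator for KmerHashes<'a> {
    type Item = u64;

    fn next(&mut self) -> Option<u64> {
        if self.k == 0 || self.pos + self.k > self.seq.len() {
            return None;
        }
        if self.pos == 0 {
            // First k-mer: computed from scratch in O(k).
            self.hash = self.seq[..self.k]
                .iter()
                .fold(0u64, |acc, &b| acc.rotate_left(1) ^ seed(b));
        } else {
            // Later k-mers: O(1) update from the previous hash.
            let out = self.seq[self.pos - 1];
            let inc = self.seq[self.pos + self.k - 1];
            self.hash = self.hash.rotate_left(1)
                ^ seed(out).rotate_left(self.k as u32)
                ^ seed(inc);
        }
        self.pos += 1;
        Some(self.hash)
    }
}
```

With something like this, `KmerHashes::new(b"ACGACG", 3).collect::<Vec<u64>>()` would give four hashes, and the first and last should be equal since both windows are `ACG`.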
@bovee Have you seen the ntHashIterator class in the ntHash library? It is a C++ wrapper over ntHash for iterating over a sequence.
@luizirber and @bovee please let me know what a C++ binding for ntHash should look like. Do you mean a C++ version of ntHash instead of the C version, or would something like ntHashIterator work for you? I can also help with the Rust translation.
I think this is ready for testing: https://github.com/luizirber/nthash
@luizirber nice work! Just a quick note on the Rust implementation: would it be possible to replace both `match` expressions with a lookup array? I think `match` is similar to the C/C++ switch, which is about 10-20% slower than a lookup array according to our past experience in ABySS and other tools. It could be a 256-entry table similar to seedTab in nthash.hpp, or we could even use a compressed 16-entry version of seedTab with a few extra operations.
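For reference, a rough sketch of the two variants being compared: a `match` on the base versus a 256-entry table indexed by the byte. The names `seed_match`, `seed_lookup`, and `SEED_TAB` are made up for this example, only uppercase A/C/G/T are populated, and the constants are as I recall them from nthash.hpp (verify against upstream). Whether the table is actually faster in Rust is something to benchmark rather than assume, since the compiler may already lower the `match` to a table.

```rust
// Seed values assumed to match nthash.hpp; verify against upstream.
const SEED_A: u64 = 0x3c8b_fbb3_95c6_0474;
const SEED_C: u64 = 0x3193_c185_62a0_2b4c;
const SEED_G: u64 = 0x2032_3ed0_8257_2324;
const SEED_T: u64 = 0x2955_49f5_4be2_4456;

/// Variant 1: branch on the base with `match`.
fn seed_match(base: u8) -> u64 {
    match base {
        b'A' => SEED_A,
        b'C' => SEED_C,
        b'G' => SEED_G,
        b'T' => SEED_T,
        _ => 0, // everything else is treated like N in this sketch
    }
}

/// Build the 256-entry table at compile time; unset entries stay 0.
const fn build_seed_tab() -> [u64; 256] {
    let mut tab = [0u64; 256];
    tab[b'A' as usize] = SEED_A;
    tab[b'C' as usize] = SEED_C;
    tab[b'G' as usize] = SEED_G;
    tab[b'T' as usize] = SEED_T;
    tab
}

static SEED_TAB: [u64; 256] = build_seed_tab();

/// Variant 2: a single indexed load, no branching on the base.
/// A `u8` index can never exceed 255, so the access is always in bounds.
#[inline]
fn seed_lookup(base: u8) -> u64 {
    SEED_TAB[base as usize]
}
```

Both functions return the same value for every possible byte, so one can be swapped for the other without changing hashes.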
@bovee @luizirber How do you handle non-ACGT? Just wanted to mention there are specific functions in nthash.hpp for this purpose.
@mohamadi The current behavior is to `panic` on non-ACGTN, but in the (very preliminary) pull request I put together (https://github.com/luizirber/nthash/pull/2) using a lookup array, I treat every non-ACGT as an N (i.e. zero). I think panicking is probably a better default for most use cases though?
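One way to frame the choice, as a sketch only: a strict function that rejects anything outside ACGTN and a lenient one that maps every other byte to zero like N. The strict variant here returns a `Result` instead of panicking, which is a substitution on my part so the caller can decide whether to panic (e.g. via `unwrap`). `InvalidBase`, `seed_strict`, `hash_kmer_strict`, and `hash_kmer_lenient` are hypothetical names, not taken from either repository, and the seed constants are as I recall them from nthash.hpp.

```rust
/// Hypothetical error type reporting the offending byte and its position.
#[derive(Debug, PartialEq)]
pub struct InvalidBase {
    pub byte: u8,
    pub pos: usize,
}

/// Seeds for A/C/G/T (assumed to match nthash.hpp), 0 for N, None otherwise.
fn seed_strict(base: u8) -> Option<u64> {
    match base {
        b'A' => Some(0x3c8b_fbb3_95c6_0474),
        b'C' => Some(0x3193_c185_62a0_2b4c),
        b'G' => Some(0x2032_3ed0_8257_2324),
        b'T' => Some(0x2955_49f5_4be2_4456),
        b'N' => Some(0),
        _ => None,
    }
}

/// Strict policy: any byte outside ACGTN is an error.
pub fn hash_kmer_strict(kmer: &[u8]) -> Result<u64, InvalidBase> {
    let mut acc = 0u64;
    for (pos, &b) in kmer.iter().enumerate() {
        let s = seed_strict(b).ok_or(InvalidBase { byte: b, pos })?;
        acc = acc.rotate_left(1) ^ s;
    }
    Ok(acc)
}

/// Lenient policy: every non-ACGT byte contributes zero, like N.
pub fn hash_kmer_lenient(kmer: &[u8]) -> u64 {
    kmer.iter()
        .fold(0u64, |acc, &b| acc.rotate_left(1) ^ seed_strict(b).unwrap_or(0))
}
```

Under the lenient policy `ACXT` and `ACNT` hash identically, which is exactly the silent collision that a strict default would surface.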
(Also, it'd be appreciated if we could move the conversation over to @luizirber's ntHash repo)
Idea here.
Murmurhash is fast, but it would potentially be faster to use a hashing function (like ntHash) that doesn't require a full recomputation on each new k-mer.
Unfortunately, to use ntHash itself we'd either need C++ bindings or a translation into Rust.
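To illustrate what "no full recomputation" means, here is a small sketch of the forward-strand rolling update as I understand it from ntHash: the hash of the next k-mer is derived from the previous one with one rotate and two XORs, instead of touching all k bases again. The function names `seed`, `hash_full`, and `hash_roll` are invented for this example, the seed constants are as I recall them from nthash.hpp (check upstream), and canonical/reverse-complement hashing is left out.

```rust
/// Per-base seeds (assumed to match nthash.hpp); other bytes map to 0 here.
fn seed(base: u8) -> u64 {
    match base {
        b'A' => 0x3c8b_fbb3_95c6_0474,
        b'C' => 0x3193_c185_62a0_2b4c,
        b'G' => 0x2032_3ed0_8257_2324,
        b'T' => 0x2955_49f5_4be2_4456,
        _ => 0,
    }
}

/// Hash a k-mer from scratch, O(k): base i is rotated left by k-1-i bits.
fn hash_full(kmer: &[u8]) -> u64 {
    let k = kmer.len() as u32;
    kmer.iter()
        .enumerate()
        .fold(0u64, |acc, (i, &b)| acc ^ seed(b).rotate_left(k - 1 - i as u32))
}

/// Slide the window by one base, O(1): rotate the old hash, cancel the
/// base that left the window (`out`), and mix in the new base (`inc`).
fn hash_roll(prev: u64, k: u32, out: u8, inc: u8) -> u64 {
    prev.rotate_left(1) ^ seed(out).rotate_left(k) ^ seed(inc)
}
```

For a sequence `seq` and window size `k`, starting from `hash_full(&seq[..k])` and repeatedly calling `hash_roll(h, k as u32, seq[i - 1], seq[i + k - 1])` should reproduce `hash_full(&seq[i..i + k])` at every position `i`.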