lemire / rollinghashcpp

Rolling Hash C++ Library
186 stars 33 forks source link

Make cyclic hash seedable, add new example/test #8

Closed standage closed 7 years ago

standage commented 7 years ago

I did a cursory evaluation of this rolling hash implementation for khmer back in May. Thank you for your comments at the time! After focusing on other things in the mean time, I'm returning to see if we can get this code to do what we need.

One of the issues was seeding. The random number generator has always been seedable, but this wasn't well exposed throughout the API. This pull request addresses this concern, although I'd welcome feedback/critique on the proposed solution.

Another issue is application-specific and has to do with the nature of DNA. Even though we usually represent DNA as a string of characters (such as GATTACA), this is really only half the story. DNA is double stranded with A pairing to T and C pairing to G, so the string GATTACA really represents the following molecule.

gattaca
|||||||
ɔʇɐɐʇƃʇ

In most contexts, we have no way of knowing whether the original piece of DNA sampled was from the top strand or the bottom strand, and so when we hash DNA sequences we typically want the two complementary sequences to hash to the same value.

This PR also adds a new example (example5.cpp) where I've tried to demonstrate the desired behavior. I used two cyclic hashes: one for the "top" strand of DNA (observed from the provided string, updated using forward updates) and one for the "bottom" strand (inferred from the provided string, updated using reverse updates). Then to get the hash for a particular k-mer (n-gram) in the DNA, I just XOR the current forward and reverse hashes.

Unfortunately, it doesn't seem to be working as expected. I don't know if this is because I made a mistake in the code, or if there's something I'm still not understanding. Any comments you might have would be welcome.

P.S. Apologies if the contributors to this code are well-versed in bioinformatics. I figured it would be safer to assume otherwise and to describe the problem in layman's terms. Thanks!

lemire commented 7 years ago

Excellent. I'll review and comment soon.

lemire commented 7 years ago

Ok. I'm back.

lemire commented 7 years ago

Ok. The main code looks sound. I have not looked at the example though. So I'll merge and review.

lemire commented 7 years ago

I now understand what you are trying to do and I, at a glance, expect it to work.

lemire commented 7 years ago

Please see fixed example. I think you had a parameter backward and an off-by-one error. Instead of merely fixing your example, I rewrote it so that it is more concise and general. You can now process arbitrarily large strings.

lemire commented 7 years ago

How do you write upside down? I know it is unicode, but do you have some kind of tool to do that?

standage commented 7 years ago

How do you write upside down? I know it is unicode, but do you have some kind of tool to do that?

I just followed one of the first couple of links on Google. The lower-case letters looked better than their upper-case counterparts, which is more typical for DNA. :-)

standage commented 7 years ago

Thanks @lemire! The updated example looks good, and yes you were correct about the fencepost error and swapped parameters on the reverse update. I need to revisit this in khmer when I return from my current travel.