Make cyclic hash seedable, add new example/test

standage commented 7 years ago

I did a cursory evaluation of this rolling hash implementation for khmer back in May. Thank you for your comments at the time! After focusing on other things in the mean time, I'm returning to see if we can get this code to do what we need.

One of the issues was seeding. The random number generator has always been seedable, but this wasn't well exposed throughout the API. This pull request addresses this concern, although I'd welcome feedback/critique on the proposed solution.

Another issue is application-specific and has to do with the nature of DNA. Even though we usually represent DNA as a string of characters (such as GATTACA), this is really only half the story. DNA is double stranded with A pairing to T and C pairing to G, so the string GATTACA really represents the following molecule.

gattaca
|||||||
ɔʇɐɐʇƃʇ

In most contexts, we have no way of knowing whether the original piece of DNA sampled was from the top strand or the bottom strand, and so when we hash DNA sequences we typically want the two complementary sequences to hash to the same value.

This PR also adds a new example (example5.cpp) where I've tried to demonstrate the desired behavior. I used two cyclic hashes: one for the "top" strand of DNA (observed from the provided string, updated using forward updates) and one for the "bottom" strand (inferred from the provided string, updated using reverse updates). Then to get the hash for a particular k-mer (n-gram) in the DNA, I just XOR the current forward and reverse hashes.

Unfortunately, it doesn't seem to be working as expected. I don't know if this is because I made a mistake in the code, or if there's something I'm still not understanding. Any comments you might have would be welcome.

P.S. Apologies if the contributors to this code are well-versed in bioinformatics. I figured it would be safer to assume otherwise and to describe the problem in layman's terms. Thanks!