Closed standage closed 7 years ago
Excellent. I'll review and comment soon.
Ok. I'm back.
Ok. The main code looks sound. I have not looked at the example though. So I'll merge and review.
I now understand what you are trying to do and I, at a glance, expect it to work.
Please see fixed example. I think you had a parameter backward and an off-by-one error. Instead of merely fixing your example, I rewrote it so that it is more concise and general. You can now process arbitrarily large strings.
How do you write upside down? I know it is unicode, but do you have some kind of tool to do that?
How do you write upside down? I know it is unicode, but do you have some kind of tool to do that?
I just followed one of the first couple of links on Google. The lower-case letters looked better than their upper-case counterparts, which is more typical for DNA. :-)
Thanks @lemire! The updated example looks good, and yes you were correct about the fencepost error and swapped parameters on the reverse update. I need to revisit this in khmer when I return from my current travel.
I did a cursory evaluation of this rolling hash implementation for khmer back in May. Thank you for your comments at the time! After focusing on other things in the mean time, I'm returning to see if we can get this code to do what we need.
One of the issues was seeding. The random number generator has always been seedable, but this wasn't well exposed throughout the API. This pull request addresses this concern, although I'd welcome feedback/critique on the proposed solution.
Another issue is application-specific and has to do with the nature of DNA. Even though we usually represent DNA as a string of characters (such as
GATTACA
), this is really only half the story. DNA is double stranded withA
pairing toT
andC
pairing toG
, so the stringGATTACA
really represents the following molecule.In most contexts, we have no way of knowing whether the original piece of DNA sampled was from the top strand or the bottom strand, and so when we hash DNA sequences we typically want the two complementary sequences to hash to the same value.
This PR also adds a new example (
example5.cpp
) where I've tried to demonstrate the desired behavior. I used two cyclic hashes: one for the "top" strand of DNA (observed from the provided string, updated using forward updates) and one for the "bottom" strand (inferred from the provided string, updated using reverse updates). Then to get the hash for a particular k-mer (n-gram) in the DNA, I just XOR the current forward and reverse hashes.Unfortunately, it doesn't seem to be working as expected. I don't know if this is because I made a mistake in the code, or if there's something I'm still not understanding. Any comments you might have would be welcome.
P.S. Apologies if the contributors to this code are well-versed in bioinformatics. I figured it would be safer to assume otherwise and to describe the problem in layman's terms. Thanks!