netleibi / fastchunking

Fast text chunking algorithms for Python
Apache License 2.0
12 stars 2 forks source link

Not a Rabin Karp fingerprint #3

Open chmduquesne opened 5 years ago

chmduquesne commented 5 years ago

Hi,

As far as I understand, your code does not implement a Rabin Karp fingerprint but a polynomial rolling hash. Your source (https://github.com/lemire/rollinghashcpp) is probably wrong.

It is nonetheless a rolling hash, so there is no problem with the functional aspect of your implementation.

If you want to have a look at a real Rabin Karp fingerprint, I suggest you have a look at https://github.com/restic/chunker.

The major difference between an actual Rabin Karp fingerprint and your hash is that the Rabin Karp fingerprint is the remainder of a polynomial division over GF(2) (see https://en.wikipedia.org/wiki/Rabin_fingerprint and the original linked paper). In other terms, the hashed input is considered as a polynom of bits, not a polynom of bytes. You are encouraged to look at the original paper to double check what I say!

Cheers

netleibi commented 5 years ago

Thank you for your detailed report! I'll look into that and double-check the terminology as soon as I have some spare time.