Spell out rationale for recommending cp32 over rrs1

zenhack commented 3 years ago

Per @bobg's comment in #24, we should spell out why cp32 is the recommended hash function.

chmduquesne commented 2 years ago

Hi,

Author of https://github.com/chmduquesne/rollinghash here. I am not very active on the project nowadays, but I am still trying to keep it in good shape.

I did tons of benchmark experiments back int the days, one of them being https://github.com/chmduquesne/rollinghash/blob/master/roll/main.go, where I wanted to see at which frequency of the n lower bits of a checksum are all 1 (the approach taken by bup).

I agree with the recommendation to always stick to buzhash{32,64} unless you know what you are doing.

One important thing about what you call rrs is that some bits of the checksum can actually never be 1. I first noticed that experimentally, then I confirmed it mathematically.

rrs is actually an adler32 checksum that has been "reversed": the first part A and the second part B have been swapped. The swapping was loosely justified by Avery Pennarun as "it works better experimentally", but as far as I am concerned, if you don't swap these 2 parts, it works terribly because you can never get a '1' in the middle bits. If you swap them, it gets a tiny bit better, but it is still terrible. This is due to the small window size (as wikipedia mentions it, the checksum is weak for small messages).

Mathematically, it is easy to see why. We have:

A = 1 + D1 + D2 + ... + Dn (mod 65521)

What is the maximal value of this A? Let us ignore the modulus for now and focus on the sum. Pennarun was choosing a window size of 64 bytes. Let's assume all of those 64 bytes have a value of 256: this makes for

D1+ D2 + ... + Dn = 64 x 256 = 16384

Since 16384 < 65521, the result remains the same after the modulus. Since the checksum is the result of appending B to A but A has a maximum value of 16384, some of the middle bits can never flip to 1.

This gets better if you swap A and B, because

B = (1 + D1) + (1 + D1 + D2) + ... + (1 + D1 + D2 + ... + Dn) (mod 65521)

The sum is bigger and wraps around faster. However, B is taken modulo 65521, which also prevents some of the bits to flip to 1.

Long story short, one should never use rrs.

Experimentally, buzhash behaves much better and covers the whole space.

chmduquesne commented 2 years ago

The best way to convince yourself about my statements is experimentally. Just run a checksum through a stream of random data and see how often the n lower bits of a checksum are all 1. This is the file I linked in the previous comment.

chmduquesne commented 2 years ago

Also, I never had the opportunity to discuss about this with anyone, but I think that all these projects are making it hard for the newcomer to understand what is going on by using a criterion "all the lower bits should be 1".

It would be much easier to understand if the criterion was based on a number of leading zeros, because one could use an analogy with bitcoin: all we are doing, by sliding a window with a required number of digits equal to some number, is finding a block of data with a requirement on the checksum. The same way bitcoin does when mining blocks.

If I were to design my own tool, I would switch to such a criterion.

zenhack commented 2 years ago

Yeah, we discovered some of this (https://github.com/hashsplit/hashsplit-spec/issues/3#issuecomment-677511092). I feel like I proposed somewhere just dropping rrs, and I feel like @bobg or someone talked me out of it, but I can't remember where.

hashsplit / hashsplit-spec

Spell out rationale for recommending cp32 over rrs1 #26