skeeto / hash-prospector

Automated integer hash function discovery

The "theoretical bias limit (bias = ~0.021)" for 32-bit hash functions isn't correct. #12

Open camel-cdr opened 3 years ago

camel-cdr commented 3 years ago

Introduction

I've been doing a lot of testing with the bias calculation method used in this repo lately, and I wanted to calculate the best possible bias (the bias that would occur if flipping any one input bit changed every output bit with a probability of 50%).

Your blog post mentions the value bias = ~0.021 in the context of 32-bit hashes, but since I'm testing other power-of-two sizes, I needed to calculate the bias myself.

I came to the conclusion that the optimal bias for 32-bit hashes is ~0.006 and that, measured with the modified bias calculation described below, triple32 has a bias of ~0.0083.

I validated this result by implementing a hash function that returns a random number independent of the input value, and the results confirm my calculations:

original method:
    calculated: ~0.021
    tested:      0.015610715938394663
my method:
    calculated: 0.006087376104758189
    tested:     0.006108424031481263
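A minimal sketch of that validation setup (not the exact code used): a "hash" that ignores its input and returns a fresh pseudo-random value on each call, so any bias the estimator reports is purely the sampling-noise floor. The xorshift32 generator here is just a stand-in for any decent PRNG.

#include <stdint.h>

/* Small PRNG; any decent generator works for this purpose. */
static uint32_t rng_state = 0x9e3779b9;

static uint32_t xorshift32(void)
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return rng_state = x;
}

/* "Hash" that ignores its input entirely: every call returns an independent
 * random value, so the measured bias is the theoretical noise floor. */
uint32_t hash32_random(uint32_t x)
{
    (void)x;
    return xorshift32();
}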

How I calculated the bias

I was initially skeptical of the normalization method because of the "FIXME" comment in the implementation of estimate_bias32, so I took the liberty of changing how the bias is calculated.

The theoretical bias limit is the mean of the probability distribution of the measured bias. Each counter follows a binomial distribution: for each sample, either one or zero is added with 50% probability, so the mean is n/2. To make things easier we center the distribution by subtracting the mean (n/2) and divide by n to normalize the bias. Then we take the absolute value, since we are interested in the average error and need to get rid of the sign to calculate it.

Now we have a half-normal distribution (a normal distribution where the absolute value of the result is taken), whose mean is sigma*sqrt(2/pi), where sigma is the standard deviation of the original normal distribution before the absolute value is taken. That standard deviation is sqrt(n*0.5*0.5), since we are dealing with a binomial distribution with p=0.5 and q=0.5.

Here is the complete calculation of the theoretical bias limit, using Maxima:

n: 2**32;
stddev: sqrt(n*(0.5)*(0.5));              /* standard deviation of the binomial */
halfNormalMean: stddev*sqrt(2/%pi);       /* mean after taking the absolute value */
biasLimit: halfNormalMean/n*1000, numer;  /* normalize by n and scale by 1000 */
print(biasLimit);
-> 0.006087376104758189

This changes how the code calculates the mean a bit. We divide by 4294967296.0 instead of 2147483648.0, although this is just a cosmetic change. We also take the absolute value of each difference instead of squaring it and taking the square root at the end. You'd assume this gives the same result, since abs(a) == sqrt(a*a), but the square root doesn't distribute over division or sums: sqrt((a*a)/b) != sqrt(a*a)/b and sqrt(a1*a1 + a2*a2) != sqrt(a1*a1) + sqrt(a2*a2) (for example, sqrt(3*3 + 4*4) = 5 while 3 + 4 = 7), so this change is vital.

https://github.com/skeeto/hash-prospector/blob/ea7ee998543531d6337fcc01f022ea18f4599c83/prospector.c#L659-L666

double mean = 0.0;
for (int j = 0; j < 32; j++) {
    for (int k = 0; k < 32; k++) {
        /* center on the expected count n/2 = 2^31, normalize by n = 2^32 */
        double diff = (bins[j][k] - 2147483648L) / 4294967296.0;
        /* absolute value of each difference instead of squaring */
        mean += fabs(diff);
    }
}
return mean * 1000.0 / (32 * 32);
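For context, here is a minimal self-contained sketch of how the whole measurement fits together with this normalization: the exact variant that walks all 2^32 inputs and, for every input bit j and output bit k, counts how often flipping bit j flips bit k. The hash32 parameter is a placeholder for the function under test; this illustrates the method rather than reproducing the actual prospector code.

#include <math.h>
#include <stdint.h>

/* Exact avalanche bias of a 32-bit hash: for every input x and every input
 * bit j, count how often flipping bit j flips each output bit k, then apply
 * the normalization above. Roughly 2^37 hash calls, so this is very slow. */
double exact_bias32(uint32_t (*hash32)(uint32_t))
{
    uint64_t bins[32][32] = {{0}}; /* flip counts, expected value 2^31 each */
    for (uint64_t i = 0; i < (UINT64_C(1) << 32); i++) {
        uint32_t x  = (uint32_t)i;
        uint32_t h0 = hash32(x);
        for (int j = 0; j < 32; j++) {
            uint32_t flips = hash32(x ^ (UINT32_C(1) << j)) ^ h0;
            for (int k = 0; k < 32; k++)
                bins[j][k] += (flips >> k) & 1;
        }
    }
    double mean = 0.0;
    for (int j = 0; j < 32; j++) {
        for (int k = 0; k < 32; k++) {
            double diff = (bins[j][k] - 2147483648.0) / 4294967296.0;
            mean += fabs(diff);
        }
    }
    return mean * 1000.0 / (32 * 32);
}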
skeeto commented 3 years ago

My bias values have no absolute meaning or units. They are really just a relative measure for comparing hash quality during an automated search. I could never get those values normalized properly, probably because of the errors you've pointed out here.

My comment about "theoretical bias limit" is hand-waving since I didn't really know what to call it. I wanted to find a lower bound on bias since it was quickly obvious it wasn't going to be zero. So I set up a giant (16GB), 32-bit S-box, shuffled it thoroughly (/dev/urandom), then used it as an integer hash function in an "exact" bias measurement. This is the ideal case: a truly random permutation with no mathematical relationship between input and output, incompressible, and no hash function I could devise could possibly beat it.
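In outline, the setup was roughly the following sketch (the original code wasn't kept, so this is a reconstruction of the idea rather than the actual code): fill a 2^32-entry table with the identity permutation, Fisher-Yates shuffle it from /dev/urandom, and use a table lookup as the hash.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define N (UINT64_C(1) << 32)   /* 2^32 entries x 4 bytes = 16 GiB */

static uint32_t *sbox;

/* Build a uniformly random 32-bit permutation: identity table, then a
 * Fisher-Yates shuffle driven by /dev/urandom (extremely slow, as noted). */
static void sbox_init(void)
{
    sbox = malloc(N * sizeof(*sbox));
    if (!sbox) abort();
    for (uint64_t i = 0; i < N; i++)
        sbox[i] = (uint32_t)i;

    FILE *urandom = fopen("/dev/urandom", "rb");
    if (!urandom) abort();
    for (uint64_t i = N - 1; i > 0; i--) {
        uint64_t r;
        if (fread(&r, sizeof(r), 1, urandom) != 1) abort();
        uint64_t j = r % (i + 1);  /* modulo bias is negligible at 64 bits */
        uint32_t t = sbox[i]; sbox[i] = sbox[j]; sbox[j] = t;
    }
    fclose(urandom);
}

/* The "ideal" hash: a structureless random permutation. */
static uint32_t sbox_hash(uint32_t x)
{
    return sbox[x];
}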

I ran this a few times (it's very slow!), and they all ended up around that ~0.021 bias (by my flawed measure). This was the lower bound, the limit. Since my best 3-round xorshift-multiply hashes hit this limit, there's no reason to try 4-round hashes. They're just as biased as an ideal permutation, so there's no room for improvement.

Your idea of accomplishing the same by replacing the hash function with a PRNG is interesting, though I'm skeptical that it accurately identifies the lower bound. It doesn't have the constraints of a permutation.

Since I got what I needed, I didn't keep that 32-bit S-box code. My new 16-bit prospector can test S-boxes very quickly and easily since they're so small, so I added a special mode for it. It takes only a few seconds to (roughly) find this same lower bound for 16-bit hashes:

./hp16 -S >/dev/null

Just like with 32-bit xorshift-multiply, three rounds are sufficient to hit this lower bound (hash16_xm3). Though two carefully-chosen rounds can come pretty close (hash16_xm2).

I think you're right about my math being wrong. I'm hesitant to change it at this point since I've got a bunch of data and momentum with the old measurement, and I'd at least identify the corrected measurement by a different name. Unless I'm mistaken, I don't think the correction changes the overall ranking of hash functions, i.e. produces a case where under the old measurement A is better than B but under the new measurement B is better than A. The prospector only needs relative rankings to function properly.

TheIronBorn commented 2 years ago

Unless I'm mistaken, I don't think the correction changes the overall ranking of hash functions

This isn't true.

Throwing some random uint32_t[32 * 32] inputs at the two measures quickly generates different orderings:

skeeto_a = 583.0685451747753
skeeto_b = 585.1873324547721
camel_a = 255.15427878349328
camel_b = 253.92012490920024

gist for the full input values: https://gist.github.com/TheIronBorn/62e118c615de4272ee0f5a0a8cd0b4a8
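For reference, the two scorings used for those numbers look roughly like this (the absolute-value version is the snippet quoted earlier in this issue; the squared/sqrt version is a sketch based on that description and the permalinked pre-fix code, not an exact copy):

#include <math.h>
#include <stdint.h>

/* Roughly the original scoring: squared differences, square root at the end. */
double score_original(uint64_t bins[32][32])
{
    double mean = 0.0;
    for (int j = 0; j < 32; j++)
        for (int k = 0; k < 32; k++) {
            double diff = (bins[j][k] - 2147483648.0) / 2147483648.0;
            mean += (diff * diff) / (32 * 32);
        }
    return sqrt(mean) * 1000.0;
}

/* camel-cdr's scoring: absolute differences, no square root. */
double score_absolute(uint64_t bins[32][32])
{
    double mean = 0.0;
    for (int j = 0; j < 32; j++)
        for (int k = 0; k < 32; k++)
            mean += fabs((bins[j][k] - 2147483648.0) / 4294967296.0);
    return mean * 1000.0 / (32 * 32);
}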

TheIronBorn commented 2 years ago

The best known hash (#19) is no longer a local optimum if we use this math (using the same neighborhood as in #19):

#19 best: 0.03668391400424298
[16 21f0aaad 15 f35a2d97 15] = 0.036331523915578146

It isn't a large difference, though, so the global bests found via the two methods might not differ much in quality.

TheIronBorn commented 2 years ago

Turns out my search in #19 wasn't quite exhaustive enough, and [16 21f0aaad 15 f35a2d97 15] is actually a new best known by both measures: [16 21f0aaad 15 f35a2d97 15] = 0.10734781817103507
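Spelled out as C, assuming the usual prospector convention that [16 21f0aaad 15 f35a2d97 15] denotes alternating xor-shift and multiply rounds with those shift amounts and multipliers, that would be:

#include <stdint.h>

/* [16 21f0aaad 15 f35a2d97 15], assuming the usual xorshift-multiply pattern */
uint32_t hash32(uint32_t x)
{
    x ^= x >> 16;
    x *= UINT32_C(0x21f0aaad);
    x ^= x >> 15;
    x *= UINT32_C(0xf35a2d97);
    x ^= x >> 15;
    return x;
}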

TheIronBorn commented 2 years ago

By the way, I calculated the exact theoretical bias limit for the current math with Mathematica:

bins = 32; (* 32^2 is too slow *)
vars = Array[Unique[x]&, bins];
dists = Distributed[#, BinomialDistribution[n, 1/2]] &/@ vars;
dist = TransformedDistribution[Total[((vars - n/2) / (n/2))^2], dists];
Mean[dist] / bins  (* simplifies to 1/n *)

Taking the square root and scaling by 1000, as the current code does, gives 1000 / sqrt(n), so 1000 / sqrt(2^32) = 0.0152587890625, which, as noted earlier, is lower than the 0.021 estimate. Perhaps there's still room for improvement in 3-round hashes.

It also produces this issue's limit:

TransformedDistribution[Total[Abs[(vars - n/2) / n]], dists]

And after applying Stirling's approximation for the binomial coefficient we get exactly 1/sqrt(2 * Pi * n), which with the same x1000 scaling and n = 2^32 evaluates to ~0.006087, matching the limit computed at the top of this issue.

Note that, though these limits differ only by a constant multiplier, the variances might differ by more, so I don't know how well we can translate them. (Mathematica can't handle Variance[camelDist].)

TheIronBorn commented 2 years ago

This should help normalize the bias to bin size.

I hoped it would also help normalize it to sample size, but even the z-score steadily increases with sample size.

funny-falcon commented 10 months ago

Testing with a cryptographic-quality function (SipHash) shows that 0.021 is a valid estimate of the bias limit.

See https://github.com/skeeto/hash-prospector/issues/19#issuecomment-1806775636