skeeto / hash-prospector

Automated integer hash function discovery
The Unlicense
688 stars 27 forks source link

Example lowbias32() demonstrates periodicity biases #21

Open davezarzycki opened 2 years ago

davezarzycki commented 2 years ago

The headline example lowbias32 (with constants 0x7feb352d and 0x846ca68b) has periodicity biases. As a refresher, the periodicity of a perfect/linear function is the number of rounds that it takes before you arrive back at the starting value. For example:

uint64_t period = 0;
uint32_t start = rdrand32();
uint32_t walk = start;
do {
    period++;
    walk = lowbias32(walk);
} while (walk != start);
assert(period < (1ul << 32));
printf("lowbias32 start 0x%08X period: 0x%lX\n", start, period);

Ideally, the period for a 32-bit perfect/linear function is 2^32-1 (or 2^32 if zero as input isn't special) but for the example "lowbias32", the period is usually but not always 0xF671EECE, which is a form of bias. Even worse, there seem to be pathological inputs that have significantly lower periods. For example:

lowbias32 start 0x00000004 period: 0x94C3412
lowbias32 start 0x0000000C period: 0x94C3412
lowbias32 start 0x00000027 period: 0x94C3412
lowbias32 start 0x00000029 period: 0x94C3412
...
lowbias32 start 0x00000548 period: 0x321A6B
...
lowbias32 start 0x00000623 period: 0xA6646

etc.

TheIronBorn commented 2 years ago

Do you think any of this can be mitigated by tweaks to the bias calculation? Currently it's just root mean square relative error (with constant scaling).

Also how do the Murmur hash and XX hash finalizers compare?

davezarzycki commented 2 years ago

I doubt the existing bias calculation can fix this. In general, if you don't want periodicity bias problems, then you either need to check it via brute force or pick designs where proof is possible without brute force (like LFSRs with primitive polynomials).

TheIronBorn commented 2 years ago

The best found and Murmur3 and xxHash all seem to have similar issues.

davezarzycki commented 2 years ago

I'd wager that if you started testing multi-bit bias rather than just single-bit bias then the underlying periodicity problems will become apparent.

TheIronBorn commented 2 years ago

Is that different than the current bias calculation? https://github.com/skeeto/hash-prospector/blob/2051e59d59e6bf6c9be081e020cfb4a44e886e26/prospector.c#L589-L594

davezarzycki commented 2 years ago

Unless I'm missing something, yes. That code is just testing a single bit at each position and recording the implications. I'm suggesting more combinations: every two-bit delta, every three-bit delta, etc.

Logan007 commented 2 years ago

SMhasher has a lot of quality tests of which some maybe cover it?

tommyettinger commented 6 months ago

How is this a form of bias? There are (2 to the 32) possible inputs to the self-feeding hash, and since you're measuring period, all inputs on the length-0xF671EECE subcycle will have 0xF671EECE as their period. Since that's the vast majority of the possible inputs, any other subcycles will necessarily be smaller. The stats do seem different than from the expectations for a random invertible mapping, but not by much. This would only really be a meaningful measure of bias if it reflected how these unary hashes are used, and from what I can tell, the outputs aren't meant to be used as inputs for the same hash.

davezarzycki commented 6 months ago

If a hash algorithm wants to score highly on PRNG tests (and many hash algorithms do these days) then any detectable input-to-output patterns are concerning. It implies, perhaps wrongly, that there might be weaknesses in the algorithm that can be accidentally tripped over at best or at worst exploited by an adversary.

Now you could say that being a "secure hash" is a non-goal. That's a valid and understandable tradeoff. But that still leaves the risk that there are pathological inputs that hash poorly for reasons unknown.

tommyettinger commented 5 months ago

No one said anything about a secure hash until now. Can of worms, opened. These are unary hashes; they have equivalent input domains and output ranges. What's more, the domain/range is so trivially small for a 32-bit space that to even consider security-relevant applications here is concerning. The way a unary hash would typically be used is not to feed its output in as the next input, but to run the hash on a counter (similar to CTR mode). If you look up the effective modes of operation for cryptographic hashes, feeding the entire output in as an input does not occur as a proven concept.

I'm not sure why you would want to feed a unary hash in on itself -- all unary hashes used as random invertible mappings will have subcycles, as discussed in the random invertible mapping article I linked. Some subcycles will be larger than others, and that isn't a bias for the purposes of a hash. Even an LFSR or Xorshift PRNG (both terrible if used as hashes) will have two total subcycles -- one for 0, and one for all other inputs. The only ways to go lower than 2 subcycles involve things like an additive recurrence, which is not good for a hash, an LCG, which would need extra steps to be useful as a hash and those would change its mapping properties, an XLCG (same idea as an LCG), XQO (it's mentioned on another issue here, might want to search), and various other things that are all not good as hashes on their own. Maybe there's something involving CRC32.

My point is, testing a unary hash for the period it has when misused will give you only passing results (if 1 cycle total is passing) for things that are bad hashes. For good hashes, you will only get the expected results for random invertible mappings. Please read O'Neill's article.

davezarzycki commented 5 months ago

My unstated but fundamental point (that isn't specific to this project so please PLEASE don't take this personally) that prompted me to create this GitHub issue is this: it's very common for hash algorithm designers/experimenters to view PRNG tests as a "buffet" where they pick the tests they think are fun/interesting, create a black box that "passes" said tests, and then declare "success". One can certainly do that and many projects do, but the quality of such hashes can be ripe for finding flaws/weaknesses. That's why I pointed out the trivial and curious periodicity behavior. There may be more lurking behavior for the approach this project uses but I'll leave that up to you all to investigate. Good luck!