tommyettinger / waterhash

Variant on Wang Yi's wyhash with 32-bit output, using at most 64-bit math
The Unlicense
25 stars 1 forks source link

Bad seed identification #2

Open wclodius2 opened 3 years ago

wclodius2 commented 3 years ago

The latest version of SMHasher includes the ability to detect and report bad seeds. Have you tested this on waterhash?

tommyettinger commented 3 years ago

Yes, I just did. I'm not entirely sure what internal secrets I should give it, so I gave all six of the large 64-bit constants, as well as all of the "magic" shift amounts for good measure (43, 53, 7, 29). Waterhash passed the current smhasher in full, and while that includes the bad seed checks, there might be a better set of numbers to give it. My working code is in https://github.com/tommyettinger/smhasher-with-junk/tree/master/seed_check , and the output of SMHasher is in seed_check/output/waterhash.txt .

wclodius2 commented 3 years ago

While it is impossible to test all values for a 64 bit seed, at least some of the tests reported by Reini Urban examine large subsets of the seeds, i.e., the lowest 2^32 values, 0x00000000XXXXXXXX, and the largest 2^32 values, 0xFFFFFFFFXXXXXXXX. Of course to evaluate that number of seeds in a reasonable time scale, the test set must be fairly small. If I am understanding some of his tests he tests whether strings of zero bytes map to a hash of 0, see https://github.com/rurban/smhasher/blob/master/doc/Murmur2C.txt.

tommyettinger commented 3 years ago

I'll see if I can replicate the type of test rurban does on wyhash. I have no idea how long it will take...

wclodius2 commented 3 years ago

I think what he does is test on keys of different length (1, 2, 4,8, 16?) all of whose elements are zero, to see if any returns a value of zero. With a hash speed of a few nanoseconds per element, I think this can take only a few minutes per 2^32 seed values.

tommyettinger commented 3 years ago

So I'm not sure, but it looks like the BadSeeds test only checks against an array of "internal secrets" in main.cpp, which is different for each hash: https://github.com/tommyettinger/smhasher-with-junk/blob/e0e07a2ebf058a779cb5b16469ed88f11c49a854/seed_check/main.cpp#L245-L253 . I'm not sure how SMHasher can find the internal secrets, or if it even is able to -- some other code specific to a hash may be finding bad seeds. The BadSeeds test seems to exist mostly to check for seriously bad seeds like "all 0", which it can check for any hash but only flawed ones should fail. It can also verify that a hash fails against known bad seeds. I might ask rurban how this works, or if there's documentation somewhere.

tommyettinger commented 2 years ago

Thinking about this some more, WaterHash has a structure, "mum-like," that should mean it loses some data occasionally (because it uses a multiplication that can have 0 for one term or both). I don't know what those seeds are, but they're probably out there. The story may be different (may) for the 64-bit hash here (that avoids 128-bit math), WootHash. It's a slightly different structure from wyhash, because WootHash's mum function incorporates both arguments into both multiplication terms, while wyhash, WaterHash, and WheatHash just multiply the terms (without changing them). Using both arguments might be enough to make bad seeds less common, but it probably doesn't eliminate them entirely. Another possibility is that there are bad seeds, but they aren't bad enough to be noticed by SMHasher's tests; the usage of both args on both sides may mitigate the problem enough. The only reason I'm considering the latter is that...

I'll post an issue on rurban/smhasher asking if there is documentation for how to test billions of seeds per run for quality; the code as far as I can see doesn't have a facility for evaluating many seeds in a run, just for verifying against a few known bad seeds.

wclodius2 commented 2 years ago

Thanks!