skeeto / hash-prospector

Automated integer hash function discovery
The Unlicense
672 stars 27 forks source link

Added a new file to evaluate hash function that are "invertible for a given power-of-two sized domain" #10

Closed camel-cdr closed 3 years ago

camel-cdr commented 3 years ago

The article "Andrew Kensler's permute()" from Andrew Helmer describes a way to use hash functions, that are "invertible for a given power-of-two sized domain", to statelessly generate pseudo random permutations.

I wanted to create and evaluate such hash functions by my self, as the article and the related paper didn't offer any 64-bit hash functions. To do this I initially ran quite a lot of PractRand tests, but good hash functions stop failing (the range of 2^10 took to long to evaluate). So I needed another way to evaluate hashes. Since I used some of the hash function generated here to create my own composite hash I thought I might as well make the bias calculation compatible with hash function that are "invertible for a given power-of-two sized domain".

To evaluate those hashes I evaluate the bias for every power-of-two mask until 2^32 (this is modifiable with the -n flag). The seed is treated as just another input. Design wise I added the -f flag, since some defects of the Kensler's original hash functions were only visible when testing 32 and not 64 bit, since the bias in the upper bits of the seed overrode the bias.

I basically wrote this for my own testing, but I though that this might be appreciated in the main repo, especially since one could expand this to add hash function generation as well. (I didn't add this into prospector.c, since this isn't supported yet)

camel-cdr commented 3 years ago

Here is a graph comparing the bias between the three new hash functions, showing that the original hash function from the paper has bias issues in the upper size ranges. The paper mentioned that the hash function shouldn't be used for ranges larger than 2^27, this test shows that one should also be careful with ranges above 2^25 and also with ranges smaller than 2^10. By combining it with additional hash functions, that were tested separately, I could elevate those defects as well as expand the range to 2^64.

out

(The graph was generated using an evaluation quality of 22)

Note that this test didn't use the -f flag, my current hash functions looks quite optimal without the flag (for generating n ranges in the range [0:n]) but further improvement could be made and tested with -f.

Here is a graph generated with an evaluation quality of 18 and -f, as you can see my proposed hash doesn't perfectly incorporate the entire seed when the range is smaller, but it is really good for ranges larger than 2^20:

out

(You can't compare 32-bit hashes with 64-bit hashes when the -f flag is enabled, so this graph doesn't include kensler.so.)

camel-cdr commented 3 years ago

:Edit: I improved my hash function by reordering how the seed bits are used. The bias is basically the same for lower powers of two and lower with the -f flag. Here camel-cdr2 is the new hash I already committed:

out

With -f: out

I played around with camel-cdr2 with PractRand before I wrote this test tool, and it behaved about the same as camel-cdr, so I discarded it at the time. This shows that it's better, and therefore I replaced the hash in camel-cdr.c with the new one. (Sorry for abusing this PR as some kind of development log...)

camel-cdr commented 3 years ago

I'll close this for now, since I've moved my work on this from this fork to my own repo. If you are interested in getting the evaluation method into the main repo, just reopen this PR.

Note that the current content of the PR isn't 100% correct, because my current hash function isn't actually invertible. I've changed this now and there will be a blog post like overview of my work in https://github.com/camel-cdr/cauldron/tree/main/tools/random/permute in the close future.