orlp / polymur-hash

The PolymurHash universal hash function.
zlib License
318 stars 7 forks source link

Polymur32? #6

Open omasanori opened 1 year ago

omasanori commented 1 year ago

I am wondering if we can construct a 32-bit hash function with 2³¹ - 1 instead of 2⁶¹ - 1. On 64-bit processors it would not be worth doing it as one will do hash ^ (hash >> 32) or just use a half of the result instead, but it might be nice on 32-bit processors? Any thoughts?

orlp commented 1 year ago

Unfortunately, the collision probability for n bytes is not 2^61 - 1, but on the order of n * 2^-60.2. Note the factor of n: the collision probability linearly degrades with the input length, this is a property of all variable-length universal hashes with fixed entropy.

This is (in my opinion) not a problem for reasonable string lengths for 2^-60.2 as a starting point, because we have so much headroom. The same can not be said for 2^31 - 1 as a starting point.

I am working on another hash function that will be much faster for large inputs by using a two-stage algorithm that feeds a fixed-length universal hash into the polynomial hash. This variant could potentially support a 2^31 - 1 collision level more reasonably, but it is still dicey.

What 32-bit processors are you thinking of using this on?

omasanori commented 1 year ago

Thank you for your kind and prompt response!

Unfortunately, the collision probability for n bytes is not 2^61 - 1, but on the order of n * 2^-60.2. Note the factor of n: the collision probability linearly degrades with the input length, this is a property of all variable-length universal hashes with fixed entropy.

This is (in my opinion) not a problem for reasonable string lengths for 2^-60.2 as a starting point, because we have so much headroom. The same can not be said for 2^31 - 1 as a starting point.

I see. So, even when the hash value is truncated to 32-bit anyway, P611 would be better than P311.

What if the length of final value is even shorter than 32-bit, i.e. use it with hash tables? While distributed key-value stores use huge indices, hash tables in programming languages' standard library typically have much less number of entries.

I am working on another hash function that will be much faster for large inputs by using a two-stage algorithm that feeds a fixed-length universal hash into the polynomial hash. This variant could potentially support a 2^31 - 1 collision level more reasonably, but it is still dicey.

Cool!

What 32-bit processors are you thinking of using this on?

My starting point was not actually 32-bit processors but Janet, a scripting language for embedding into applications (like Lua), that use 32-bit integers and thus 32-bit hash functions. However, I can imagine that Arm Cortex-M (RP2040 for instance) or 32-bit RISC-V microprocessors will benefit from such a hash function.

orlp commented 1 year ago

What if the length of final value is even shorter than 32-bit, i.e. use it with hash tables? While distributed key-value stores use huge indices, hash tables in programming languages' standard library typically have much less number of entries.

To be entirely formally correct, in general to get a universal k-bit hash out of a universal n-bit hash where k < n... you need to hash the hash!

A common such hash is Dietzfelbinger's multiply-shift: generate a random odd number a and compute a*h >> (n - k).

In practice it's super common to just take the hash mod 2^k, which is technically not valid (consider the identity function, it technically is universal hash function without any collisions, yet mod 2^k it is not universal at all). However it being so common contributed to the reason I added the murmur-style permutation at the end, which should make it fine to just take the last k bits in practice.

omasanori commented 1 year ago

What if the length of final value is even shorter than 32-bit, i.e. use it with hash tables? While distributed key-value stores use huge indices, hash tables in programming languages' standard library typically have much less number of entries.

To be entirely formally correct, in general to get a universal k-bit hash out of a universal n-bit hash where k < n... you need to hash the hash!

A common such hash is Dietzfelbinger's multiply-shift: generate a random odd number a and compute a*h >> (n - k).

I did not aware of this, thanks a lot!