orlp / polymur-hash

The PolymurHash universal hash function.
zlib License
318 stars 7 forks source link

Better hash value for zero-length input #3

Closed oertl closed 1 year ago

oertl commented 1 year ago

From a practical point of view, it would be favorable if the hash function would return a better hash value even if the input has length zero. Ideally, the returned value would depend on the seed. A common use case is, for example, hashing strings, and it would be nice if no special handling was required for empty strings.

orlp commented 1 year ago

From a practical point of view, it would be favorable if the hash function would return a better hash value even if the input has length zero.

Better in what way? Any particular output is as good as any for a single input.

Ideally, the returned value would depend on the seed.

The problem is that it would require a fair few CPU cycles to mix the seed in such a way that we don't expose information about our secret seed to an attacker. And I think it's much more favorable to have an incredibly fast hash function for the empty string, as it is such a common input in general.

Note that PolymurHash makes no hard guarantees about the collision probabilities between hashes with different seeds. So Pr[H("", secret1) == H("", secret2)] = 1 does not violate anything. So ultimately I don't really see the value of having the empty string hash depend on the seed.

A common use case is, for example, hashing strings, and it would be nice if no special handling was required for empty strings.

Well, we still always need a zero-length check regardless, to prevent undefined behavior. We're not allowed to read any bytes at all when passed an empty string.

oertl commented 1 year ago

Better in what way? Any particular output is as good as any for a single input.

An attacker could use the information that empty inputs always map to 0. And there is no way to directly change this behavior as the seed is not incorporated. Furthermore, having 0 as a likely output is problematic when using the hash value for probabilistic algorithms like HyperLogLog or MinHash. Other hash functions like wyhash or komihash return a value that depends on the seed.

The problem is that it would require a fair few CPU cycles to mix the seed in such a way that we don't expose information about our secret seed to an attacker.

The result of zero length input could be precomputed in the initialization routine. So there is no big overhead of returning that precomputed value instead of 0.

Well, we still always need a zero-length check regardless, to prevent undefined behavior. We're not allowed to read any bytes at all when passed an empty string.

Yes, but it would be nice if the user of PolymurHash does not have to add an additional branch around the hash function call to handle empty strings differently.

orlp commented 1 year ago

An attacker could use the information that empty inputs always map to 0.

I fail to see how.

The result of zero length input could be precomputed in the initialization routine. So there is no big overhead of returning that precomputed value instead of 0.

This precomputed value still needs to be stored somewhere.

Furthermore, having 0 as a likely output is problematic when using the hash value for probabilistic algorithms like HyperLogLog or MinHash.

Ok, suppose the empty string maps to x instead. Now x is likely. What now?

Yes, but it would be nice if the user of PolymurHash does not have to add an additional branch around the hash function call to handle empty strings differently.

I don't see why that would be necessary.

orlp commented 1 year ago

Actually, I can see that 0 being returned consistently for the empty hash can be problematic if trying to use a family of PolymurHash instantiations as a MinHash set similarity approximator. I'll have to reconsider based on that knowledge - I was strictly viewing the problem from the perspective of a hash table / collision resistance, where it does not matter.

oertl commented 1 year ago

Also, most HyperLogLog implementations consider the number of leading zeros of the hash, which would always be extreme for empty strings. Furthermore, it is quite popular (see for example the Guava library) to combine element hashes of sets using the xor- or the add-operation to get an order-independent hash value of the set. In both cases, a hash value of 0 would increase the chance of hash collisions. For example, the sets {"foo", "bar"} and {"foo", "bar", ""} would have the same hash value.

orlp commented 1 year ago

I think I will change the default behavior to include the zero-length string in the polynomial-style handling, for the principle of least surprise.

People that want optimal performance for hash tables when expecting many empty strings can precompute the value (or return 0) themselves and add a branch for it.

Unfortunately this is a breaking change in the output so I'll have to already bump the version to 2. Oh well.

orlp commented 1 year ago

Fixed in version 2.0, https://github.com/orlp/polymur-hash/commit/c6cc6884459560443e696604e9db3b6bb61a9bfa.