orlp / polymur-hash

The PolymurHash universal hash function.
zlib License

Universal? #8

Closed avaneev closed 8 months ago

avaneev commented 8 months ago

Universal hashing implies an N/M collision estimate for all collision test sets, not the (N^2)/M/2 that can be seen in SMHasher tests for Polymur. Just saying. https://www.cs.cmu.edu/~avrim/451f11/lectures/lect1004.pdf

orlp commented 8 months ago

The universal hash function for arbitrary-length strings is a unicorn (that is, it doesn't exist). Any completely universal hash function must store data scaling logarithmically with the size of the input set. Since no one makes a hash function that allocates 2^64 bytes of memory to store its data, it is generally understood that any hash function which hashes strings is therefore almost-universal.

When I say that PolymurHash is "a 64-bit universal hash function" in the introduction I don't mean to claim that it holds the "Universal" property over the set of all strings or over 64-bit integers, rather that it is a hash function with 64 bits of output that has universality claims. I immediately follow it up one sentence later with the exact collision probabilities claimed, which is a valid claim of almost-universality.

orlp commented 8 months ago

For example, the Wikipedia article on the Poly1305 hash function also starts with "Poly1305 is a universal hash family". Since Poly1305 hashes strings and also doesn't allocate infinite memory, it's no surprise that its collision probability also scales with the input size.

avaneev commented 8 months ago

Well, it would be fair to call it an "almost universal hash function" then. Or most authors of "good" SMHasher hash functions could come up with proofs. "Universal hashing" is an academic brand, and it's a bit unfair to exploit it without actually intending to prove the "original" universality requirement.

orlp commented 8 months ago

Or most authors of "good" SMHasher hash functions could come up with proofs.

The proofs of what I claim are in this repository.

"Universal hashing" is an academic brand, and it's a bit unfair to exploit it not actually having an intention to prove the "original" universality requirement.

I don't exploit anything; you simply take the most unreasonably strict reading of the word "Universal", one that no real-world hash function for strings ever satisfies, and then try to apply it to a hash function over strings, (now) knowing full well that this is literally impossible to begin with.

Almost-universality isn't something I just came up with. It is just as academic. It has formal definitions. For example the PolyR hash function paper also opens with "We describe a universal hash-function family, PolyR". Guess what? Yes, that is also colloquial and also used to mean almost-universality.

orlp commented 8 months ago

Universal hashing implies an N/M collision estimate for all collision test sets, not the (N^2)/M/2 that can be seen in SMHasher tests for Polymur.

Also, this is a misunderstanding of Polymur's security claim. Polymur doesn't have a "N^2/M/2" collision probability, assuming we use N to mean the number of strings in the input space, and M to mean 2^64, like it is in the lecture notes you linked.

Polymur's collision probability bound scales with the length of the input string, not with the number of strings in your input set. This can be an exponential difference: there are 256^1024 binary strings of length 1024, yet Polymur's collision probability only goes up to 256 * 2^-60.2.

I have no idea where you got the N^2 term from, or how you can claim that it "can be seen in SMHasher tests".

orlp commented 8 months ago

Perhaps you misunderstood the claim of the lecture notes?

[screenshot of the lecture notes: the expected number of collisions of a specific string x with the other strings in the set is N/M]

Note that they count the expected number of collisions of a specific string x with all other strings to get to N/M, not the total number of collisions over an entire data set. That is also on the order of N^2/M/2 for a perfect universal hash. That is simply basic probability and has nothing to do with how good your hash function is.
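This per-pair versus whole-set distinction is easy to check empirically. A quick sketch (illustrative only, not code from this repository) models an ideal hash as independent uniform draws and compares the observed number of colliding pairs in a set against the N^2/(2M) figure:

```python
import random
from collections import Counter

random.seed(42)
M = 2**20   # number of possible hash values ("bins")
N = 20_000  # number of distinct inputs

# Model an ideal hash: every input gets an independent uniform value in [0, M).
hashes = [random.randrange(M) for _ in range(N)]

# Count the total number of colliding pairs over the whole set.
counts = Counter(hashes)
total_pairs = sum(c * (c - 1) // 2 for c in counts.values())

# The basic-probability estimate: N*(N-1)/(2M), on the order of N^2/(2M).
expected = N * (N - 1) / (2 * M)
print(f"observed colliding pairs: {total_pairs}, expected ~{expected:.1f}")
```

Even a perfectly random hash lands near the N^2/(2M) figure, so a set-wide collision count of that order says nothing negative about the hash function.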

avaneev commented 8 months ago

Note that they count the expected number of collisions of a specific string x with all other strings to get to N/M, not the total number of collisions over an entire data set. That is also on the order of N^2/M/2 for a perfect universal hash. That is simply basic probability and has nothing to do with how good your hash function is.

I would not treat N^2/M/2 as something "simple". For example, the (let's call it) "set collision estimate" of CRC32 is way lower than N^2/M/2 in many SMHasher tests, yet its collision estimate for isolated strings is likely N/M. I just do not see how the collision estimate of a specific string is useful if it's not easily translatable to the set's collision estimate. The situation you are probably overlooking is that when you add a new string to an already constructed set with N^2/M/2 total collisions, the collision estimate of the new string cannot be N/M. Either I'm misunderstanding something or "universal hashing" concepts are very poorly defined.

The PolyR paper defines a collision estimate the way cryptographic hashes do, and those are never treated as "universal" to my knowledge. With such a liberal treatment of the collision estimate, the "universality" claim is a bit moot.

avaneev commented 8 months ago

N^2/M/2 is a formula used by SMHasher to calculate expected number of collisions.

avaneev commented 8 months ago

If you look closer, it's the derivative of the set's collision estimate, which is N/M. But there's no mention of collision-estimate derivatives in "universal hashing".

orlp commented 8 months ago

Either I'm misunderstanding something or "universal hashing" concepts are very poorly defined.

I think you are misunderstanding, because they are very strictly defined. The collision probability always talks about two arbitrary but different strings chosen independently from the hash function. You only talk about more than two strings once you get into k-independence for k > 2. I make no such claims about being k-independent for k > 2.

I just do not see how collision estimate of a specific string is good, if it's not easily translatable to set's collision estimate.

It generally translates pretty well. It's based on the assumption that the hash output is random (universality claims bound how far you can be distinguished from random) and on "throwing balls into bins" probability arguments.

PolyR paper defines a collision estimate like for cryptographic hashes, which are never treated as "universal" to my knowledge.

You are conflating "cryptographically secure hash" with the study of cryptographically secure message authentication codes. The latter has always been based on universal hash function theory (PolyR, Poly1305, Galois Counter Mode, UMAC, etc are all (almost-)universal hash families).

With such liberal treatment of collision estimate, "universality" claim is a bit moot.

It is not liberal at all, it is very precise.


N^2/M/2 is a formula used by SMHasher to calculate expected number of collisions.

Expected number of collisions between ALL strings, not the expected number of collisions between ONE and all other strings. That is what the lecture notes you linked compute.

Your original statement "universal hashing implies N/M collision estimate for all collision test sets, and not (N^2)/M/2 like can be seen in SMHasher tests" is false, because that is mathematically impossible. It's also not what the linked lecture notes claim.

If you look closer, it's derivative of set's collision estimate which is N/M. But there's no mention of collision estimate's derivatives in "universal hashing".

I have no idea what you mean by that.

avaneev commented 8 months ago

In "universal hashing", N/M collision estimate already implies a set S from U. So, of course, I'm talking about more than two strings in a set. What's the purpose of talking about sets having only 2 strings each?

avaneev commented 8 months ago

By the N/M derivative I mean that if you keep adding strings to a set, yielding sets of N+1, N+2, ... strings, each addition increases the set's collision estimate by N/M, if the set's current collision estimate is N^2/M/2.
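For what it's worth, the increment described here is consistent with an N^2/(2M) total; a quick arithmetic check (illustrative numbers):

```python
N, M = 10_000, 2**20

# Expected total colliding pairs for sets of N and N+1 random hash values.
total_n  = N * (N - 1) / (2 * M)
total_n1 = (N + 1) * N / (2 * M)

# Adding one more string raises the expectation by exactly N/M.
delta = total_n1 - total_n
print(delta, N / M)
```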

orlp commented 8 months ago

In "universal hashing", N/M collision estimate already implies a set S from U. So, of course, I'm talking about more than two strings in a set. What's the purpose of talking about sets having only 2 strings each?

Basic universal hashing is only talking about pairs of strings. Look, in the very lecture notes you linked:

[screenshot of the lecture notes: the definition of a universal hash family, bounding the collision probability for any pair of strings x != y]

The universality bound is for ANY PAIR of strings x != y.

The expectation of N/M is then calculated from the bound on pairs:

[screenshot of the lecture notes: by linearity of expectation, the expected number of collisions of one string x with the other strings in the set is at most N/M]

Again, please note this is only talking about the number of collisions of ONE string x with ALL OTHER strings in the set.

What's the purpose of talking about sets having only 2 strings each?

The purpose is that it's possible to analyze and prove things about 2 strings at a time, after which you can use statistics to extrapolate that to calculate what the expected behavior is on sets of strings. Just like they did in the lecture notes you linked. They start with a universality bound on pairs of strings and then extend that to calculate expected collisions on sets of strings.

By N/M derivative I mean that if you keep adding strings to a set, yielding sets with N+1, N+2... strings, addition increases set's collision estimate by N/M after each operation, if set's current collision estimate is N^2/M/2.

Can you please give a clear definition of what you mean by "set's collision estimate"?

avaneev commented 8 months ago

"set's collision estimate" is what "estimated total number of collisions" equals to, in all SMHasher tests.

Why does the paper you've quoted talk about "linearity of expectation"? On what premises?

orlp commented 8 months ago

"set's collision estimate" is what "estimated total number of collisions" equals to, in all SMHasher tests.

Yes, universal hashing makes no direct claims about that. I never made any claims about that. Any implied bounds on that are entirely statistical, derived from the bound on pairs of strings.

In plain English, if it is very hard to find even just one pair of strings that collide, the expected number of collisions in a set of strings will also be low. Statistics makes this idea precise, but the analysis still starts with a bound on pairs.

Why does the paper you've quoted talk about "linearity of expectation"? On what premises?

https://math.stackexchange.com/questions/1810365/proof-of-linearity-for-expectation-given-random-variables-are-dependent

Also, this isn't a "paper you've quoted"; these are the lecture notes you linked in your initial post: https://www.cs.cmu.edu/~avrim/451f11/lectures/lect1004.pdf. It's your own source.

avaneev commented 8 months ago

Yes, in part this was my misunderstanding. The set's collision estimate is emergent and not really definable, while the "universality" constraints are much simpler. But I have a further question: where can I find a proof that PolyMur's permutation yields uniformly-random output for any input? Without that, the Pr(x and y collide) < 1/M bound cannot be considered proven.

I mean a proof that for all values of a set S from U, the permutation produces a set of pseudo-random values: values that look like the output of a random oracle, provided the observer does not know the original values of S.

orlp commented 8 months ago

The set's collision estimate is emergent

Exactly.

and not really definable while "universality" constraints are much simpler.

There are also concepts of universality which go beyond traditional universality: the (almost-)k-independent hash function families. These say that if you have a tuple of k elements from some universe, the probability that their hashes form any particular k-tuple is bounded.

To my knowledge we only have practical ways of constructing k-independent hash functions for relatively small k, and only for fixed-size inputs, not arbitrary-length strings when k > 2.

But I have a further question - where can I find a proof that PolyMur's permutation yields uniformly-random output for any input, because without that the Pr(x and y collide)<1/M bound cannot be considered proven.

As I said, Polymur's bound is not exactly 1/M (where M = 2^64); rather, it claims Pr(x and y collide) <= n * 2^-60.2 where n = len(x). So if you're hashing 256-byte strings the probability bound is 2^-52.2, and if you're hashing 1-megabyte strings the probability bound is 2^-40.2. Note that this doesn't necessarily mean the hash actually gets worse for all inputs when you feed it long inputs; it just means we can no longer mathematically prove the probability is low for all inputs. The bound is an upper bound on collisions, not a lower bound.
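The length scaling is just arithmetic on the exponents. A small check of the numbers above, using the bound as stated (n bytes times 2^-60.2):

```python
import math

# Polymur's claimed pairwise collision bound for n-byte inputs:
# n * 2^-60.2, i.e. 2^(log2(n) - 60.2).
def bound_log2(n_bytes):
    return math.log2(n_bytes) - 60.2

print(round(bound_log2(256), 1))    # -52.2 (256-byte strings)
print(round(bound_log2(2**20), 1))  # -40.2 (1 MiB strings)
```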

The proof for that is here: https://github.com/orlp/polymur-hash/blob/master/extras/universality-proof.md.

I mean the proof that for all values of set S from U, the permutation produces a set of pseudo-random values - values that look like a random oracle, if observer does not know original values of S.

That's a stronger property than regular universality, but luckily Polymur claims to be almost-2-independent, which is a stronger version of this anyway. Polymur claims that pairs of hash values look like independent random variables. That is, if H is chosen uniformly at random from the family, then for ANY m != m' and ANY x, y:

Pr[H(m) = x  &&  H(m') = y] <= n * 2^-124.2

The proof for this is also in the above document. You can just ignore one of the two variables and assume it collides to get a bound of n * 2^-60.2 on any particular output for a single input.
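The step from the pair bound to the single-variable bound is a sum over the 2^64 possibilities of the ignored variable. A sketch of the exponent arithmetic (not a substitute for the proof in the repository):

```python
# Almost-2-independence: Pr[H(m) = x and H(m') = y] <= n * 2^-124.2.
# Summing that bound over all 2^64 choices of the ignored variable
# (or over the diagonal x == y for the collision case) multiplies it
# by 2^64, i.e. adds 64 to the exponent.
pair_exp = -124.2
single_exp = pair_exp + 64
print(round(single_exp, 1))
```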

Note that again we prove the chance of any specific output is low, for a single input (or a single pair). A claim about "a set of pseudo-random values" where that set would have > 2 elements would again be an emergent probability from pulling multiple random numbers.

You can actually notice patterns in the output if you disable the post-processing Polymur does and generate a lot of outputs! That is why it's called Polymur and not Poly: the murmur permutation scrambles those patterns away. But that scrambling is not done for the sake of the collision probability on the full 64-bit hash: the collision bound is proven and remains valid, since the permutation is 1:1 and cannot introduce collisions.

orlp commented 8 months ago

Also, this claim is not true (emphasis mine):

But I have a further question - where can I find a proof that PolyMur's permutation yields uniformly-random output for any input, [because without that the Pr(x and y collide)<1/M bound cannot be considered proven].

Consider a 64-bit hash function with a perfect 2^-64 collision bound. I can take that hash function, set the lowest bit to 0 and use that as a new hash function. It's now perfectly valid to claim this new function has a 2^-63 collision bound. But obviously the output is not uniformly random.
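The counter-example is easy to make concrete. A hypothetical sketch, where `blake2b` merely stands in for "some keyed 64-bit hash with a good collision bound" (this is not Polymur's construction):

```python
import hashlib
import os

def h(data: bytes, key: bytes) -> int:
    # Stand-in for a keyed 64-bit hash with a ~2^-64 collision bound.
    return int.from_bytes(
        hashlib.blake2b(data, key=key, digest_size=8).digest(), "big"
    )

def h2(data: bytes, key: bytes) -> int:
    # Zero the lowest bit: this at most doubles the collision
    # probability (~2^-63), yet the output is now visibly non-uniform.
    return h(data, key) & ~1

key = os.urandom(32)
assert h2(b"example", key) % 2 == 0  # the low bit is always 0
```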

avaneev commented 8 months ago

Maybe I'm overlooking something in your proof, but I do not see how your construction leads to pairwise independence (a 50% difference in bits) of the outputs. Maybe you are referring to some "common knowledge" I have not read or can't remember at the moment? I do not argue with PolyMur's practical performance, but a proof is something more formal.

avaneev commented 8 months ago

But I have a further question - where can I find a proof that PolyMur's permutation yields uniformly-random output for any input, [because without that the Pr(x and y collide)<1/M bound cannot be considered proven].

Consider a 64-bit hash function with a perfect 2^-64 collision bound. I can take that hash function, set the lowest bit to 0 and use that as a new hash function. It's now perfectly valid to claim this new function has a 2^-63 collision bound. But obviously the output is not uniformly random.

Uniform randomness is a statistical property, and the 1/M bound is emergent from probability theory: if bin A is used (among M total bins), the probability that the next independent random number also lands in bin A is 1/M.

orlp commented 8 months ago

I'm maybe overlooking something in your proof, but I do not see how your construction leads to pair-wise independence (50% difference in bits) of outputs.

That is a different definition of pairwise independence than the one used in universal hashing. The definition for pairwise independence in universal hashing can be seen on Wikipedia:

[screenshot from Wikipedia: the definition of a pairwise-independent (strongly universal) hash family, bounding Pr[h(x) = a and h(y) = b] for any x != y]

It's also called strong universality, in our case almost-strong universality since we don't quite reach 1/m^2 but something close.

Uniform randomness is a statistical property. 1/M bound is emergent from probability theory. If bin A is used (among M total bins), the probability the next independent random number will also refer to bin A is 1/M.

Sure, but as my counter-example showed, a low chance of collision does not imply that every bit of the output looks random. In the most extreme case, consider hashing 32-bit numbers to 32-bit numbers: the identity function has 0 collisions but is obviously not random. My point is that "looks uniformly random" and "has a low chance of collisions" are not necessarily the same claim.

avaneev commented 8 months ago

Ok, understood about the pairwise independence. Can you comment on how the collision bound in the proof is useful, given that k is usually unchanging when hashing a set of values? In that condition I can only rely on the message's properties, not the key's. Referring to (14 + n/7) / |K|.

avaneev commented 8 months ago

As for the example of zeroing the lowest bit, which changes the collision expectation: the issue with it is that while the collision expectation changes and seemingly nothing breaks, an initially random function is no longer strictly uniformly-random, and so probability theory stops being applicable. Strictly speaking, you can't use the 1/M pair-wise collision estimate on a variable that was not previously shown to be uniformly-random. That's my view; maybe you do not look at things this way.

orlp commented 8 months ago

Can you comment on how the collision bound in the proof is useful, given that k is usually unchanging when hashing a set of values?

The point is that k is chosen uniformly at random, independently of the values. As long as that holds, the bound is valid. That k then looks "unchanging" doesn't matter: since we assume the key was chosen independently of the values, it makes no difference whether k was chosen just now or ages ago; the choice was independent.

This does mean it's not secure to leak the hash values to an attacker; please read this section. If an attacker can see the hash values, he can potentially construct new values that are much more likely to collide, since those new values are no longer independent of k.

orlp commented 8 months ago

Strictly speaking, you can't use 1/M pair-wise collision estimate on a variable that was not previously shown to be uniformly-random. That for myself, maybe you do not look at things this way.

Well, that's just mathematically wrong. Each input behaving as if it were uniformly random is a sufficient but not a necessary condition for collision resistance, or for using collision-resistance results in statistically valid ways.

Consider a 16-bit to 16-bit identity hash function. It is not random at all but it literally can not create collisions. This isn't a trick either, it means if your input is 16 bits and you're building a hash table with 2^16 slots you can just use the identity hash function and never get collisions or the problems that come with them. That specific kind of hash table has a useful name: an array.

avaneev commented 8 months ago

Well, collision events depend on variations in the messages or the keys. f = k^7 * (f + m[6]) implies f depends on both the message and the key at the same time. If the key is fixed in a set, then functionally only the messages affect collision events. As for myself, I cannot accept the collision bound for the case of a fixed key.

avaneev commented 8 months ago

Consider a 16-bit to 16-bit identity hash function. It is not random at all but it literally can not create collisions. This isn't a trick either, it means if your input is 16 bits and you're building a hash table with 2^16 slots you can just use the identity hash function and never get collisions or the problems that come with them. That specific kind of hash table has a useful name: an array.

I'm assuming an identity hash is non-random, so it cannot be "universal" by definition. And its collision expectation is known to be 0; no bound is needed.

orlp commented 8 months ago

As for myself, I cannot accept the collision bound for the case of a fixed key.

There is no collision bound for a fixed key at all. That's not how universal hash functions work. The bound is always assuming the hash function is picked at random from the hash function family. This is fundamental.

Again, SSL is secured using this assumption - the key is chosen randomly and independently from the values, and the attacker never gets to see the hashes directly.

avaneev commented 8 months ago

There is no collision bound for a fixed key at all. That's not how universal hash functions work. The bound is always assuming the hash function is picked at random from the hash function family. This is fundamental.

But how is this practical? When one fills a structure with values, the key is always fixed. Otherwise value lookup is just impossible.

Again, SSL is secured using this assumption - the key is chosen randomly and independently from the values, and the attacker never gets to see the hashes directly.

SSL is a stream cipher besides the KEM; this is unrelated to hashing.

orlp commented 8 months ago

But how is this practical? When one fills a structure with values, the key is always fixed. Otherwise value lookup is just impossible.

The key isn't fixed in advance. It is chosen randomly at startup, when the hash function is initialized, independently of the values.
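In code this is the standard seed-at-startup pattern. A hypothetical sketch (the names are made up, and `blake2b` is just a placeholder for any keyed hash):

```python
import hashlib
import os

# Chosen once, at process startup, independently of any values hashed later.
_PROCESS_KEY = os.urandom(16)

def table_hash(data: bytes) -> int:
    # The "random member of the family" stays fixed for the process
    # lifetime; the universality bound still applies because the choice
    # was made independently of the values.
    return int.from_bytes(
        hashlib.blake2b(data, key=_PROCESS_KEY, digest_size=8).digest(), "big"
    )
```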

SSL is a stream cipher beside KEM; this is unrelated to hashing.

SSL includes authentication of the data. Authentication in 2023 is typically done with Galois Counter Mode or Poly1305, both of which are universal-hash-based constructions.


Either way, I'm sorry, this is taking up too much of my time now. I would strongly suggest doing some more reading on what universal hashing is or how it works on a fundamental level.

The original paper that introduced universal hashing is Universal Classes of Hash Functions by Carter & Wegman. I would suggest starting there.

avaneev commented 8 months ago

No problem. However, the key is usually fixed for the application's lifetime. That is practically the same as selecting a single hash function from the family for an unlimited duration, in the case of a server application, for example. Okay, abstractly it is an independent argument to the hash function, but functionally it does not change. MACs are another story: there the key and/or nonce changes frequently. (GCM is a stream cipher's operation mode.)

avaneev commented 8 months ago

You may delete the issue if it seems unuseful or confusing to potential users.

avaneev commented 8 months ago

Another possible issue I've found in PolyMur's proof, referring to your stance on the 1/M pair-wise collision bound (that uniform randomness of hash outputs is out of the question): what the proof is lacking, I think, is a demonstration that pair-wise collisions of the outputs for m and m' are bounded by 1/M. Your proof relies on the prerequisite that keys are chosen in a uniformly-random manner, but the case when both m and m' use the same key is not covered. So a central prerequisite of "universal hashing" (Pr[x and y collide] < 1/M) is not proven for the case where x and y use the same key.

orlp commented 8 months ago

Your proof relies on pre-requisite that keys are chosen in a uniformly-random manner. But the case when both m and m' use the same key is not covered.

It absolutely is. The key is not chosen per message, the key is chosen per function. The choice of k is what determines hash function H from the family. So in the claim

Pr[H(m) = x  &&  H(m') = y] <= eps

there is only a single H, and thus a single choice of k.


Look, Alexey... I'm aware of your work on komihash. You clearly know a lot about traditional hash functions and are not a stupid man. I say this because I really need you to change your approach: from jumping on the first thing that looks wrong to you, to trying to learn instead.

Universal hash functions are a mathematical construct. To understand their claims and proofs you have to be very careful, and understand the mathematical background. This is work, and it's not intuitive. I understand it can be hard and confusing.

But claims that this and that aren't proven (especially things as basic as linearity of expectation) show you simply haven't done the homework. Please read the paper I suggested, and really try to understand the math. Another suggestion is to read High Speed Hashing for Integers and Strings and do the exercises. Many of the things you say "hey, this looks wrong" about are covered in those exercises; they force you to examine your own assumptions.

It is hard, I understand. I struggled with it too! But you have to get these fundamentals right before you can start to analyze more complex constructions.

avaneev commented 8 months ago

Well, I expected you would tell me to read stuff. Yes, my understanding of the "universality" constraints was initially incomplete; the issue I had was with the total collision estimate of a set, which I expected to be the first thing a "universal" hash has to satisfy. I also have issues with the terminology: it changes and differs between sources, but that does not mean I do not know that probabilities of two uniformly-random events sum.

However, this in no way makes my further questions "absurd". Like you, I'm interested in obtaining a proof of universality, and I'm taking your proof as a kind of basis. For example, how can you prove "linearity of expectation" for your hash if you did not prove the collision bound for the case of the same keys?

MACs are totally different in this respect: they need to prove a collision bound for independently keyed streams, not for a local data set that always uses the same key.

orlp commented 8 months ago

I never said your questions were absurd. They're very natural questions. They are however also very basic questions covered in any text on universal hash functions, and are not questions specific to Polymur. Your questions apply to any universal hash function family.

Which is why I said this is taking up too much of my time, and why I ask you to do the homework needed to cover the basics. If anything is unclear specifically about Polymur or its design, I'm happy to help. But I'm not here to give a one-on-one informal lecture on universal hashing.

avaneev commented 8 months ago

I'm not saying your proof is wrong; I'm only saying it applies to the specific case of MAC hashing, where a key is uniquely generated for each hashing operation and there is no requirement to store the hashes in a single data set.

orlp commented 8 months ago

I only say it applies to a specific case of MAC hashing, when key is uniquely generated for each hashing operation

Please read section 4, "Authenticating multiple messages", from New hash functions and their use in authentication and set equality by Wegman and Carter. It proves that for any strongly universal hash function family you can reuse the same hash function for authentication, as long as you hide the hash values from the attacker (which they do with a one-time pad).
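A minimal sketch of that Wegman-Carter idea (hypothetical names; `blake2b` is only a placeholder for a strongly universal family, which it is not literally):

```python
import hashlib
import os

key = os.urandom(32)                       # one hash-function choice, reused
pads = [os.urandom(8) for _ in range(16)]  # a fresh one-time pad per message

def H(m: bytes) -> bytes:
    # Placeholder for the strongly universal hash selected by `key`.
    return hashlib.blake2b(m, key=key, digest_size=8).digest()

def tag(i: int, m: bytes) -> bytes:
    # XOR with the i-th pad hides the raw hash value from the attacker,
    # which is what allows reusing the same key across messages.
    return bytes(a ^ b for a, b in zip(H(m), pads[i]))
```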

There are similar proofs that for any strongly universal hash function, hash table access with chaining takes O(1) expected time.

I do not say your proof is wrong - I only say it applies to

The thing is, the proven claim has a standard form: it proves almost-strong universality. Because of that, we can reuse results that rely only on that property, such as the ones above.

This is why I say your questions are not about Polymur! If your problem isn't with the proof but with its applicability you have general questions about universality, for which I already pointed you to multiple resources to learn from.

avaneev commented 8 months ago

Is PolyMur "strongly universal", and where can I gather that from the proof? Sorry if that's obvious.

In either case, my opinion is that if "non-strong universality" implies a collision bound only under the prerequisite of a "random choice among the family's functions", then it is practically useful only for MACs. OK, I'm not arguing with the theory, but theories have limits of applicability, and it's not always obvious from the theory itself where it isn't applicable.

Anyway, thanks for the discussion. I still think "almost universal" claims are a bit of an exaggeration, an unfair advantage. And I think most of the hash functions that pass all SMHasher tests are practically "universal"; most of them just have no formal proofs. After all, SMHasher was built to test some of the basic statistics that are also claimed for "universal" hash functions.

orlp commented 8 months ago

Is PolyMur "strongly universal", and where I can gather that from the proof? Sorry if that's obvious.

Polymur is almost-strongly universal. That's this claim from the proof:

If H is chosen randomly and independently then for any m != m' and any hash outcomes x, y we have

Pr[H(m) = x  &&  H(m') = y] <= n * 2^-124.2

Instead of phrasing it as a probability, the paper by Carter and Wegman simply counts the number of functions H in the family such that H(m) = x && H(m') = y. But it's the same concept, divide this count by the total number of possible functions to get the probability (since we choose the function at random from the family).

In either case, my opinion is that if "non-strong universality" implies collision bound only with a pre-requisite of "random choice of family functions", it is practically only useful for MACs. OK, not arguing with theory, but theories have their limits of applicability, and it's not always obvious from the theory itself where it isn't applicable.

Well, it's also useful in the case of hash tables, as it proves expected O(1) access time when using chaining, which is explicitly the intended purpose of Polymur. You can't just discount this proof "in your opinion": this is math, not opinion. Most modern hash table implementations already generate a random hash function per table, or per program, to protect against HashDoS.

Anyway, thanks for the discussion. I still think "almost universal" claims are a bit of an exaggeration, an unfair advantage. And I think most of the hash functions that pass all SMHasher tests are practically "universal" except most of them have no formal proofs. Because SMHasher was built to test some basic statistics that are also claimed for "universal" hash functions.

I don't think it's unfair. It is easy to make hash functions that pass SMHasher but for which an attacker can still construct arbitrary multicollisions.

avaneev commented 8 months ago

I don't think it's unfair. It is easy to make hash functions that pass SMHasher but for which an attacker can still construct arbitrary multicollisions.

I doubt it, if the function passes the Perlin Noise and Seed tests as well. But as the author of the SMHasher fork notes, quick brute-force generation of collisions is possible for all "fast" hashes. Except that server applications usually do not even give one the chance to flood them with unnecessary information.

At the moment I think I'll leave this 40-year-old "universality" alone, for the sake of a happier life. I can't swallow a proof that refers to |K| and its variation while functionally k is unchanging.