ogxd / gxhash

The fastest hashing algorithm 📈
https://docs.rs/gxhash
MIT License
767 stars 26 forks source link

Possible seed and value recovery #25

Closed Genbox closed 7 months ago

Genbox commented 10 months ago

For small inputs (<16 bytes), the length of the input is added to the input.

Value recovery During compressing, the byte array { 1, 2, 3, 4 } becomes a 16-byte vector <5, 6, 7, 8, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4> It is then run through Finalize, which runs 4 rounds of AES encrypt with static keys:

output = Encrypt(input, seed);
output = Encrypt(output, key1);
output = Encrypt(output, key2);
output = Encryptoutput, key3);

It is possible for an attacker who knows the seed to recover the original value by inverting the function.

output = Decrypt(input, key3);
output = Decrypt(output, key2);
output = Decrypt(output, key1);
output = Decrypt(output, seed);

We get a masked input, but can simply unmask it by taking the length at the end of the vector and subtract it from each of the values in the vector. <5, 6, 7, 8, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4> then becomes the original input <1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>

Invertibility in a non-cryptographic hash function is not an issue, but it makes certain attacks much easier.

Seed recovery In the case where an attacker does not know the seed, it is possible to recover the seed through brute-force, given only a single full-length hash output.

Let's say we are given a 128-bit hash value of <213, 184, 163, 143, 142, 216, 110, 155, 68, 66, 183, 166, 216, 225, 120, 52> We run it through our inverted hash function above and get <148, 13, 13, 13, 13, 13, 13, 207, 13, 13, 58, 13, 13, 144, 13, 13>

At this point we don't know that the original input was <5, 6, 7, 8, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4>, but we can brute-force the seed by running a single decrypt round: output = Decrypt(input, seed);

We know we have the right seed, if the result of decryption ends with a repeating integer (the length). We can filter false-positives by checking if the repeating characters start length-number-of-bytes into the vector. Once we have the seed, we can recover the original value.

ogxd commented 10 months ago

It is possible for an attacker who knows the seed to recover the original value by inverting the function.

Indeed, but how an attacker would know the seed?

it is possible to recover the seed through brute-force

A pure bruteforce approach takes on average 2127 iterations. Given a throughput of 10GiB/s, or 625000000 16-bytes hashes generated per second, that means this bruteforce attack would take 2127 / 625000000 = 2.72225891029 seconds or 8.61021 years.

Even if you pop a cluster of X servers with Y threads each this will take >> years.

The algorithm may have weaknesses, but I don't think bruteforcing seed is one. Also, if a seed were to be known, the seed is re-randomized for each map and each program instance.

Genbox commented 10 months ago

Indeed, but how an attacker would know the seed?

One way is that it defaults to 0 in some configurations. Another way is the seed recovery I described.

A pure bruteforce approach takes on average 2^127 iterations.

Since the seed is a 64-bit number, the worst case is 2^63 guesses. No need to brute-force all permutations of the 128-bit seed vector when the 64-bit value is enough.

Also, if a seed were to be known, the seed is re-randomized for each map and each program instance.

That is currently not a guarantee..

Genbox commented 10 months ago

Note that while MeowHash had stronger security guarantees, it suffered from the same vulnerabilities. If I remember correctly, before being fixed, it also mixed the seed at the end of the finalization. A common way to protect the seed is to mix it into the starting state instead.

Genbox commented 10 months ago

MeowHash 0.2 mixed it at the end with no other mixing, enabling recovery.

MeowHash 0.3 kept the end-mixing, but started mixing into each of the 8 lanes.

MeowHash 0.4 started using initial state mixing. This means an attacker brute-forcing the seed has to know the data being hashed in order to recover the seed.

MeowHash 0.5 (latest) still keeps that security guarantee while also switching to a larger seed for stronger security.

ogxd commented 10 months ago

Since the seed is a 64-bit number, the worst case is 2^63 guesses. No need to brute-force all permutations of the 128-bit seed vector when the 64-bit value is enough.

The seed is 128-bit https://github.com/ogxd/gxhash/blob/de2e5f9682bdc130ae06d313441af517b3b44df6/src/hasher.rs#L116-L125

That is currently not a guarantee..

That's not an algorithm issue then, that's an usage issue. We can debate if this API should be public or not, but I think it's a different subject.

Note that while MeowHash had stronger security guarantees, it suffered from the same vulnerabilities.

That's an incredibly detailed article! I was looking for something like this. It might take time to get through everything however 😅

MeowHash 0.4 started using initial state mixing. This means an attacker brute-forcing the seed has to know the data being hashed in order to recover the seed.

We can probably do the same 🤔

Genbox commented 10 months ago

The seed is 128-bit

I should have specified that I tested against the C# version. The seed is 64-bit there.

ogxd commented 10 months ago

The C# version was a quick upload for you to benchmark the performance, as you requested. Since, this repo (with rust impl) received feedback from this reddit thread and so I have made a few updates. The 2 most notable updates is the usage of a state wide seed (so 128-bit for the non AVX2 version) and applying the seed after the finalization (before it used to be right before the AES rounds in the finalization) I'll update the C# implementation when I have some time, but I don't plan to release an official GxHash nuget. I'd rather have it directly in System.IO.Hashing if the dotnet team approves it. Before proposing this officially I want to give it a bit more time, so I can continue collect feedback and on my end dive into some kind of cryptoanalysis. Your feedback and the links you shared help in this regard.

ogxd commented 7 months ago

Closing this as even the c# version of gxhash uses 128-bit seeds.