samuel-lucas6 / draft-lucas-bkdf

An Internet-Draft for the Balloon Key Derivation Function (BKDF), a memory-hard password hashing and password-based key derivation function.
Other
4 stars 1 forks source link

Fixes and suggestions #14

Closed Sc00bz closed 3 weeks ago

Sc00bz commented 2 months ago

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L184 List should be called Array. This can be called an array of arrays or 2D array. Maybe consider ByteArray and BlockArray which cover the two use cases.


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L186-L187 Consider making these 32 bit functions and keeping LE64(x) for counter. Since these are all defined as 32 bit max integers. Except MAX_SPACECOST but that's 0 which is less than MIN_SPACECOST. Also related: https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L218 Consider making spaceCost an integer 0 to 32 and represents 2**spaceCost blocks of memory.


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L189 Replace "integer" with "float" or "floating point". Or remove "the integer".


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L194 All mentions of XOF should be removed except this one because you are converting an XOF into a hash and this part might be missed. Also you are restricting XOFs more so than necessary. XOFs should not output more data than its internal state and should only call the internal base hash function the least amount of times to output anything (e.g. SHAKE128 has an internal state of 1600 bits but will call the internal base hash function more than the minimum for >1344 bits. Thus the max output for SHAKE128 is 1344 bits thus "SHAKE128-1344" or is it "SHAKE128/1344").


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L240 and https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L285 Consider removing LE64(length). For a KDF, let's say someone uses the exact output from Balloon then later realizes they need another key while preserving the original key. Then they would have to run Balloon twice, but if you remove LE64(length) then they can just ask for more output. Also in the Security Considerations section it should be stated to only run the password KDF once and derive all the keys needed from that output. Since an attacker will run which ever password KDF that will give them a password check (e.g. encryptionKey = Balloon(...); macKey = Balloon(...); and only do macKey = Balloon(...); checkMac(macKey, mac, data)).


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L289-L294 Consider changing this to:

previous = buffer[spaceCost - 1]
for t = 0 to timeCost - 1
    for m = 0 to spaceCost - 1

And add previous = buffer[m] to the end of the loop just after buffer[m] is set. A lot of implementations copied the paper verbatim on this or used an if statement. Also you can just return previous.


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L296 This should be more like the academic code. Where it's a bit stream and you grab what you need from it. https://github.com/henrycg/balloon/blob/35d7da79f42b4f9078ba6f39e1e0a5a37bafa4c1/libballoon/hash_state.c#L167 Since spaceCost is a power of 2 you could grab the exact number of bits needed which will save calculations. Also reconsider including the salt because PAKEs with an OPRF will have a salt that includes the password (oprfSalt = hash(password) ** serverSalt). Maybe something like Hash(LE64(counter++) || LE64(spaceCost) || LE64(timeCost) || LE64(parallelism) || LE64(iteration)). Also the paper's version is Hash(LE64(counter++) || salt || LE64(iteration)). Unless you want the access pattern to be the same across all threads thus enabling SIMD and better bandwidth utilization (but there's no mention of this).


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L305 Add:


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L338 See https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/issues/2#issuecomment-2177344918 Note the current draft is 3 H/block for an attacker which is the same for the "academic code" minimum settings.


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L340 OPRF salts are "variable" length based on elliptic curve or finite field prime. Also see https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/issues/7#issuecomment-2177358941 about "extra salts". Also for password hashing this can be as low at like 32 bits unless you're Facebook scale then 40-48 bits. It's a whole thing about max collisions being relatively low and most are unique. BUT obviously there's not much difference in storage space than 128 bit salts.


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L342 For password hashing, anything at least 80 bits is fine for anything properly key stretched but NIST would probably want at least 112 bits. BUT obviously there's not much difference in storage space than 128 bit hash. For key derivation, NIST recommends at least 112 bit (because of 3DES). Also this contradicts at least 128 bits and at most 256 bits from: https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L374


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L352 vs https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L363 Shouldn't that be $balloon-sha256$ or is it blake2b-512?


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L376 "Similarly, systems MUST check for overly large user-specified parameters (e.g. passwords) to prevent denial-of-service attacks." That's only for poorly designed algorithms which Balloon should not be. Currently it hashes the password and salt parallelism+1 times. It would be better if it was just once or twice.

There are 3 steps to a good key stretching algorithm: 1) Hash inputs: seed = H(inputs); [seedNoSecrets = H(non-secret-inputs)] 2) Do work: work = doWork(settings, seed[, seedNoSecrets]) 3) Output: out = kdf(work, (inputs or seed), outSize, ...)

One can make a good key stretching algorithm without those steps. This just prevents bad design choices (CVE-2014-9034, CVE-2016-20013) and bad implementations (CVE-2013-1443).


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L380 "authenticated encryption with associated data (AEAD) scheme" No, authenticated is not needed since it's a password hash that can be replaced. It's more important to be deterministically encrypted. In fact this is the one time ECB is safe to use unless you want to store more than one blocks worth of hash. In that case use NULL-IV-CBC-CTS. This way you can have an HSM that only encrypts and the server can compare encrypted outputs. I just realized this doesn't mention that the only thing that needs to be encrypted is the hash part. Also it should be decoded into binary data before encrypted.


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L382 NIST would probably say at least 112 bits and at most 256 bits.


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L392 It's more of a bandwidth vs memory transactions thing with GPUs. GPUs have wide memory buses. 128-384 bits for GeForce RTX 4000 series and Radeon RX 7000 series and historically up to 512 bits of GDDR. Other GPUs with HBM (vs GDDR) have bus widths of 1024 to 5120 bits.


https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L461 Test Vector 4 has an invalid spaceCost of 3.

samuel-lucas6 commented 2 months ago

Hi Steve, thanks for your feedback; I'll definitely be giving you a mention in the Acknowledgements. I'll start working on these and do a proper reply at the weekend when I'm hopefully feeling fresh.

Some of those things just haven't been updated whilst other parts have, like the test vectors. One major planned change is switching Hash() to PRF() to allow the use of HMAC. Then unkeyed hashes would use prefix MAC with the key padded to the block size. It's not ideal trying to cater for every possible hash function.

Sc00bz commented 2 months ago

One major planned change is switching Hash() to PRF() to allow the use of HMAC.

This will make it slower and it's already slow. I guess this will only make it 33% slower (3 vs 4 H/block). Assuming cached HMAC.

Then unkeyed hashes would use prefix MAC with the key padded to the block size.

Ah this won't make it slower. Assuming the implementation caches the state after the key is added. Unlike CVE-2013-1443, PBKDF2 DoS with long password. Besides doing twice the work for normal sized passwords (4*iterations vs 2*iterations+2).

samuel-lucas6 commented 2 months ago

Big thanks for this again. There's some great stuff here.

List should be called Array. This can be called an array of arrays or 2D array. Maybe consider ByteArray and BlockArray which cover the two use cases.

Yes, that's a good idea.

Consider making these 32 bit functions and keeping LE64(x) for counter. Since these are all defined as 32 bit max integers. Except MAX_SPACECOST but that's 0 which is less than MIN_SPACECOST. Also related:

LE64() was chosen for consistency everywhere, but I agree those could be switched to LE32(). The MAX_ parameters were chosen to avoid the counter overflowing.

Consider making spaceCost an integer 0 to 32 and represents 2**spaceCost blocks of memory.

That's sensible now it's a power of 2.

Replace "integer" with "float" or "floating point". Or remove "the integer".

How about the number x?

All mentions of XOF should be removed except this one because you are converting an XOF into a hash and this part might be missed.

Fair point.

Also you are restricting XOFs more so than necessary. XOFs should not output more data than its internal state and should only call the internal base hash function the least amount of times to output anything (e.g. SHAKE128 has an internal state of 1600 bits but will call the internal base hash function more than the minimum for >1344 bits. Thus the max output for SHAKE128 is 1344 bits thus "SHAKE128-1344" or is it "SHAKE128/1344").

I see what you mean. This was restricted for simplicity in terms of explaining what to do and because there are APIs where you use an XOF as a hash function. As an example, BLAKE3 can be used as a hash function or an XOF, which would lead to two incompatible variants for the same algorithm.

Consider removing LE64(length). For a KDF, let's say someone uses the exact output from Balloon then later realizes they need another key while preserving the original key. Then they would have to run Balloon twice, but if you remove LE64(length) then they can just ask for more output.

I get you. This was inspired by Argon2's variable-length hash function.

Also in the Security Considerations section it should be stated to only run the password KDF once and derive all the keys needed from that output. Since an attacker will run which ever password KDF that will give them a password check (e.g. encryptionKey = Balloon(...); macKey = Balloon(...); and only do macKey = Balloon(...); checkMac(macKey, mac, data)).

I agree. This is actually something I've been discussing with Henry because he proposed getting rid of the KDF and running Balloon repeatedly to generate more output like PBKDF2, which I've argued is a bad idea for this reason/because it reduces the parameters.

Consider changing this to: previous = buffer[spaceCost - 1] for t = 0 to timeCost - 1 for m = 0 to spaceCost - 1 And add previous = buffer[m] to the end of the loop just after buffer[m] is set. A lot of implementations copied the paper verbatim on this or used an if statement. Also you can just return previous.

Good spot. I'll rearrange this.

This should be more like the academic code. Where it's a bit stream and you grab what you need from it. https://github.com/henrycg/balloon/blob/35d7da79f42b4f9078ba6f39e1e0a5a37bafa4c1/libballoon/hash_state.c#L167 Since spaceCost is a power of 2 you could grab the exact number of bits needed which will save calculations.

I'm not sure what you mean here. It's written as pseudorandom = Hash(LE64(counter++) || salt) because hash APIs output a full hash.

Also reconsider including the salt because PAKEs with an OPRF will have a salt that includes the password (oprfSalt = hash(password) ** serverSalt). Maybe something like Hash(LE64(counter++) || LE64(spaceCost) || LE64(timeCost) || LE64(parallelism) || LE64(iteration)).

That's an interesting point. That was just from the paper.

Also the paper's version is Hash(LE64(counter++) || salt || LE64(iteration)). Unless you want the access pattern to be the same across all threads thus enabling SIMD and better bandwidth utilization (but there's no mention of this).

iteration was added recently rather than having domain separation in the salt, and I managed to miss that.

Add: x % spaceCost can be done with a bit mask x & (spaceCost - 1) because spaceCost is a power of 2. Instead of Ceiling(length / HASH_LEN) one can do Floor((length + HASH_LEN - 1) / HASH_LEN) or (length + HASH_LEN - 1) / HASH_LEN where / is integer divide. (Side note: I've seen people include math.h and convert to and from floating points.)

I'll add those.

See https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/issues/2#issuecomment-2177344918 Note the current draft is 3 H/block for an attacker which is the same for the "academic code" minimum settings.

Thanks for investigating that.

OPRF salts are "variable" length based on elliptic curve or finite field prime. Also see https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/issues/7#issuecomment-2177358941 about "extra salts". Also for password hashing this can be as low at like 32 bits unless you're Facebook scale then 40-48 bits. It's a whole thing about max collisions being relatively low and most are unique. BUT obviously there's not much difference in storage space than 128 bit salts.

Yeah, I haven't thought about PAKEs at all. They're not something I've looked into, and it doesn't sound like they're frequently used. The 128 bits recommendation is following the Argon2 RFC.

For password hashing, anything at least 80 bits is fine for anything properly key stretched but NIST would probably want at least 112 bits. BUT obviously there's not much difference in storage space than 128 bit hash. For key derivation, NIST recommends at least 112 bit (because of 3DES).

The password hash length recommendation could be changed to 128 bits. I'd prefer to leave the key derivation recommendation as is. These are only recommendations, not MUSTs.

Also this contradicts at least 128 bits and at most 256 bits from:

I don't think that does contradict anything because it's referring to the salt, which I've recommended 128 or 256 bits for before. SHOULD is equivalent to RECOMMENDED, and SHOULD NOT is equivalent to NOT RECOMMENDED, so they're not absolutes again.

Shouldn't that be $balloon-sha256$ or is it blake2b-512?

Yep, that hash needs to be updated. It was a SHA-256 example because the Rust implementation, which the draft was originally based on, used SHA-256 test vectors.

"Similarly, systems MUST check for overly large user-specified parameters (e.g. passwords) to prevent denial-of-service attacks." That's only for poorly designed algorithms which Balloon should not be. Currently it hashes the password and salt parallelism+1 times. It would be better if it was just once or twice.

Yes, I see that now. That's what people have meant about prehashing the password. The password will only be hashed once when the algorithm is switched to being keyed.

"authenticated encryption with associated data (AEAD) scheme" No, authenticated is not needed since it's a password hash that can be replaced. It's more important to be deterministically encrypted. In fact this is the one time ECB is safe to use unless you want to store more than one blocks worth of hash. In that case use NULL-IV-CBC-CTS. This way you can have an HSM that only encrypts and the server can compare encrypted outputs. I just realized this doesn't mention that the only thing that needs to be encrypted is the hash part. Also it should be decoded into binary data before encrypted.

I don't know how this is done in practice, but I've said AEAD schemes because they're more commonly used in general for encryption. I first heard about this idea from password_lock, which claims to do authenticated encryption and I think encrypts a string. I can mention that only the hash needs to be encrypted.

NIST would probably say at least 112 bits and at most 256 bits.

This is only a recommendation, and if NIST wanted to approve this, they could make their own recommendations. I don't want to be recommending anything below 128 bits.

It's more of a bandwidth vs memory transactions thing with GPUs. GPUs have wide memory buses. 128-384 bits for GeForce RTX 4000 series and Radeon RX 7000 series and historically up to 512 bits of GDDR. Other GPUs with HBM (vs GDDR) have bus widths of 1024 to 5120 bits.

Yeah, I won't pretend to understand cache hardness. It could do with being properly defined and explained somewhere in the literature. It sounds like you need an understanding of GPU architecture.

Test Vector 4 has an invalid spaceCost of 3.

The test vectors are from the old Rust implementation so are out of date. They'll be updated when the design is finalised, and it would be good to have test vectors for all the major hash functions, although that will be a pain.

This will make it slower and it's already slow. I guess this will only make it 33% slower (3 vs 4 H/block). Assuming cached HMAC.

Ah this won't make it slower. Assuming the implementation caches the state after the key is added. Unlike https://github.com/advisories/GHSA-4c42-4rxm-x6qf, PBKDF2 DoS with long password. Besides doing twice the work for normal sized passwords (4iterations vs 2iterations+2).

The idea is to allow the use of HMAC because that's probably what NIST wants whilst allowing everyone else to use an ordinary hash function/non-approved algorithm instead (e.g., BLAKE2). It's not ideal though because of things like keyed BLAKE2/BLAKE3 vs prefix MAC for BLAKE2/BLAKE3, which may cause confusion/interoperability issues.

The plan is to do key = PRF(emptyKey, password || salt || password.Length || salt.Length) and then use that key as the key everywhere else, except when computing the pseudorandom bytes since that should be password-independent. That can use an empty key again. This isn't ideal with prefix MAC but is like HKDF-Extract with HMAC.

This would allow easily supporting a pepper because that initial emptyKey could be the pepper, and it would be restricted to something like 512 bits to avoid HMAC key hashing, which was a HMAC design flaw.

Sc00bz commented 2 months ago

Replace "integer" with "float" or "floating point". Or remove "the integer".

How about the number x?

Yeah that's fine.

Consider removing LE64(length). For a KDF, let's say someone uses the exact output from Balloon then later realizes they need another key while preserving the original key. Then they would have to run Balloon twice, but if you remove LE64(length) then they can just ask for more output.

I get you. This was inspired by Argon2's variable-length hash function.

PBKDF2 doesn't do this. Which means Balloon doesn't need to do this. And it's better if it doesn't.

Also in the Security Considerations section it should be stated to only run the password KDF once and derive all the keys needed from that output. Since an attacker will run which ever password KDF that will give them a password check (e.g. encryptionKey = Balloon(...); macKey = Balloon(...); and only do macKey = Balloon(...); checkMac(macKey, mac, data)).

I agree. This is actually something I've been discussing with Henry because he proposed getting rid of the KDF and running Balloon repeatedly to generate more output like PBKDF2, which I've argued is a bad idea for this reason/because it reduces the parameters.

That is a bad idea. PBKDF2's footgun exists because they wanted to do the least amount of changes/effort to make PBKDF output variable length.

Microsoft used PBKDF2-HMAC-SHA1(pw, salt, iterations:4096, outLen:256 bits) as a hash because "it's good enough for WiFi". But an attacker only needs to calculate the first 160 bits. Doing half the work as a defender.

1Password used key || iv = PBKDF2-HMAC-SHA256(pw, salt, iterations:100000, outLen:384 bits) to decrypt a 256 bit key in CBC mode. Again an attacker only needs the first 256 bits and decrypt the padding and compare to "\x10\x10...\x10". Doing half the work as a defender.

I'm sure there are several others.

This should be more like the academic code. Where it's a bit stream and you grab what you need from it. https://github.com/henrycg/balloon/blob/35d7da79f42b4f9078ba6f39e1e0a5a37bafa4c1/libballoon/hash_state.c#L167 Since spaceCost is a power of 2 you could grab the exact number of bits needed which will save calculations.

I'm not sure what you mean here. It's written as pseudorandom = Hash(LE64(counter++) || salt) because hash APIs output a full hash.

Oh the counter is the same... OK counter = Ceiling(3 * spaceCostLog2 * 2^spaceCostLog2 * timeCost / HASH_BIT_LEN) instead of 0. Or have different counters maybe H("0" || LE64(counter0++) || ...) and H("1" || LE64(counter1++) || ...). And give a description or an example like pseudorandomStream={0x21, 0x43, 0x65, 0x87, ...} and pseudorandomStream.popBitsLE(12) returns 0x321 and the next call returns 0x654.

counter = Ceiling(((3 * spaceCostLog2 * timeCost) << spaceCostLog2) / HASH_BIT_LEN)
pseudorandomStream =
    PRF(ZeroPad(emptyKey, HASH_LEN), LE64(0) || LE64(iteration) || salt) ||
    PRF(ZeroPad(emptyKey, HASH_LEN), LE64(1) || LE64(iteration) || salt) ||
    PRF(ZeroPad(emptyKey, HASH_LEN), LE64(2) || LE64(iteration) || salt) ||
    ...
    PRF(ZeroPad(emptyKey, HASH_LEN), LE64(counter-1) || LE64(iteration) || salt)

buffer[0] = PRF(key, LE64(counter++) || LE64(spaceCost) || LE64(timeCost) || LE64(parallelism) || LE64(length) || LE64(iteration))
for m = 1 to spaceCost - 1
    buffer[m] = PRF(key, LE64(counter++) || buffer[m - 1])

emptyKey = ByteArray(0)
previous = buffer[spaceCost - 1]
for t = 0 to timeCost - 1
    for m = 0 to spaceCost - 1
        other1 = pseudorandomStream.popBitsLE(spaceCostLog2)
        other2 = pseudorandomStream.popBitsLE(spaceCostLog2)
        other3 = pseudorandomStream.popBitsLE(spaceCostLog2)
        buffer[m] = PRF(key, LE64(counter++) || previous || buffer[m] || buffer[other1] || buffer[other2] || buffer[other3])
        previous = buffer[m]

Also reconsider including the salt because PAKEs with an OPRF will have a salt that includes the password (oprfSalt = hash(password) ** serverSalt). Maybe something like Hash(LE64(counter++) || LE64(spaceCost) || LE64(timeCost) || LE64(parallelism) || LE64(iteration)).

That's an interesting point. That was just from the paper.

I remember there being a conversation on the PHC mailing list about whether salt should be included for secret independent reads. I remember at least one submission that did this. I'm pretty sure it was determined to be better not to include salt. Also Argon2i doesn't include the salt.

OPRF salts are "variable" length based on elliptic curve or finite field prime. Also see https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/issues/7#issuecomment-2177358941 about "extra salts". Also for password hashing this can be as low at like 32 bits unless you're Facebook scale then 40-48 bits. It's a whole thing about max collisions being relatively low and most are unique. BUT obviously there's not much difference in storage space than 128 bit salts.

Yeah, I haven't thought about PAKEs at all. They're not something I've looked into, and it doesn't sound like they're frequently used. The 128 bits recommendation is following the Argon2 RFC.

WhatsApp, and 1Password use PAKEs. I think also Apple and Signal. The RFC recommends 128 bit salts, but they are very lax on requirements 8 to 2^(32)-1 bytes:

"Nonce S, which is a salt for password hashing applications. It MUST have a length not greater than 2^(32)-1 bytes. 16 bytes is RECOMMENDED for password hashing. The salt SHOULD be unique for each password." https://www.rfc-editor.org/rfc/rfc9106.html#section-3.1-2.2

"Select the salt length. A length of 128 bits is sufficient for all applications but can be reduced to 64 bits in the case of space constraints." https://www.rfc-editor.org/rfc/rfc9106.html#section-4-6.7

Also this contradicts at least 128 bits and at most 256 bits from:

I don't think that does contradict anything because it's referring to the salt, which I've recommended 128 or 256 bits for before. SHOULD is equivalent to RECOMMENDED, and SHOULD NOT is equivalent to NOT RECOMMENDED, so they're not absolutes again.

Yeah I read that wrong.

This will make it slower and it's already slow. I guess this will only make it 33% slower (3 vs 4 H/block). Assuming cached HMAC.

Ah this won't make it slower. Assuming the implementation caches the state after the key is added. Unlike https://github.com/advisories/GHSA-4c42-4rxm-x6qf, PBKDF2 DoS with long password. Besides doing twice the work for normal sized passwords (4iterations vs 2iterations+2).

The idea is to allow the use of HMAC because that's probably what NIST wants whilst allowing everyone else to use an ordinary hash function/non-approved algorithm instead (e.g., BLAKE2). It's not ideal though because of things like keyed BLAKE2/BLAKE3 vs prefix MAC for BLAKE2/BLAKE3, which may cause confusion/interoperability issues.

The plan is to do key = PRF(emptyKey, password || salt || password.Length || salt.Length) and then use that key as the key everywhere else, except when computing the pseudorandom bytes since that should be password-independent. That can use an empty key again. This isn't ideal with prefix MAC but is like HKDF-Extract with HMAC.

This would allow easily supporting a pepper because that initial emptyKey could be the pepper, and it would be restricted to something like 512 bits to avoid HMAC key hashing, which was a HMAC design flaw.

Prefix MAC for BLAKE2/BLAKE3 is bad because it doesn't do the block operation until there's more than a block of data. Since the last block can be full. I feel like I'm saying this weirdly. The last block has a flag that states it's the last block and it can be full and no data is added for padding. Thus giving it a full block after initialization it will wait for finish to be called or more data before it mixes the block. So if it was PRF(key, BE64(counter++) || previous || buffer[m] || buffer[other1] || buffer[other2] || buffer[other3]) then one could blake2b_update(ctx, ZeroPad(key, 128)), blake2b_update(ctx, "\0"), make a copy of ctx, blake2b_update(ctx, BE56(counter++))... Or if it was PRF(key, "1" || LE64(counter1++) || previous || buffer[m] || buffer[other1] || buffer[other2] || buffer[other3]) (using two counters like above).

It would be best to state that Balloon with prefix MAC is invalid for BLAKE2 and BLAKE3. And that the only valid method is using the built in MAC for BLAKE2 and BLAKE3.

samuel-lucas6 commented 2 months ago

Sorry for my slow replies, got a lot going on at the moment.

PBKDF2 doesn't do this. Which means Balloon doesn't need to do this. And it's better if it doesn't.

Yeah, I'll try to remove this at the weekend.

That is a bad idea. PBKDF2's footgun exists because they wanted to do the least amount of changes/effort to make PBKDF output variable length.

Thanks for the examples. I was aware of 1Password having a problem.

However, it turns out that I've misunderstood what Henry meant. His proposal is actually to grab more bytes/blocks from the final mixed buffer, meaning you don't have to do any extra hashing (no KDF and no rerunning the algorithm).

Of course, the obvious limitation is that the output length is now limited by the space cost. To get around that, you end up rerunning the Mix phase, which introduces this PBKDF2 footgun again and complicates the code. Another issue might be with collision resistance when parallelism is used.

I understand the appeal, but I think using a KDF is cleaner. I'm interested to know your thoughts.

Oh the counter is the same... OK counter = Ceiling(3 spaceCostLog2 2^spaceCostLog2 * timeCost / HASH_BIT_LEN) instead of 0. Or have different counters maybe H("0" || LE64(counter0++) || ...) and H("1" || LE64(counter1++) || ...). And give a description or an example like pseudorandomStream={0x21, 0x43, 0x65, 0x87, ...} and pseudorandomStream.popBitsLE(12) returns 0x321 and the next call returns 0x654.

I'm a bit confused by your last sentence, and is the purpose of using separate counters to allow easier precomputation?

I remember there being a conversation on the PHC mailing list about whether salt should be included for secret independent reads. I remember at least one submission that did this. I'm pretty sure it was determined to be better not to include salt. Also Argon2i doesn't include the salt.

I'll try to find that.

WhatsApp, and 1Password use PAKEs. I think also Apple and Signal. The RFC recommends 128 bit salts, but they are very lax on requirements 8 to 2^(32)-1 bytes:

Yeah, I can imagine some of the bigger companies using them/them being used more in the future.

And yes, I could add a comment about smaller salts being acceptable, but the wording doesn't prohibit any size of salt from being used currently. It can even be empty because that's what the scrypt and Argon2 RFCs seem to allow.

It would be best to state that Balloon with prefix MAC is invalid for BLAKE2 and BLAKE3. And that the only valid method is using the built in MAC for BLAKE2 and BLAKE3.

I've been planning to make it a bit clearer. I'll say something like If the hash function supports a key parameter (e.g. BLAKE2 [RFC]), it MUST be used. Otherwise, if the hash function does not have a key parameter, prefix MAC MUST be used....

Sc00bz commented 2 months ago

Sorry for my slow replies, got a lot going on at the moment.

Don't worry about.

That is a bad idea. PBKDF2's footgun exists because they wanted to do the least amount of changes/effort to make PBKDF output variable length.

Thanks for the examples. I was aware of 1Password having a problem.

However, it turns out that I've misunderstood what Henry meant. His proposal is actually to grab more bytes/blocks from the final mixed buffer, meaning you don't have to do any extra hashing (no KDF and no rerunning the algorithm).

Of course, the obvious limitation is that the output length is now limited by the space cost. To get around that, you end up rerunning the Mix phase, which introduces this PBKDF2 footgun again and complicates the code. Another issue might be with collision resistance when parallelism is used.

I understand the appeal, but I think using a KDF is cleaner. I'm interested to know your thoughts.

Definitely KDF. If you output the first hash from the memory an attacker can skip the last iteration. Also currently it's 4*t*m*p hash blocks to do the main work and HKDF is 2*ceiling(outputLength/hashLength)+6 hash blocks. Getting 4*t*m*p down to ~3.2*t*m*p will benefit much more. Using a pseudorandom stream will remove about 0.8*t*m*p hash blocks.

Oh the counter is the same... OK counter = Ceiling(3 spaceCostLog2 2^spaceCostLog2 * timeCost / HASH_BIT_LEN) instead of 0. Or have different counters maybe H("0" || LE64(counter0++) || ...) and H("1" || LE64(counter1++) || ...). And give a description or an example like pseudorandomStream={0x21, 0x43, 0x65, 0x87, ...} and pseudorandomStream.popBitsLE(12) returns 0x321 and the next call returns 0x654.

I'm a bit confused by your last sentence

pseudorandomStream={0x21, 0x43, 0x65, 0x87, ...} is defining the example byte data in pseudorandomStream and pseudorandomStream.popBitsLE(12) gets 12 bits from pseudorandomStream which is 0x321 == (0x21 | (0x43 << 8)) & ((1 << 12) - 1). Now pseudorandomStream={0x4 (4 bits), 0x65, 0x87, ...} and the next pseudorandomStream.popBitsLE(12) call returns 0x654 == (0x4 | (0x65 << 4)) & ((1 << 12) - 1).

and is the purpose of using separate counters to allow easier precomputation?

No, I was trying to keep the counters from overlapping or having explicit domain separation with "0" and "1".

X = Ceiling(3 * spaceCostLog2 * 2^spaceCostLog2 * timeCost / HASH_BIT_LEN)

If you start the pseudorandom stream counter at 0 it will end at X-1. Starting the other counter at X will make sure they don't overlap. You could just start both at zero because you're hashing two different fixed length strings (1, 64 bit int and 4, 32 bit ints vs 1, 64 bit int and 5 hashes).

samuel-lucas6 commented 2 months ago

Ok I've addressed most of this feedback now. I think the outstanding points are:

I like the spaceCost idea from a user's perspective, but it feels a bit awkward to write in the draft. I already wrote it up and reversed the changes.

I don't understand how the pseudorandom stream idea affects the hash blocks because you still need to compute the same number of hashes surely. If we ignore the separate counter thing, it just feels like moving the computation from inside the loop to before any loops.

Sc00bz commented 2 months ago
  • LE32() for lengths

It doesn't really matter except for here and with SHA-256 because this will be 2 blocks:

pseudorandom = PRF(ZeroPad(emptyKey, HASH_LEN),
                   LE64(counter++) || LE64(spaceCost) || LE64(timeCost) || LE64(parallelism) || LE64(iteration))

Wait a second shouldn't HASH_LEN be HASH_BLOCK_LEN?

  • spaceCost being an integer between 1-32

[...] I like the spaceCost idea from a user's perspective, but it feels a bit awkward to write in the draft. I already wrote it up and reversed the changes.

Well 0-32 because current range is 1 (2**0) to 4294967296 (2**32). This also doesn't need to be in the specification. It will be like scrypt's N. Which some implementations ask for log2 of N. I don't know how it's stored in the standard hash...

$7$DU..../....2Q9obwLhin8qvQl6sisAO/$sHayJj/JBdcuD4lJ1AxiwCo9e5XSi8TcINcmyID12i8

OK?... Ahh https://github.com/besser82/libxcrypt/blob/72f75aa370ae96ccd2cc44ea3cf4182d8679ffbe/lib/crypt-scrypt.c#L222 it's log2 of N. That hash decodes to N=1<<b64ToInt("D"), r=b64ToInt("U...."), p=b64ToInt("/...."), salt="2Q9obwLhin8qvQl6sisAO/", hash="sHayJj/JBdcuD4lJ1AxiwCo9e5XSi8TcINcmyID12i8"... I give up I'll just assume it's base64 with "./0-9A-Za-z". So N=2**15, r=32, p=1. I hate this decoder key BS to figure out the settings. "Wow you saved 2 characters and wasted an hour of my time" (and likely several other people's time too).

I also found a non-standard one that did N$r$p$salt$hash.

  • A separate counter for the memory accesses

[...] I don't understand how the pseudorandom stream idea affects the hash blocks because you still need to compute the same number of hashes surely. If we ignore the separate counter thing, it just feels like moving the computation from inside the loop to before any loops.

Using the pseudorandom stream idea with SHA-512 and 4 MiB (spaceCost=2**16), you do 1 hash for every 10.67 blocks vs every block. With those settings, you are only using 16 bits for each random offset (48 bits/block and 512/48=10.67 blocks). I benchmarked this on a i5-6500 with DDR4-2133 (average of 10 runs) and it's 368 ms vs 555 ms (SHA-512, 4 MiB (spaceCost=2**16), timeCost=3) and 643 ms vs 795 ms (SHA-256, 4 MiB (spaceCost=2**17), timeCost=3). I did 1 MiB to 64 MiB and without the pseudorandom stream it takes 1.57x to 1.45x longer for SHA-512 and 1.29x to 1.14x longer for SHA-256. If you graph these they both start at the maximum at 1 MiB then decrease to the minimum at 16 MiB then increase a little. I also did SHAKE128/1024 and it's 1.33x to 1.24x. I need a better implementation of SHAKE128 that uses AVX2. Also it would be interesting to see how SHA-256 performs with SHA instructions (and SHA-512 when CPUs with those instructions are available).

Oh right, the SHA-256 code encodes spaceCost, timeCost, parallelism, and iteration as 32 bit integers. I also did everything in "hash endian" to skip conversions. Also I used functions hash_updateNative(...) and hash_finishNative(...) that take uint32_ts or uint64_ts depending on hash.

Sc00bz commented 2 months ago

Wait a second shouldn't HASH_LEN be HASH_BLOCK_LEN?

Oops, no. PRF(key, message) I should of noticed the comma vs concatenation since I split the line right there. I was thinking of prefix MAC. Also I said "this will be 2 blocks" I meant with cached prefix MAC or cached keyed BLAKE2 etc.

samuel-lucas6 commented 2 months ago

It doesn't really matter except for here and with SHA-256 because this will be 2 blocks:

pseudorandom = PRF(ZeroPad(emptyKey, HASH_LEN), LE64(counter++) || LE64(spaceCost) || LE64(timeCost) || LE64(parallelism) || LE64(iteration))

Ignoring the key, I think the message should fit in 1 block (40 bytes/320 bits when the block size is 512 bits).

Wait a second shouldn't HASH_LEN be HASH_BLOCK_LEN?

It's a bit confusing, but I say you MUST perform prefix MAC and pad the key with zeros to the block size (1024 bits for SHA-512) for non-keyed hashes in the Conventions and Definitions section.

The reason the key is padded to HASH_LEN here is because that's what HKDF-Extract does for the salt, and you can't write HASH_BLOCK_LEN because that would exceed the BLAKE2 key size, for example. It's problematic to specify when the supported key size varies by algorithm.

Well 0-32 because current range is 1 (20) to 4294967296 (232). This also doesn't need to be in the specification. It will be like scrypt's N. Which some implementations ask for log2 of N.

Yeah, my bad.

Good point, although it would encourage a consistent API like what's been done with the associatedData parameter. Otherwise, I doubt most people would implement it that way. And as you say, the stored hash should be consistent and not difficult to parse.

Using the pseudorandom stream idea with SHA-512 and 4 MiB (spaceCost=2**16), you do 1 hash for every 10.67 blocks vs every block. With those settings, you are only using 16 bits for each random offset (48 bits/block and 512/48=10.67 blocks).

Oh I think I'm getting it now. So, the idea is to use more of each hash rather than using truncated hashes? Then the second part of the idea is to switch from LE64() to LE16() to get even more random indices out of each hash? Although how is 16 bits enough for a random index when spaceCost can be up to 4294967296?

Sc00bz commented 2 months ago

Just found a bug in my code. It's now faster:

SHA-512, 4 MiB (spaceCost=2**16), timeCost=3: 368 ms vs 555 ms 258 ms vs 327 ms

SHA-256, 4 MiB (spaceCost=2**17), timeCost=3: 643 ms vs 795 ms 465 ms vs 551 ms

1 MiB to 64 MiB and without the pseudorandom stream it takes 1.57x to 1.45x 1.30x to 1.17x longer for SHA-512 and 1.29x to 1.14x 1.22x to 1.08x longer for SHA-256. If you graph these they both start at the maximum at 1 MiB then decrease to the minimum at 16 32 MiB then increase a little. I also did SHAKE128/1024 and it's 1.33x to 1.24x 1.20x to 1.10x.

It doesn't really matter except for here and with SHA-256 because this will be 2 blocks: pseudorandom = PRF(ZeroPad(emptyKey, HASH_LEN), LE64(counter++) || LE64(spaceCost) || LE64(timeCost) || LE64(parallelism) || LE64(iteration))

Ignoring the key, I think the message should fit in 1 block (40 bytes/320 bits when the block size is 512 bits).

Wait a second shouldn't HASH_LEN be HASH_BLOCK_LEN?

It's a bit confusing, but I say you MUST perform prefix MAC and pad the key with zeros to the block size (1024 bits for SHA-512) for non-keyed hashes in the Conventions and Definitions section.

The reason the key is padded to HASH_LEN here is because that's what HKDF-Extract does for the salt, and you can't write HASH_BLOCK_LEN because that would exceed the BLAKE2 key size, for example. It's problematic to specify when the supported key size varies by algorithm.

Yeah ignore that. I think my brain seg faulted.

Using the pseudorandom stream idea with SHA-512 and 4 MiB (spaceCost=2**16), you do 1 hash for every 10.67 blocks vs every block. With those settings, you are only using 16 bits for each random offset (48 bits/block and 512/48=10.67 blocks).

Oh I think I'm getting it now. So, the idea is to use more of each hash rather than using truncated hashes? Then the second part of the idea is to switch from LE64() to LE16() to get even more random indices out of each hash? Although how is 16 bits enough for a random index when spaceCost can be up to 4294967296?

It's not always 16 bits. It's log 2 of spaceCost bits. Here's the code I used... maybe I should just post the whole thing and not just the pseudorandom stream. It has something that I was going to bring up after this issue is closed. Also HASH_UINT_BLOCK is 8 for SHA2 and 16 for SHAKE128. hash_uint is uint32_t for SHA-256 and uint64_t for SHA-512 and SHAKE128.

struct balloonRand_ctx
{
    uint64_t   streamCounter;
    hash_ctx  *ctx;
    hash_uint *stream;
    hash_uint *streamData;
    hash_uint *mem;
    uint32_t   memory; // log2(spaceCost)
    uint32_t   memoryMask; // spaceCost - 1
    uint32_t   streamDataLeft;
};

hash_uint *hash_balloonRand(balloonRand_ctx &ctx)
{
    hash_uint *streamData     = ctx.streamData;
    uint32_t   memory         = ctx.memory;
    uint32_t   streamDataLeft = ctx.streamDataLeft;
    size_t     curPos         = streamDataLeft / (8 * sizeof(hash_uint));
    uint32_t   bits           = streamDataLeft % (8 * sizeof(hash_uint));
    hash_uint  offset         = streamData[curPos];

    if (streamDataLeft < memory)
    {
        // Hash
        uint64_t streamCounter = ctx.streamCounter;
#ifdef HASH_UINT_IS_64
        ctx.stream[1] = streamCounter;
        hash_updateNative(ctx.ctx, ctx.stream + 1, 4); // 4 uint64_t's
#else
        ctx.stream[0] = (uint32_t) (streamCounter >> 32);
        ctx.stream[1] = (uint32_t)  streamCounter;
        hash_updateNative(ctx.ctx, ctx.stream, 5); // 5 uint32_t's
#endif
        hash_finishNative(ctx.ctx, streamData);
        ctx.streamCounter = streamCounter + 1;
        streamDataLeft += 8 * HASH_SIZE;

        // Add new stream data
        hash_uint tmp  = streamData[HASH_UINT_BLOCK - 1];
        offset        |= tmp << bits;

        // Update stream data
        streamData[HASH_UINT_BLOCK - 1] = tmp >> (memory - bits);
    }
    else if (bits < memory)
    {
        // Add stream data
        hash_uint tmp  = streamData[curPos - 1];
        offset        |= tmp << bits;

        // Update stream data
        streamData[curPos - 1] = tmp >> (memory - bits);
    }
    else
    {
        // Update stream data
        streamData[curPos] = offset >> memory;
    }
    ctx.streamDataLeft = streamDataLeft - memory;

    return ctx.mem + HASH_UINT_BLOCK * (offset & ctx.memoryMask);
}
balloonRand_ctx randCtx;
hash_ctx   ctx2;
hash_uint *mem;
hash_uint  stream[5] = {0, 0, memory, iterations, parallelism};
hash_uint  streamData[HASH_UINT_BLOCK];

mem = new hash_uint[((size_t) HASH_UINT_BLOCK) << memory];
randCtx.streamCounter  = 0;
randCtx.ctx            = &ctx2;
randCtx.stream         = stream;
randCtx.streamData     = streamData;
randCtx.mem            = mem;
randCtx.memory         = memory;
randCtx.memoryMask     = (1 << memory) - 1;
randCtx.streamDataLeft = 0;

// ...

hash_uint *other1 = hash_balloonRand(randCtx);
hash_uint *other2 = hash_balloonRand(randCtx);
hash_uint *other3 = hash_balloonRand(randCtx);
samuel-lucas6 commented 1 month ago

spaceCost being an integer between 0-32

This is done now. What do you think of the pseudocode for this/encoding 2^spaceCost rather than spaceCost?

Supporting larger outputs with XOFs (the block size/rate)

I've said 1024 bits (just realised there was a typo in that commit where I said 1024 bytes) if using an XOF.

I'm not especially happy with it. It works for SHAKE/KangarooTwelve/MarsupilamiFourteen but not BLAKE3. I'd say the block size, except that doesn't account for padding/being a multiple of a cache line. Maybe I should say something like the nearest power of 2 to the block size.

It's not always 16 bits. It's log 2 of spaceCost bits.

My concern is this looks more difficult to implement than just LE32() or LE64(), and ease of understanding/implementation is one of the main selling points of the algorithm.

samuel-lucas6 commented 1 month ago

@Sc00bz I think I've addressed all of your feedback now.

I'm not doing log2(spaceCost) at the moment but switched from LE64() to LE32() besides the counter and am precomputing the pseudorandom bytes, which should still be a significant performance boost (e.g., 3 random indices per hash => 7, 8, 12, 16, or 32 per hash depending on the hash function/XOF). The downside is the additional complexity and increased memory usage.

It has something that I was going to bring up after this issue is closed.

What are you talking about here? Are you happy to close this issue?

samuel-lucas6 commented 1 month ago

@Sc00bz I found the conversations on the phc-discussions mailing list you were talking about by converting the list to files and searching for salt, password-independent, data-independent, salt-dependent, visitation pattern, and access pattern:

Pros:

Cons:

On reflection, I think removing the salt from the access pattern derivation is a bad idea since the attacker potentially only has to precompute it once for all users in a database. This feels like a bigger problem than the cons.

I should've done this research prior to making the change really, but the mailing list has no method of searching and there are thousands of emails. It took hours to compile this list.


As for log2(spaceCost) vs LE32(spaceCost), I'm in two minds about this. The former is clearly more efficient, but it's more error prone and annoying to implement. There will be an existing API in standard libraries for LE32(), whereas the implementer has to do bit fiddling to grab something like 17 bits. Some values don't evenly divide either.

Balloon was really designed as an algorithm that just calls other existing functions. That's why the performance isn't optimal.

samuel-lucas6 commented 3 weeks ago

Since there's been no reply for a month, I'm going to close this issue. I'll open some separate issues for some outstanding questions.