Fixes and suggestions - Githubissues

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L184 List should be called Array. This can be called an array of arrays or 2D array. Maybe consider ByteArray and BlockArray which cover the two use cases.

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L186-L187 Consider making these 32 bit functions and keeping LE64(x) for counter. Since these are all defined as 32 bit max integers. Except MAX_SPACECOST but that's 0 which is less than MIN_SPACECOST. Also related: https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L218 Consider making spaceCost an integer 0 to 32 and represents 2**spaceCost blocks of memory.

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L189 Replace "integer" with "float" or "floating point". Or remove "the integer".

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L194 All mentions of XOF should be removed except this one because you are converting an XOF into a hash and this part might be missed. Also you are restricting XOFs more so than necessary. XOFs should not output more data than its internal state and should only call the internal base hash function the least amount of times to output anything (e.g. SHAKE128 has an internal state of 1600 bits but will call the internal base hash function more than the minimum for >1344 bits. Thus the max output for SHAKE128 is 1344 bits thus "SHAKE128-1344" or is it "SHAKE128/1344").

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L240 and https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L285 Consider removing LE64(length). For a KDF, let's say someone uses the exact output from Balloon then later realizes they need another key while preserving the original key. Then they would have to run Balloon twice, but if you remove LE64(length) then they can just ask for more output. Also in the Security Considerations section it should be stated to only run the password KDF once and derive all the keys needed from that output. Since an attacker will run which ever password KDF that will give them a password check (e.g. encryptionKey = Balloon(...); macKey = Balloon(...); and only do macKey = Balloon(...); checkMac(macKey, mac, data)).

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L289-L294 Consider changing this to:

previous = buffer[spaceCost - 1]
for t = 0 to timeCost - 1
    for m = 0 to spaceCost - 1

And add previous = buffer[m] to the end of the loop just after buffer[m] is set. A lot of implementations copied the paper verbatim on this or used an if statement. Also you can just return previous.

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L296 This should be more like the academic code. Where it's a bit stream and you grab what you need from it. https://github.com/henrycg/balloon/blob/35d7da79f42b4f9078ba6f39e1e0a5a37bafa4c1/libballoon/hash_state.c#L167 Since spaceCost is a power of 2 you could grab the exact number of bits needed which will save calculations. Also reconsider including the salt because PAKEs with an OPRF will have a salt that includes the password (oprfSalt = hash(password) ** serverSalt). Maybe something like Hash(LE64(counter++) || LE64(spaceCost) || LE64(timeCost) || LE64(parallelism) || LE64(iteration)). Also the paper's version is Hash(LE64(counter++) || salt || LE64(iteration)). Unless you want the access pattern to be the same across all threads thus enabling SIMD and better bandwidth utilization (but there's no mention of this).

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L305 Add:

x % spaceCost can be done with a bit mask x & (spaceCost - 1) because spaceCost is a power of 2.
Instead of Ceiling(length / HASH_LEN) one can do Floor((length + HASH_LEN - 1) / HASH_LEN) or (length + HASH_LEN - 1) / HASH_LEN where / is integer divide. (Side note: I've seen people include math.h and convert to and from floating points.)

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L338 See https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/issues/2#issuecomment-2177344918 Note the current draft is 3 H/block for an attacker which is the same for the "academic code" minimum settings.

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L340 OPRF salts are "variable" length based on elliptic curve or finite field prime. Also see https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/issues/7#issuecomment-2177358941 about "extra salts". Also for password hashing this can be as low at like 32 bits unless you're Facebook scale then 40-48 bits. It's a whole thing about max collisions being relatively low and most are unique. BUT obviously there's not much difference in storage space than 128 bit salts.

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L342 For password hashing, anything at least 80 bits is fine for anything properly key stretched but NIST would probably want at least 112 bits. BUT obviously there's not much difference in storage space than 128 bit hash. For key derivation, NIST recommends at least 112 bit (because of 3DES). Also this contradicts at least 128 bits and at most 256 bits from: https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L374

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L352 vs https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L363 Shouldn't that be $balloon-sha256$ or is it blake2b-512?

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L376 "Similarly, systems MUST check for overly large user-specified parameters (e.g. passwords) to prevent denial-of-service attacks." That's only for poorly designed algorithms which Balloon should not be. Currently it hashes the password and salt parallelism+1 times. It would be better if it was just once or twice.

There are 3 steps to a good key stretching algorithm: 1) Hash inputs: seed = H(inputs); [seedNoSecrets = H(non-secret-inputs)] 2) Do work: work = doWork(settings, seed[, seedNoSecrets]) 3) Output: out = kdf(work, (inputs or seed), outSize, ...)

One can make a good key stretching algorithm without those steps. This just prevents bad design choices (CVE-2014-9034, CVE-2016-20013) and bad implementations (CVE-2013-1443).

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L380 "authenticated encryption with associated data (AEAD) scheme" No, authenticated is not needed since it's a password hash that can be replaced. It's more important to be deterministically encrypted. In fact this is the one time ECB is safe to use unless you want to store more than one blocks worth of hash. In that case use NULL-IV-CBC-CTS. This way you can have an HSM that only encrypts and the server can compare encrypted outputs. I just realized this doesn't mention that the only thing that needs to be encrypted is the hash part. Also it should be decoded into binary data before encrypted.

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L382 NIST would probably say at least 112 bits and at most 256 bits.

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L392 It's more of a bandwidth vs memory transactions thing with GPUs. GPUs have wide memory buses. 128-384 bits for GeForce RTX 4000 series and Radeon RX 7000 series and historically up to 512 bits of GDDR. Other GPUs with HBM (vs GDDR) have bus widths of 1024 to 5120 bits.

https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/blob/bfa686ed8e1bf5e3b81d35316ba27d5d23bea41b/draft-lucas-balloon-hashing.md?plain=1#L461 Test Vector 4 has an invalid spaceCost of 3.

Hi Steve, thanks for your feedback; I'll definitely be giving you a mention in the Acknowledgements. I'll start working on these and do a proper reply at the weekend when I'm hopefully feeling fresh.

Some of those things just haven't been updated whilst other parts have, like the test vectors. One major planned change is switching Hash() to PRF() to allow the use of HMAC. Then unkeyed hashes would use prefix MAC with the key padded to the block size. It's not ideal trying to cater for every possible hash function.

One major planned change is switching Hash() to PRF() to allow the use of HMAC.

This will make it slower and it's already slow. I guess this will only make it 33% slower (3 vs 4 H/block). Assuming cached HMAC.

Then unkeyed hashes would use prefix MAC with the key padded to the block size.

Ah this won't make it slower. Assuming the implementation caches the state after the key is added. Unlike CVE-2013-1443, PBKDF2 DoS with long password. Besides doing twice the work for normal sized passwords (4*iterations vs 2*iterations+2).

Big thanks for this again. There's some great stuff here.

List should be called Array. This can be called an array of arrays or 2D array. Maybe consider ByteArray and BlockArray which cover the two use cases.

Yes, that's a good idea.

Consider making these 32 bit functions and keeping LE64(x) for counter. Since these are all defined as 32 bit max integers. Except MAX_SPACECOST but that's 0 which is less than MIN_SPACECOST. Also related:

LE64() was chosen for consistency everywhere, but I agree those could be switched to LE32(). The MAX_ parameters were chosen to avoid the counter overflowing.

Consider making spaceCost an integer 0 to 32 and represents 2**spaceCost blocks of memory.

That's sensible now it's a power of 2.

Replace "integer" with "float" or "floating point". Or remove "the integer".

How about the number x?

All mentions of XOF should be removed except this one because you are converting an XOF into a hash and this part might be missed.

Fair point.

Also you are restricting XOFs more so than necessary. XOFs should not output more data than its internal state and should only call the internal base hash function the least amount of times to output anything (e.g. SHAKE128 has an internal state of 1600 bits but will call the internal base hash function more than the minimum for >1344 bits. Thus the max output for SHAKE128 is 1344 bits thus "SHAKE128-1344" or is it "SHAKE128/1344").

I see what you mean. This was restricted for simplicity in terms of explaining what to do and because there are APIs where you use an XOF as a hash function. As an example, BLAKE3 can be used as a hash function or an XOF, which would lead to two incompatible variants for the same algorithm.

Consider removing LE64(length). For a KDF, let's say someone uses the exact output from Balloon then later realizes they need another key while preserving the original key. Then they would have to run Balloon twice, but if you remove LE64(length) then they can just ask for more output.

I get you. This was inspired by Argon2's variable-length hash function.

Also in the Security Considerations section it should be stated to only run the password KDF once and derive all the keys needed from that output. Since an attacker will run which ever password KDF that will give them a password check (e.g. encryptionKey = Balloon(...); macKey = Balloon(...); and only do macKey = Balloon(...); checkMac(macKey, mac, data)).

I agree. This is actually something I've been discussing with Henry because he proposed getting rid of the KDF and running Balloon repeatedly to generate more output like PBKDF2, which I've argued is a bad idea for this reason/because it reduces the parameters.

Consider changing this to: previous = buffer[spaceCost - 1] for t = 0 to timeCost - 1 for m = 0 to spaceCost - 1 And add previous = buffer[m] to the end of the loop just after buffer[m] is set. A lot of implementations copied the paper verbatim on this or used an if statement. Also you can just return previous.

Good spot. I'll rearrange this.

This should be more like the academic code. Where it's a bit stream and you grab what you need from it. https://github.com/henrycg/balloon/blob/35d7da79f42b4f9078ba6f39e1e0a5a37bafa4c1/libballoon/hash_state.c#L167 Since spaceCost is a power of 2 you could grab the exact number of bits needed which will save calculations.

I'm not sure what you mean here. It's written as pseudorandom = Hash(LE64(counter++) || salt) because hash APIs output a full hash.

Also reconsider including the salt because PAKEs with an OPRF will have a salt that includes the password (oprfSalt = hash(password) ** serverSalt). Maybe something like Hash(LE64(counter++) || LE64(spaceCost) || LE64(timeCost) || LE64(parallelism) || LE64(iteration)).

That's an interesting point. That was just from the paper.

Also the paper's version is Hash(LE64(counter++) || salt || LE64(iteration)). Unless you want the access pattern to be the same across all threads thus enabling SIMD and better bandwidth utilization (but there's no mention of this).

iteration was added recently rather than having domain separation in the salt, and I managed to miss that.

Add: x % spaceCost can be done with a bit mask x & (spaceCost - 1) because spaceCost is a power of 2. Instead of Ceiling(length / HASH_LEN) one can do Floor((length + HASH_LEN - 1) / HASH_LEN) or (length + HASH_LEN - 1) / HASH_LEN where / is integer divide. (Side note: I've seen people include math.h and convert to and from floating points.)

I'll add those.

See https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/issues/2#issuecomment-2177344918 Note the current draft is 3 H/block for an attacker which is the same for the "academic code" minimum settings.

Thanks for investigating that.

OPRF salts are "variable" length based on elliptic curve or finite field prime. Also see https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/issues/7#issuecomment-2177358941 about "extra salts". Also for password hashing this can be as low at like 32 bits unless you're Facebook scale then 40-48 bits. It's a whole thing about max collisions being relatively low and most are unique. BUT obviously there's not much difference in storage space than 128 bit salts.

Yeah, I haven't thought about PAKEs at all. They're not something I've looked into, and it doesn't sound like they're frequently used. The 128 bits recommendation is following the Argon2 RFC.

For password hashing, anything at least 80 bits is fine for anything properly key stretched but NIST would probably want at least 112 bits. BUT obviously there's not much difference in storage space than 128 bit hash. For key derivation, NIST recommends at least 112 bit (because of 3DES).

The password hash length recommendation could be changed to 128 bits. I'd prefer to leave the key derivation recommendation as is. These are only recommendations, not MUSTs.

Also this contradicts at least 128 bits and at most 256 bits from:

I don't think that does contradict anything because it's referring to the salt, which I've recommended 128 or 256 bits for before. SHOULD is equivalent to RECOMMENDED, and SHOULD NOT is equivalent to NOT RECOMMENDED, so they're not absolutes again.

Shouldn't that be $balloon-sha256$ or is it blake2b-512?

Yep, that hash needs to be updated. It was a SHA-256 example because the Rust implementation, which the draft was originally based on, used SHA-256 test vectors.

"Similarly, systems MUST check for overly large user-specified parameters (e.g. passwords) to prevent denial-of-service attacks." That's only for poorly designed algorithms which Balloon should not be. Currently it hashes the password and salt parallelism+1 times. It would be better if it was just once or twice.

Yes, I see that now. That's what people have meant about prehashing the password. The password will only be hashed once when the algorithm is switched to being keyed.

"authenticated encryption with associated data (AEAD) scheme" No, authenticated is not needed since it's a password hash that can be replaced. It's more important to be deterministically encrypted. In fact this is the one time ECB is safe to use unless you want to store more than one blocks worth of hash. In that case use NULL-IV-CBC-CTS. This way you can have an HSM that only encrypts and the server can compare encrypted outputs. I just realized this doesn't mention that the only thing that needs to be encrypted is the hash part. Also it should be decoded into binary data before encrypted.

I don't know how this is done in practice, but I've said AEAD schemes because they're more commonly used in general for encryption. I first heard about this idea from password_lock, which claims to do authenticated encryption and I think encrypts a string. I can mention that only the hash needs to be encrypted.

NIST would probably say at least 112 bits and at most 256 bits.

This is only a recommendation, and if NIST wanted to approve this, they could make their own recommendations. I don't want to be recommending anything below 128 bits.

It's more of a bandwidth vs memory transactions thing with GPUs. GPUs have wide memory buses. 128-384 bits for GeForce RTX 4000 series and Radeon RX 7000 series and historically up to 512 bits of GDDR. Other GPUs with HBM (vs GDDR) have bus widths of 1024 to 5120 bits.

Yeah, I won't pretend to understand cache hardness. It could do with being properly defined and explained somewhere in the literature. It sounds like you need an understanding of GPU architecture.

Test Vector 4 has an invalid spaceCost of 3.

The test vectors are from the old Rust implementation so are out of date. They'll be updated when the design is finalised, and it would be good to have test vectors for all the major hash functions, although that will be a pain.

This will make it slower and it's already slow. I guess this will only make it 33% slower (3 vs 4 H/block). Assuming cached HMAC.

Ah this won't make it slower. Assuming the implementation caches the state after the key is added. Unlike https://github.com/advisories/GHSA-4c42-4rxm-x6qf, PBKDF2 DoS with long password. Besides doing twice the work for normal sized passwords (4iterations vs 2iterations+2).

The idea is to allow the use of HMAC because that's probably what NIST wants whilst allowing everyone else to use an ordinary hash function/non-approved algorithm instead (e.g., BLAKE2). It's not ideal though because of things like keyed BLAKE2/BLAKE3 vs prefix MAC for BLAKE2/BLAKE3, which may cause confusion/interoperability issues.

The plan is to do key = PRF(emptyKey, password || salt || password.Length || salt.Length) and then use that key as the key everywhere else, except when computing the pseudorandom bytes since that should be password-independent. That can use an empty key again. This isn't ideal with prefix MAC but is like HKDF-Extract with HMAC.

This would allow easily supporting a pepper because that initial emptyKey could be the pepper, and it would be restricted to something like 512 bits to avoid HMAC key hashing, which was a HMAC design flaw.

Replace "integer" with "float" or "floating point". Or remove "the integer".

How about the number x?

Yeah that's fine.

Consider removing LE64(length). For a KDF, let's say someone uses the exact output from Balloon then later realizes they need another key while preserving the original key. Then they would have to run Balloon twice, but if you remove LE64(length) then they can just ask for more output.

I get you. This was inspired by Argon2's variable-length hash function.

PBKDF2 doesn't do this. Which means Balloon doesn't need to do this. And it's better if it doesn't.

Also in the Security Considerations section it should be stated to only run the password KDF once and derive all the keys needed from that output. Since an attacker will run which ever password KDF that will give them a password check (e.g. encryptionKey = Balloon(...); macKey = Balloon(...); and only do macKey = Balloon(...); checkMac(macKey, mac, data)).

I agree. This is actually something I've been discussing with Henry because he proposed getting rid of the KDF and running Balloon repeatedly to generate more output like PBKDF2, which I've argued is a bad idea for this reason/because it reduces the parameters.

That is a bad idea. PBKDF2's footgun exists because they wanted to do the least amount of changes/effort to make PBKDF output variable length.

Microsoft used PBKDF2-HMAC-SHA1(pw, salt, iterations:4096, outLen:256 bits) as a hash because "it's good enough for WiFi". But an attacker only needs to calculate the first 160 bits. Doing half the work as a defender.

1Password used key || iv = PBKDF2-HMAC-SHA256(pw, salt, iterations:100000, outLen:384 bits) to decrypt a 256 bit key in CBC mode. Again an attacker only needs the first 256 bits and decrypt the padding and compare to "\x10\x10...\x10". Doing half the work as a defender.

I'm sure there are several others.

This should be more like the academic code. Where it's a bit stream and you grab what you need from it. https://github.com/henrycg/balloon/blob/35d7da79f42b4f9078ba6f39e1e0a5a37bafa4c1/libballoon/hash_state.c#L167 Since spaceCost is a power of 2 you could grab the exact number of bits needed which will save calculations.

I'm not sure what you mean here. It's written as pseudorandom = Hash(LE64(counter++) || salt) because hash APIs output a full hash.

Oh the counter is the same... OK counter = Ceiling(3 * spaceCostLog2 * 2^spaceCostLog2 * timeCost / HASH_BIT_LEN) instead of 0. Or have different counters maybe H("0" || LE64(counter0++) || ...) and H("1" || LE64(counter1++) || ...). And give a description or an example like pseudorandomStream={0x21, 0x43, 0x65, 0x87, ...} and pseudorandomStream.popBitsLE(12) returns 0x321 and the next call returns 0x654.

counter = Ceiling(((3 * spaceCostLog2 * timeCost) << spaceCostLog2) / HASH_BIT_LEN)
pseudorandomStream =
    PRF(ZeroPad(emptyKey, HASH_LEN), LE64(0) || LE64(iteration) || salt) ||
    PRF(ZeroPad(emptyKey, HASH_LEN), LE64(1) || LE64(iteration) || salt) ||
    PRF(ZeroPad(emptyKey, HASH_LEN), LE64(2) || LE64(iteration) || salt) ||
    ...
    PRF(ZeroPad(emptyKey, HASH_LEN), LE64(counter-1) || LE64(iteration) || salt)

buffer[0] = PRF(key, LE64(counter++) || LE64(spaceCost) || LE64(timeCost) || LE64(parallelism) || LE64(length) || LE64(iteration))
for m = 1 to spaceCost - 1
    buffer[m] = PRF(key, LE64(counter++) || buffer[m - 1])

emptyKey = ByteArray(0)
previous = buffer[spaceCost - 1]
for t = 0 to timeCost - 1
    for m = 0 to spaceCost - 1
        other1 = pseudorandomStream.popBitsLE(spaceCostLog2)
        other2 = pseudorandomStream.popBitsLE(spaceCostLog2)
        other3 = pseudorandomStream.popBitsLE(spaceCostLog2)
        buffer[m] = PRF(key, LE64(counter++) || previous || buffer[m] || buffer[other1] || buffer[other2] || buffer[other3])
        previous = buffer[m]

Also reconsider including the salt because PAKEs with an OPRF will have a salt that includes the password (oprfSalt = hash(password) ** serverSalt). Maybe something like Hash(LE64(counter++) || LE64(spaceCost) || LE64(timeCost) || LE64(parallelism) || LE64(iteration)).

That's an interesting point. That was just from the paper.

I remember there being a conversation on the PHC mailing list about whether salt should be included for secret independent reads. I remember at least one submission that did this. I'm pretty sure it was determined to be better not to include salt. Also Argon2i doesn't include the salt.

OPRF salts are "variable" length based on elliptic curve or finite field prime. Also see https://github.com/samuel-lucas6/draft-lucas-balloon-hashing/issues/7#issuecomment-2177358941 about "extra salts". Also for password hashing this can be as low at like 32 bits unless you're Facebook scale then 40-48 bits. It's a whole thing about max collisions being relatively low and most are unique. BUT obviously there's not much difference in storage space than 128 bit salts.

Yeah, I haven't thought about PAKEs at all. They're not something I've looked into, and it doesn't sound like they're frequently used. The 128 bits recommendation is following the Argon2 RFC.

WhatsApp, and 1Password use PAKEs. I think also Apple and Signal. The RFC recommends 128 bit salts, but they are very lax on requirements 8 to 2^(32)-1 bytes:

"Nonce S, which is a salt for password hashing applications. It MUST have a length not greater than 2^(32)-1 bytes. 16 bytes is RECOMMENDED for password hashing. The salt SHOULD be unique for each password." https://www.rfc-editor.org/rfc/rfc9106.html#section-3.1-2.2

"Select the salt length. A length of 128 bits is sufficient for all applications but can be reduced to 64 bits in the case of space constraints." https://www.rfc-editor.org/rfc/rfc9106.html#section-4-6.7

Also this contradicts at least 128 bits and at most 256 bits from:

I don't think that does contradict anything because it's referring to the salt, which I've recommended 128 or 256 bits for before. SHOULD is equivalent to RECOMMENDED, and SHOULD NOT is equivalent to NOT RECOMMENDED, so they're not absolutes again.

Yeah I read that wrong.

This will make it slower and it's already slow. I guess this will only make it 33% slower (3 vs 4 H/block). Assuming cached HMAC.

Ah this won't make it slower. Assuming the implementation caches the state after the key is added. Unlike https://github.com/advisories/GHSA-4c42-4rxm-x6qf, PBKDF2 DoS with long password. Besides doing twice the work for normal sized passwords (4iterations vs 2iterations+2).

The idea is to allow the use of HMAC because that's probably what NIST wants whilst allowing everyone else to use an ordinary hash function/non-approved algorithm instead (e.g., BLAKE2). It's not ideal though because of things like keyed BLAKE2/BLAKE3 vs prefix MAC for BLAKE2/BLAKE3, which may cause confusion/interoperability issues.

The plan is to do key = PRF(emptyKey, password || salt || password.Length || salt.Length) and then use that key as the key everywhere else, except when computing the pseudorandom bytes since that should be password-independent. That can use an empty key again. This isn't ideal with prefix MAC but is like HKDF-Extract with HMAC.

This would allow easily supporting a pepper because that initial emptyKey could be the pepper, and it would be restricted to something like 512 bits to avoid HMAC key hashing, which was a HMAC design flaw.

Prefix MAC for BLAKE2/BLAKE3 is bad because it doesn't do the block operation until there's more than a block of data. Since the last block can be full. I feel like I'm saying this weirdly. The last block has a flag that states it's the last block and it can be full and no data is added for padding. Thus giving it a full block after initialization it will wait for finish to be called or more data before it mixes the block. So if it was PRF(key, BE64(counter++) || previous || buffer[m] || buffer[other1] || buffer[other2] || buffer[other3]) then one could blake2b_update(ctx, ZeroPad(key, 128)), blake2b_update(ctx, "\0"), make a copy of ctx, blake2b_update(ctx, BE56(counter++))... Or if it was PRF(key, "1" || LE64(counter1++) || previous || buffer[m] || buffer[other1] || buffer[other2] || buffer[other3]) (using two counters like above).

It would be best to state that Balloon with prefix MAC is invalid for BLAKE2 and BLAKE3. And that the only valid method is using the built in MAC for BLAKE2 and BLAKE3.

Sorry for my slow replies, got a lot going on at the moment.

PBKDF2 doesn't do this. Which means Balloon doesn't need to do this. And it's better if it doesn't.

Yeah, I'll try to remove this at the weekend.

That is a bad idea. PBKDF2's footgun exists because they wanted to do the least amount of changes/effort to make PBKDF output variable length.

Thanks for the examples. I was aware of 1Password having a problem.

However, it turns out that I've misunderstood what Henry meant. His proposal is actually to grab more bytes/blocks from the final mixed buffer, meaning you don't have to do any extra hashing (no KDF and no rerunning the algorithm).

Of course, the obvious limitation is that the output length is now limited by the space cost. To get around that, you end up rerunning the Mix phase, which introduces this PBKDF2 footgun again and complicates the code. Another issue might be with collision resistance when parallelism is used.

I understand the appeal, but I think using a KDF is cleaner. I'm interested to know your thoughts.

Oh the counter is the same... OK counter = Ceiling(3 spaceCostLog2 2^spaceCostLog2 * timeCost / HASH_BIT_LEN) instead of 0. Or have different counters maybe H("0" || LE64(counter0++) || ...) and H("1" || LE64(counter1++) || ...). And give a description or an example like pseudorandomStream={0x21, 0x43, 0x65, 0x87, ...} and pseudorandomStream.popBitsLE(12) returns 0x321 and the next call returns 0x654.

I'm a bit confused by your last sentence, and is the purpose of using separate counters to allow easier precomputation?

I remember there being a conversation on the PHC mailing list about whether salt should be included for secret independent reads. I remember at least one submission that did this. I'm pretty sure it was determined to be better not to include salt. Also Argon2i doesn't include the salt.

I'll try to find that.

WhatsApp, and 1Password use PAKEs. I think also Apple and Signal. The RFC recommends 128 bit salts, but they are very lax on requirements 8 to 2^(32)-1 bytes:

Yeah, I can imagine some of the bigger companies using them/them being used more in the future.

And yes, I could add a comment about smaller salts being acceptable, but the wording doesn't prohibit any size of salt from being used currently. It can even be empty because that's what the scrypt and Argon2 RFCs seem to allow.

It would be best to state that Balloon with prefix MAC is invalid for BLAKE2 and BLAKE3. And that the only valid method is using the built in MAC for BLAKE2 and BLAKE3.

I've been planning to make it a bit clearer. I'll say something like If the hash function supports a key parameter (e.g. BLAKE2 [RFC]), it MUST be used. Otherwise, if the hash function does not have a key parameter, prefix MAC MUST be used....

Sorry for my slow replies, got a lot going on at the moment.

Don't worry about.

That is a bad idea. PBKDF2's footgun exists because they wanted to do the least amount of changes/effort to make PBKDF output variable length.

Thanks for the examples. I was aware of 1Password having a problem.

However, it turns out that I've misunderstood what Henry meant. His proposal is actually to grab more bytes/blocks from the final mixed buffer, meaning you don't have to do any extra hashing (no KDF and no rerunning the algorithm).

Of course, the obvious limitation is that the output length is now limited by the space cost. To get around that, you end up rerunning the Mix phase, which introduces this PBKDF2 footgun again and complicates the code. Another issue might be with collision resistance when parallelism is used.

I understand the appeal, but I think using a KDF is cleaner. I'm interested to know your thoughts.

Definitely KDF. If you output the first hash from the memory an attacker can skip the last iteration. Also currently it's 4*t*m*p hash blocks to do the main work and HKDF is 2*ceiling(outputLength/hashLength)+6 hash blocks. Getting 4*t*m*p down to ~3.2*t*m*p will benefit much more. Using a pseudorandom stream will remove about 0.8*t*m*p hash blocks.

Oh the counter is the same... OK counter = Ceiling(3 spaceCostLog2 2^spaceCostLog2 * timeCost / HASH_BIT_LEN) instead of 0. Or have different counters maybe H("0" || LE64(counter0++) || ...) and H("1" || LE64(counter1++) || ...). And give a description or an example like pseudorandomStream={0x21, 0x43, 0x65, 0x87, ...} and pseudorandomStream.popBitsLE(12) returns 0x321 and the next call returns 0x654.

I'm a bit confused by your last sentence

pseudorandomStream={0x21, 0x43, 0x65, 0x87, ...} is defining the example byte data in pseudorandomStream and pseudorandomStream.popBitsLE(12) gets 12 bits from pseudorandomStream which is 0x321 == (0x21 | (0x43 << 8)) & ((1 << 12) - 1). Now pseudorandomStream={0x4 (4 bits), 0x65, 0x87, ...} and the next pseudorandomStream.popBitsLE(12) call returns 0x654 == (0x4 | (0x65 << 4)) & ((1 << 12) - 1).

and is the purpose of using separate counters to allow easier precomputation?

No, I was trying to keep the counters from overlapping or having explicit domain separation with "0" and "1".

X = Ceiling(3 * spaceCostLog2 * 2^spaceCostLog2 * timeCost / HASH_BIT_LEN)

If you start the pseudorandom stream counter at 0 it will end at X-1. Starting the other counter at X will make sure they don't overlap. You could just start both at zero because you're hashing two different fixed length strings (1, 64 bit int and 4, 32 bit ints vs 1, 64 bit int and 5 hashes).

samuel-lucas6 / draft-lucas-balloon-hashing

Fixes and suggestions #14