samuel-lucas6 / draft-lucas-bkdf

An Internet-Draft for the Balloon Key Derivation Function (BKDF), a memory-hard password hashing and password-based key derivation function.
Other
4 stars 1 forks source link

Provide generic parameters/safe minimums #2

Closed samuel-lucas6 closed 2 months ago

samuel-lucas6 commented 8 months ago

OWASP, TobTu, the Argon2 RFC, etc provide recommended parameters. It would be good to do the same assuming they are realistic, but there's no guidance on parameters from the authors of Balloon. Furthermore, Balloon hasn't been benchmarked properly with different hash functions and parameters to determine suitable recommendations.

Sc00bz commented 2 months ago

The best way to do this is count the amount of hash block calculations and compare to PBKDF2 because a GPU attacker will be limited by compute instead of bandwidth. Balloon needs to use >1 MiB to start slowing down GPUs. >3 MiB will have a near linear slowdown.

I'm going to use "H/block" to mean "hash block calculations per block of memory". This is how many hash blocks are calculated to output a hash. Depending on implementation and settings (I'll use "delta" or "neighbors" = 3). This is usually 3-17 H/block. Some bad implementations depend on salt length and worse depend on secret data and its length. Defender costs are paper (14 H/block), academic code (3 H/block + 0.5 AES/block), Rust's (14-17+ H/block (depends on salt and secret data lengths)), Python (14-17+ H/block (depends on salt length)). Attacker costs are paper (8 H/block), academic code (3 H/block), Rust's (8 H/block), Python (8 H/block).

Rough settings for <10 KH/s/GPU (GPU being an RTX 4080 Super or "2/3 of an RTX 4090" which should be within a few percent of each other): Balloon-SHA-256: Academic code: m=256 KiB, t=48; m=512 KiB, t=24; m=1 MiB, t=12; m=2 MiB, t=3-6? Paper/Rust's/Python: m=256 KiB, t=18; m=512 KiB, t=9; m=1 MiB, t=5; m=2 MiB, t=2-3?

Balloon-SHA-512: Academic code: m=256 KiB, t=34; m=512 KiB, t=17; m=1 MiB, t=9; m=2 MiB, t=2-4? Paper/Rust's/Python: m=256 KiB, t=13; m=512 KiB, t=7; m=1 MiB, t=4; m=2 MiB, t=1-2?

I looked into what the best output size of SHAKE128 for max speed: 248, 520, 792, 1056, and 1328 are 1, 2, 3, 4, and 5 H/block. Best is SHAKE128 with 1328 bits of output but it's not a multiple of a cache line so 512 (2 H/block) or 1024 (4 H/block) would be better. But this puts it at about even with SHA-512 for the attacker. Now the only question is which is faster for the defender. SHAKE128-1024 might be faster even if it's slower than SHA-512 because of the larger block size. Also if the same hash function is used for the deterministic random lookups then SHAKE128-1024 will have an advantage when grabbing 64 bit at a time vs big int math. Note the "academic code" is SHA-256 for the hash and AES-128 for the deterministic random lookups. I just remembered that SHAKE128 has an internal parallelism of 5. Which means the slowdown for an attacker doesn't start until >5 MiB but to utilize that there's a performance hit. Yeah... I'll need to implement it to see where it lands. Not that I'm going to do that any time soon.

samuel-lucas6 commented 2 months ago

After #16, are the academic code recommendations still ok? I've gone for the higher t= where you've put a question mark.

Sc00bz commented 2 months ago

With prefix MAC, there's no change. With HMAC, attacker costs will go from 3 H/block to 4 H/block. These formulas and settings are just until a public optimized GPU cracker is released and benchmarked on an RTX 4080 Super and/or an RTX 4090. Also this assumes that the GPU bottleneck is computational vs bandwidth. Which is true for 3 H/block, SHA2, and current gen GPUs.

TL;DR (oh and the chart at the end):

timeCost > (rawHashSpeed / 10000 / parallelism - spaceCost) / (attackerCost * spaceCost)

SHA-256: timeCost > (1183275.4267 / parallelism - spaceCost) / (attackerCost * spaceCost)
SHA-512: timeCost > (416536.12 / parallelism - spaceCost) / (attackerCost * spaceCost)

Hashes calculated given settings (you can assume attackerCostFill is 1. This only fails with weird hashes like SHAKE128 with output of more than 1272 bits):

attackerCostFill = 1 H/block

cost = (attackerCost * spaceCost * timeCost + attackerCostFill * spaceCost) * parallelism

GPU cost target for minimum settings "<10 kH/s/GPU" (note this "H" is password hashes vs raw hashes). Speeds are "2/3 of an RTX 4090" from PBKDF2-HMAC-SHA256 and PBKDF2-HMAC-SHA512 with 1000 iterations. Multiply by 2002 to get rawHashSpeed. Note using the raw hash benchmarks will give higher speeds because it can skip the first 1-ish and last 3-ish rounds. Also some of the message schedule calculations can be saved. Some of this is done for PBKDF2. It's just that it can only cheat on 3 out of the 2002 hashes. Note "2/3 of an RTX 4090" should be within a few percent of an RTX 4080 Super.

rawHashSpeed = 11,832,754,267 H/s (SHA-256*)
rawHashSpeed = 4,165,361,200 H/s (SHA-512*)

cost > rawHashSpeed / 10000

Together and solve for timeCost:

(attackerCost * spaceCost * timeCost + spaceCost) * parallelism > rawHashSpeed / 10000

timeCost > (rawHashSpeed / 10000 / parallelism - spaceCost) / (attackerCost * spaceCost)

SHA-256: timeCost > (1183275.4267 / parallelism - spaceCost) / (attackerCost * spaceCost)
SHA-512: timeCost > (416536.12 / parallelism - spaceCost) / (attackerCost * spaceCost)

timeCost given attackerCost, memory size, and parallelism=1:

SHA-256
   | 256 KiB | 512 KiB | 1 MiB | 2 MiB
---+---------+---------+-------+-------
 3 |    48   |    24   |   12  |    6
 4 |    36   |    18   |    9  |    5
 8 |    18   |     9   |    5  |    3

SHA-512
   | 256 KiB | 512 KiB | 1 MiB | 2 MiB
---+---------+---------+-------+-------
 3 |    34   |    17   |    9  |    4
 4 |    26   |    13   |    7  |    3
 8 |    13   |     7   |    4  |    2
samuel-lucas6 commented 2 months ago

Thanks for those formulas and figures.

I'm going to use "H/block" to mean "hash block calculations per block of memory". This is how many hash blocks are calculated to output a hash.

I'm getting confused by this definition. For the BalloonCore function, there's 1 hash function call for the Expand phase and 2 for the Mix phase per Balloon block of memory.

With prefix MAC, the key is getting padded to the block size, meaning each hash function call is actually 2 internal blocks. With HMAC, there are 2 calls to the hash function, each doing 2 internal blocks because the key is again padded to the block size.

So, where is the 3 vs 4 H/block distinction coming from? Is it to do with caching the key? But then prefix MAC should also be 4 because the key takes up its own block to induce a random IV.

Sc00bz commented 2 months ago

Is it to do with caching the key?

Yes. These numbers are what an attacker needs to do. A defender can do the same but it's not necessary. Although this will cause the defender to waste power and time. Or worse like with CVE-2013-1443. It was a PBKDF2 implementation that didn't use cached HMAC and needed to do 4+ blocks every iteration instead of 2+ blocks to add the key then 2 blocks every iteration (the CVE was DoS with long passwords because it needed to hash a long password every iteration).


current = H(pad_block(key) || previous || current || rand1 || rand2 || rand3 || LE64(counter++))

pad_block(key) is one block but can be cached thus counts as 0.

previous || current, rand1 || rand2, and rand3 || LE64(counter++) are one block each (SHA-256, SHA-512 or others but not all hashes work this way). Thus 3 blocks.


current = HMAC(key, || previous || current || rand1 || rand2 || rand3 || LE64(counter++))

Which is:

if (len(key) > block_size) key = H(key)
inner   = H((pad_block(key) ^ inner_pad) || previous || current || rand1 || rand2 || rand3 || LE64(counter++))
current = H((pad_block(key) ^ outer_pad) || inner)

(pad_block(key) ^ inner_pad) and (pad_block(key) ^ outer_pad) are one block each but can be cached thus counts as 0.

previous || current, rand1 || rand2, rand3 || LE64(counter++), and inner are one block each. Thus 4 blocks.


Oh right, an attacker doesn't need to calculate the random offsets because 6-12 bytes of bandwidth is much cheaper than a hash calculation. So the attacker caches these too.

samuel-lucas6 commented 2 months ago

Appreciate it. That makes sense. Not sure how I didn't see that; think I was distracted by the non-mixing parts. I'll add the 4 H/block numbers to the document for HMAC-SHA256/HMAC-SHA512.