Provide generic parameters/safe minimums

The best way to do this is count the amount of hash block calculations and compare to PBKDF2 because a GPU attacker will be limited by compute instead of bandwidth. Balloon needs to use >1 MiB to start slowing down GPUs. >3 MiB will have a near linear slowdown.

I'm going to use "H/block" to mean "hash block calculations per block of memory". This is how many hash blocks are calculated to output a hash. Depending on implementation and settings (I'll use "delta" or "neighbors" = 3). This is usually 3-17 H/block. Some bad implementations depend on salt length and worse depend on secret data and its length. Defender costs are paper (14 H/block), academic code (3 H/block + 0.5 AES/block), Rust's (14-17+ H/block (depends on salt and secret data lengths)), Python (14-17+ H/block (depends on salt length)). Attacker costs are paper (8 H/block), academic code (3 H/block), Rust's (8 H/block), Python (8 H/block).

Rough settings for <10 KH/s/GPU (GPU being an RTX 4080 Super or "2/3 of an RTX 4090" which should be within a few percent of each other): Balloon-SHA-256: Academic code: m=256 KiB, t=48; m=512 KiB, t=24; m=1 MiB, t=12; m=2 MiB, t=3-6? Paper/Rust's/Python: m=256 KiB, t=18; m=512 KiB, t=9; m=1 MiB, t=5; m=2 MiB, t=2-3?

Balloon-SHA-512: Academic code: m=256 KiB, t=34; m=512 KiB, t=17; m=1 MiB, t=9; m=2 MiB, t=2-4? Paper/Rust's/Python: m=256 KiB, t=13; m=512 KiB, t=7; m=1 MiB, t=4; m=2 MiB, t=1-2?

I looked into what the best output size of SHAKE128 for max speed: 248, 520, 792, 1056, and 1328 are 1, 2, 3, 4, and 5 H/block. Best is SHAKE128 with 1328 bits of output but it's not a multiple of a cache line so 512 (2 H/block) or 1024 (4 H/block) would be better. But this puts it at about even with SHA-512 for the attacker. Now the only question is which is faster for the defender. SHAKE128-1024 might be faster even if it's slower than SHA-512 because of the larger block size. Also if the same hash function is used for the deterministic random lookups then SHAKE128-1024 will have an advantage when grabbing 64 bit at a time vs big int math. Note the "academic code" is SHA-256 for the hash and AES-128 for the deterministic random lookups. I just remembered that SHAKE128 has an internal parallelism of 5. Which means the slowdown for an attacker doesn't start until >5 MiB but to utilize that there's a performance hit. Yeah... I'll need to implement it to see where it lands. Not that I'm going to do that any time soon.

samuel-lucas6 / draft-lucas-balloon-hashing

Provide generic parameters/safe minimums #2