For a few salted slow hashes, distribute salts across work-items

Problem: when cracking some of the slowest hashes on GPU and there are a lot of different salts, it takes ages for the cracking to advance to the next batch of candidate passwords, or alternatively the efficiency is low if the number of candidate passwords is too small.

Solution: if we simultaneously (in one OpenCL kernel invocation) compute hashes for multiple salts, then we can fully utilize the GPU with fewer candidates at a time. That would improve interactive usability (such as when experimenting with different brief attacks) and efficiency.

In August, I had this brief e-mail exchange with @alainesp:

On Wed, Aug 5, 2020 at 4:05 PM Solar Designer <solar@openwall.com> wrote:

> An ideal password cracker would also support distribution of salts
> across hardware threads, not only of candidate passwords.

Hash Suite supports this for md5crypt and ssha on the GPU.

> But supporting both kinds would complicate its design a lot...

As I recall the design was only slightly more complicated, not a lot. The
code is around this line:

https://github.com/alainesp/HashSuiteDroid/blob/b7e07ac9a4a79c05ed3b167b3f28c261416a964e/Hash_Suite/format_MD5CRYPT.c#L896

Regards,
Alain

This got me thinking: if Hash Suite does this per-format, why can't we do the same without a redesign? I think we actually can. Our crypt_all now receives a pointer to the current salt structure in the database, which has a pointer to the next salt in it, and so on. This lets crypt_all precompute hashes for further salts, and then further calls to crypt_all for those salts can return almost instantly (they merely need to shift the logical view for cmp_* and get_hash* to point to the right salt's set of computed hashes). For this to make sense, max_keys_per_crypt needs to be reduced accordingly to salt count. That might be trickier as the salt count may be reducing as some salts get fully cracked.

Perhaps there are not too many formats for which we'd want to implement this. I think it's primarily sha512crypt-opencl, then maybe also sha256crypt-opencl, phpass-opencl, md5crypt-opencl. Also bcrypt-opencl, but I think that one needs to be reworked first, and it's not as important since it provides less of an advantage over CPUs (often none at all, depending on which CPU vs. which GPU). Once we have (ye)scrypt and Argon2 on GPU, then those too. And that's about it?

We could also want this on CPU+OpenMP with high numbers of threads, but this issue isn't as pressing yet.

Another potential workaround (already recorded on some issue in here? I don't recall) is to optionally distribute groups of salts across forked processes, which would speedup processing of the batches of candidate passwords (which will stay as large as before) due to having fewer salts per process. This isn't as effective as the solution proposed above, but it isn't per-format.

openwall / john

For a few salted slow hashes, distribute salts across work-items #4345