bcrypt-opencl is much slower than should be on NVIDIA Maxwell & Pascal

solardiz commented 7 years ago

Per my testing, our bcrypt-opencl is roughly same speed as hashcat's on Tahiti (which was the primary target when Sayantan worked on our bcrypt-opencl), but is much slower on NVIDIA Maxwell (and per others' testing also on Pascal). These newer NVIDIA GPUs matter for bcrypt-opencl especially because they're the ones where bcrypt-opencl's speeds would finally be reasonable and competitive vs. CPUs.

Titan X Maxwell JtR:

Device 6: GeForce GTX TITAN X
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    4735 c/s real, 4708 c/s virtual

Titan X Maxwell hashcat:

* Device #7: GeForce GTX TITAN X, 3071/12287 MB allocatable, 24MCU
* Device #8: GeForce GTX TITAN, skipped.

Hashtype: bcrypt $2*$, Blowfish (Unix)

Speed.Dev.#7.....:     6679 H/s (56.21ms)

(On a related note, hashcat's device numbering starts with 1 and JtR's with 0 - maybe this is something to adjust as well.)

Others report even bigger differences in favor of hashcat on Pascal cards, but that may be because our MULTIPLIER in opencl_bf_std.h isn't large enough for those bigger GPUs.

I tried tweaking DEFAULT_LWS and MULTIPLIER, but could not achieve much of an improvement on our Titan X Maxwell card in this way alone. Per a quick glance at hashcat, it appears to use LWS=8 (I tried, this made little difference), local memory (not private, and we also do the same on these GPUs), different layout of the S-boxes in local memory than ours (would take some effort to try, but perhaps we should), and slightly different code (ditto). Besides trying things, we could also review the ptx assembly.

Also, hashcat appears to compute the hashes on GPU only (first 128 bits only, that is doing 2 out of 3 blocks 64-bit for the final 64 Blowfish encryptions?), whereas we compute the first 64 bits on GPU and transfer the resulting S-boxes to host in order to allow for computing of the rest from that point on in cmp_exact(). I don't know if that transfer to host is causing us much speed loss - probably it is not, but it's worth trying to comment it out and remove the check in cmp_exact() to see how much faster it could go with the 64-bit check only (which should be sufficient in practice).

Tahiti JtR:

Device 2: Tahiti [AMD Radeon HD 7900 Series]
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    4484 c/s real, 614400 c/s virtual

Tahiti hashcat:

Hashtype: bcrypt $2*$, Blowfish (Unix)

Speed.Dev.#3.....:     4315 H/s (57.65ms)

While at it, we should add FMT_TRUNC to opencl_bf_fmt_plug.c.

magnumripper commented 7 years ago

we compute the first 64 bits on GPU and transfer the resulting S-boxes to host in order to allow for computing of the rest from that point on in cmp_exact(). I don't know if that transfer to host is causing us much speed loss - probably it is not, but it's worth trying to comment it out and remove the check in cmp_exact() to see how much faster it could go with the 64-bit check only (which should be sufficient in practice).

I don't expect it to matter, but that transfer should obviously only happen within cmp_exact() [caveat: multiple calls to cmp_exact() shouldn't trigger multiple transfers of same data] - and then we can still perform the checks.

solardiz commented 7 years ago

transfer should obviously only happen within cmp_exact()

Makes sense. The kernel will still have to copy the data from local to global memory every time just in case, though.

Yet another thought: while for bcrypt hashes it would normally be sufficient to compute and check the first 64 bits only, there's now also bcrypt_pbkdf, which is used in openbsdsoftraid_fmt_plug.c and ssh_ng_fmt_plug.c. We don't currently have OpenCL equivalents of these formats, but perhaps we should add them, and they should mostly or fully reuse the same bcrypt kernel as we use for the hashes. bcrypt_pbkdf encrypts a different constant, and it's 256-bit (rather than 192-bit), and the output bytes are shuffled such that probably all four of the 64-bit Blowfish blocks need to be computed each time (we should double-check this, though). So the kernel should also support using bcrypt_pbkdf's full 256-bit constant and in that mode it should not need to ever export the S-boxes to host, or alternatively it should support skipping even the first 64-bit computation (so that all four blocks are computed on host), but that's worse (less scalable and probably slower with our current synchronous crypt_all() API). Similar changes would be required to reuse bcrypt-ztex's FPGA implementation.

solardiz commented 7 years ago

there's now also bcrypt_pbkdf, which is used in openbsdsoftraid_fmt_plug.c and ssh_ng_fmt_plug.c.

BTW, even on CPU this bcrypt_pbkdf uses its own Blowfish code taken from OpenBSD instead of our interleaved 2x or 3x bcrypt code. So there's room for unification and speedup on CPU as well.

openwall / john

bcrypt-opencl is much slower than should be on NVIDIA Maxwell & Pascal #2620