openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

Option to specify GPU usage intensity #4910

Open solardiz opened 2 years ago

solardiz commented 2 years ago

Our formats typically use 200 milliseconds as the maximum OpenCL kernel duration for auto-tuning, passing this number into autotune_run. Perhaps we can have autotune_run adjust this value based on a command-line option, e.g. --intensity=2 would double it, making most kernels tune to up to 400 ms. We can also accept e.g. --intensity=0 or --intensity=0.5 for halving the maximum kernel duration.

This is probably similar to what hashcat does with its -w option.

Right now, we can adjust this with LWS / GWS env vars or --lws / --gws command-line options, but these are trickier to use and their optimal values differ per-device and per-format. I think we need to provide an easier-to-use option.
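
As a sketch only of how such a factor could be applied (the helper and parameter names below are invented for illustration and are not existing code):

```c
/*
 * Sketch: scale a format's default maximum kernel duration (e.g. the usual
 * 200 ms) by a user-supplied --intensity factor before handing the result
 * to autotune_run().  "opencl_intensity" and this helper are made-up names.
 */
static unsigned int scaled_max_duration(unsigned int format_default_ms,
                                        double opencl_intensity)
{
	unsigned int ms;

	if (opencl_intensity <= 0)
		opencl_intensity = 1.0;    /* unset or invalid: keep the default */

	ms = (unsigned int)(format_default_ms * opencl_intensity + 0.5);

	return ms ? ms : 1;    /* never tune with a zero-millisecond budget */
}
```

A format would then pass, say, scaled_max_duration(200, intensity) where it currently passes the hard-coded 200.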

For a specific example (with the changes I'm about to commit for the shared SHA-2 code), bitcoin-opencl achieves this on Vega 64 by default:

LWS=256 GWS=16384 (64 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    4369 c/s real, 327680 c/s virtual

but with the limit increased to 400 ms it's this:

LWS=256 GWS=131072 (512 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    4691 c/s real, 3276K c/s virtual

I think it'd be intuitive for a user to request --intensity=2 (or higher) for all (maybe different) GPUs on a headless system. It's less intuitive to guess there's improvement by manually doubling GWS a few times, and the resulting values would need to be managed per-GPU.

In the specific example above, I was wondering whether I should lower HASH_LOOPS, but somehow lowering it from 2000 to 500 didn't result in a similar increase in GWS; instead, it resulted in a lower LWS and lower speed (with the limit at 200 ms):

LWS=64 GWS=16384 (256 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    4300 c/s real, 96376 c/s virtual

So it does appear that we need to allow more than 200 ms sometimes.

magnumripper commented 2 years ago

There's also a config option Global_MaxDuration which will override a format's default. Setting that (to e.g. 400) would not work very well for formats with a single slow kernel though (they typically have a figure in the thousands as input to autotune_run()).

magnumripper commented 2 years ago

There's also a config option Global_MaxDuration which will override a format's default

Oh, and there are format-specific versions of the above, using the format name. This happens automagically for any format that uses our shared OpenCL stuff. So for bitcoin-opencl it's Bitcoin_MaxDuration.
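
For reference, a sketch of how these settings might look in john.conf; the option names are the ones mentioned in this thread, but the section name here is an assumption, so check the john.conf shipped with your build for the exact placement:

```
# Assumed section name -- verify against the shipped john.conf
[Options:OpenCL]
# Override every OpenCL format's default maximum kernel duration, in ms
Global_MaxDuration = 400
# Format-specific override; takes precedence for bitcoin-opencl only
Bitcoin_MaxDuration = 400
```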

I just verified that it still works and it does - except for the first autotune step, where the duration is normally halved: if I set Bitcoin_MaxDuration to 50, it'll be 50 in both the first and the last autotune step. Can't see why, but I'll open an issue for it.

magnumripper commented 2 years ago

Hashcat's long option for -w is --workload-profile; perhaps we should go with that? I don't care which.

Also,

I just verified that it still works and it does - except for the first autotune step

There's more to it, see #4916.

solardiz commented 2 years ago

Turns out the issue with bitcoin-opencl on Vega 64 is mostly a different one. Observe:

gws:      8192     2574 c/s   515984040 rounds/s     3.181 s per crypt_all()+
xfer: 165.038 us, init: 138.520 us, loop: 100x34.538 ms, final: 128.148 us, xfer: 7.556 us
gws:     16384     4732 c/s   948576720 rounds/s     3.462 s per crypt_all()+
xfer: 326.962 us, init: 247.849 us, loop: 100x69.943 ms, final: 223.852 us, xfer: 13.926 us
gws:     32768     4673 c/s   936749580 rounds/s     7.011 s per crypt_all()
xfer: 650.074 us, init: 475.408 us, loop: 100x137.542 ms, final: 479.406 us, xfer: 84.704 us
gws:     65536     4753 c/s   952786380 rounds/s    13.787 s per crypt_all()
xfer: 1.359 ms, init: 923.408 us, loop: 100x273.538 ms (exceeds 200 ms)
xfer: 84.446 us, init: 147.702 us, loop: 100x31.931 ms, final: 83.408 us, xfer: 4.148 us
gws:      8192     2559 c/s   512977140 rounds/s     3.200 s per crypt_all()-
LWS=256 GWS=16384 (64 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    4412 c/s real, 446836 c/s virtual

Notice how during auto-tuning it had much better speed for GWS=16384 than it ended up having in the end. My guess is the GPU's clock rate dropped (to maintain TDP) moments after it managed to run at almost full performance at GWS=16384. Then higher GWS shows almost the same performance (sometimes reaching our required 1%+ improvement, sometimes not), although at the same clock rate higher GWS would probably show a significant improvement. That significant difference is seen when manually forcing higher GWS, or sometimes (still not reliably) when I allow 400 ms (then a 1%+ improvement is sometimes seen during auto-tuning at GWS=131072, which translates into a 7% improvement vs. a separate GWS=16384 benchmark).

I don't see an easy fix for this. We do already try one step back from the seemingly best GWS (that's the final test of GWS=8192 above). Should we also do a similar forward re-test if the performance at the previously seen best GWS drops substantially compared to when it was chosen as best?
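
As a sketch of that forward re-test idea (bench_gws(), the 10% slowdown trigger, and the doubling step are all invented here for illustration, not actual autotune internals):

```c
#include <stddef.h>

/* Hypothetical helper: benchmark a candidate GWS and return its c/s. */
extern double bench_gws(size_t gws);

/*
 * Sketch only: after the usual scan picked best_gws, re-check it.  If the
 * GPU has since down-clocked (e.g. to stay within TDP), the earlier
 * comparisons used an inflated baseline, so give the next larger GWS one
 * more chance against the now-throttled speed.
 */
static size_t forward_retest(size_t best_gws, double speed_when_chosen,
                             double required_improvement)
{
	double recheck = bench_gws(best_gws);

	if (recheck < 0.9 * speed_when_chosen) {
		size_t next = best_gws * 2;

		if (bench_gws(next) >= recheck * (1.0 + required_improvement))
			return next;
	}

	return best_gws;
}
```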

Hashcat's long option for -w is --workload-profile; perhaps we should go with that? I don't care which.

I vaguely recall this being called "intensity" in some popular program years back, maybe in older hashcat? To me, the word "intensity" makes more sense. Also, if we introduce any option starting with w, we wouldn't be able to specify the shortcut -w for wordlist mode. And we've already lost -i for incremental mode, so nothing more to lose there.

solardiz commented 2 years ago

I think increasing the intensity should not only increase the maximum kernel duration, but also lower the speedup threshold for accepting a GWS increase (the 1%, etc. speedup that we currently require).
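
One possible shape for that, as a sketch only (the function and the exact scaling are assumptions; the 1% base matches the figure mentioned above):

```c
/*
 * Sketch: shrink the speedup required to accept a larger GWS as the
 * intensity grows, so e.g. --intensity=2 would accept ~0.5% gains.
 */
static double gws_accept_threshold(double intensity)
{
	if (intensity <= 0)
		intensity = 1.0;

	return 0.01 / intensity;    /* base: the 1% improvement required today */
}
```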

magnumripper commented 2 years ago

That sounds sensible to me too.