Open solardiz opened 2 years ago
There's also a config option Global_MaxDuration, which will override a format's default. Setting that (to e.g. 400) would not work very well for formats with a single slow kernel though (they typically have a figure in the thousands as input to autotune_run()).
> There's also a config option Global_MaxDuration, which will override a format's default.
Oh, and there are format-specific versions of the above, using the format name. This happens automagically for any format that uses our shared OpenCL stuff. So for bitcoin-opencl it's Bitcoin_MaxDuration.
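For illustration, the override chain could be sketched like this. The cfg_get_int() accessor, its section names, and the return convention are assumptions made for the sketch, not necessarily the actual config API:

    #include <stdio.h>

    /* Assumed config accessor; taken to return -1 when unset. */
    extern int cfg_get_int(const char *sect, const char *subsect,
                           const char *name);

    /* Hypothetical sketch: prefer <FormatName>_MaxDuration, fall back
     * to Global_MaxDuration, then to the format's own default. */
    static int max_duration_ms(const char *fmt_name, int format_default)
    {
        char key[128];
        int value;

        snprintf(key, sizeof(key), "%s_MaxDuration", fmt_name);
        if ((value = cfg_get_int("Options", "OpenCL", key)) > 0)
            return value;
        if ((value = cfg_get_int("Options", "OpenCL",
                                 "Global_MaxDuration")) > 0)
            return value;
        return format_default;
    }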
I just verified that it still works and it does - except for the first autotune step, where the duration is usually halved: if I set Bitcoin_MaxDuration to 50, it'll be 50 in both the first and last autotune step. Can't see why, but I'll open an issue for it.
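Roughly, the expected first-pass halving versus what an explicit override appears to do (a two-line sketch, variable names hypothetical):

    /* The tuner's first pass normally probes with half the budget,
     * but an explicit Bitcoin_MaxDuration seems to apply unhalved. */
    int first_pass_ms = have_user_override ? max_ms : max_ms / 2;
    int final_pass_ms = max_ms;  /* full budget either way */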
Hashcat's long option for -w is --workload-profile, perhaps we should go with that? I don't care which.
Also,

> I just verified that it still works and it does - except for the first autotune step

There's more to it, see #4916.
Turns out the issue with bitcoin-opencl on Vega 64 is mostly different. Observe:
gws: 8192 2574 c/s 515984040 rounds/s 3.181 s per crypt_all()+
xfer: 165.038 us, init: 138.520 us, loop: 100x34.538 ms, final: 128.148 us, xfer: 7.556 us
gws: 16384 4732 c/s 948576720 rounds/s 3.462 s per crypt_all()+
xfer: 326.962 us, init: 247.849 us, loop: 100x69.943 ms, final: 223.852 us, xfer: 13.926 us
gws: 32768 4673 c/s 936749580 rounds/s 7.011 s per crypt_all()
xfer: 650.074 us, init: 475.408 us, loop: 100x137.542 ms, final: 479.406 us, xfer: 84.704 us
gws: 65536 4753 c/s 952786380 rounds/s 13.787 s per crypt_all()
xfer: 1.359 ms, init: 923.408 us, loop: 100x273.538 ms (exceeds 200 ms)
xfer: 84.446 us, init: 147.702 us, loop: 100x31.931 ms, final: 83.408 us, xfer: 4.148 us
gws: 8192 2559 c/s 512977140 rounds/s 3.200 s per crypt_all()-
LWS=256 GWS=16384 (64 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw: 4412 c/s real, 446836 c/s virtual
Notice how during auto-tuning it had much better speed for GWS=16384 than it ended up having in the end. My guess is the GPU's clock rate dropped (to maintain TDP) moments after it managed to run at almost full performance at GWS=16384. Then higher GWS shows almost the same performance (sometimes reaching our required 1%+ improvement, sometimes not), although at the same clock rate higher GWS would probably show significant improvement.

This significant difference is seen when manually forcing higher GWS or sometimes (still not reliably) when I allow 400 ms (then there's sometimes 1%+ improvement seen during auto-tuning at GWS=131072, which then translates into a 7% improvement vs. a separate GWS=16384 benchmark).

I don't see an easy fix for this. We do already try one step back from the seemingly best GWS (it's the final test of GWS=8192 above). Should we also do a similar forward re-test if the performance at the previously seen seemingly best GWS reduces substantially compared to when it was seen as best?
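As a sketch of such a forward re-test (hypothetical code, with benchmark_gws() standing in for one tuning measurement; not actual autotune code):

    #include <stddef.h>

    /* If the winning GWS now runs substantially slower than when it
     * won - e.g. because the GPU clocked down to stay within TDP -
     * give the next step up another chance on the throttled clocks. */
    static size_t maybe_retest_forward(size_t best_gws, double best_speed,
                                       double (*benchmark_gws)(size_t))
    {
        double now = benchmark_gws(best_gws);

        if (now < 0.9 * best_speed) {  /* "substantially" = 10%, say */
            double higher = benchmark_gws(best_gws * 2);

            if (higher > now * 1.01)   /* the usual 1%+ requirement */
                return best_gws * 2;
        }
        return best_gws;
    }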
> Hashcat's long option for -w is --workload-profile, perhaps we should go with that? I don't care which.

I vaguely recall this being called "intensity" in some popular program years back, maybe in older hashcat? To me, the word "intensity" makes more sense. Also, if we introduce any option starting with w, we wouldn't be able to specify the shortcut -w for wordlist mode. And we've already lost -i for incremental mode, so nothing more to lose there.
I think increasing the intensity should not only increase the maximum kernel duration, but also lower the speedup threshold for accepting a GWS increase (the 1%, etc. speedup that we currently require).
That sounds sensible to me too.
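One way the coupling could look, keeping today's defaults of a 200 ms budget and a 1% required gain (a sketch with hypothetical names, not a settled design):

    /* Derive both tuning knobs from one --intensity value, so that
     * intensity 1.0 matches current behavior. */
    static void intensity_knobs(double intensity,
                                double *max_duration_ms,
                                double *gain_required)
    {
        *max_duration_ms = 200.0 * intensity;       /* 2.0 -> 400 ms */
        *gain_required   = 1.0 + 0.01 / intensity;  /* 2.0 -> +0.5%  */
    }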
Our formats typically use 200 milliseconds as the maximum OpenCL kernel duration for auto-tuning, passing this number into autotune_run. Perhaps we can have autotune_run adjust this value based on a command-line option, e.g. --intensity=2 would double it, making most kernels tune to up to 400 ms. We can also accept e.g. --intensity=0 or --intensity=0.5 for halving the maximum kernel duration. This is probably similar to what hashcat does with its -w option.
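If something along these lines lands, a benchmark run might look like this (--test and --format exist today; the --intensity option is hypothetical at this point):

    ./john --test --format=bitcoin-opencl --intensity=2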
Right now, we can adjust this with LWS/GWS env vars or --lws/--gws command-line options, but these are trickier to use and their optimal values differ per-device and per-format. I think we need to provide an easier-to-use option.

For a specific example (with the changes I'm about to commit for the shared SHA-2 code), bitcoin-opencl achieves this on Vega 64 by default:

[...]

but with the limit increased to 400 ms it's this:

[...]
I think it'd be intuitive for a user to request --intensity=2 (or higher) for all (maybe different) GPUs on a headless system. It's less intuitive to guess there's improvement by manually doubling GWS a few times, and the resulting values would need to be managed per-GPU.

In the specific example above, I was wondering whether I should lower HASH_LOOPS, but somehow lowering it from 2000 to 500 didn't result in a similar increase in GWS, but instead resulted in a decrease in LWS and in lower speed (with the limit at 200 ms):

[...]

So it does appear that we need to allow more than 200 ms sometimes.
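Some back-of-the-envelope arithmetic on the log above, for context: the iteration count is 200460, so with HASH_LOOPS=2000 the split kernel apparently runs about 200460 / 2000, i.e. ~100 times per crypt_all(), matching the loop: 100x... lines. At GWS=16384 that's roughly 70 ms per call, while GWS=65536 hits ~274 ms and trips the 200 ms limit; with HASH_LOOPS=500 each call should take about a quarter as long, so in principle a 4x larger GWS would have fit under the same limit, which makes the observed LWS decrease and slowdown all the more surprising.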