kholia closed this issue 2 years ago.
I had a look, it's just as optimized as the other ones. However, SHA-512 is 64-bit while all current GPUs are 32-bit. So the compiler has to make 32-bit code for all 64-bit ops, making it half as fast - best case. I recall Atom mentioned he found a nice trick to make SHA-512 significantly faster but that was before they went open source so I have no idea what it was. Perhaps we should study their kernels!
In pure dictionary mode, hashcat's speed is around 97,000 c/s.
As I recall, our sha512crypt-opencl is actually on par with or even slightly faster than hashcat's equivalent as of the last time I tested. I think the optimizations to 64-bit rotate that I introduced a couple(?) of years ago are on par with whatever alternative hashcat had. We need to redo this comparison of sha512crypt-opencl on both the Titan X Maxwell and @kholia's 1050 Ti, as well as possibly on other GPUs, and also compare speeds of PBKDF2-HMAC-SHA512-opencl at the same time. Also compare actual cracking runs with same-length mask mode for both tools (we can choose length 7 to try to match hashcat's benchmark mode's test vector).
the optimizations to 64-bit rotate that I introduced a couple(?) of years ago
What optimizations were those? I only see this one, specific to AMD (the one actually used now is slightly more complicated, to work around some AMD bug when n is a multiple of 8):
#define ror64(x, n) ((n) < 32 ? \
(amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | \
((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) \
: \
(amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n) - 32) | \
((ulong)amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n) - 32) << 32)))
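For what it's worth, the decomposition behind this macro can be cross-checked on the host. The following is a hedged sketch in plain C: `bitalign()` is my stand-in for AMD's `amd_bitalign` builtin (assumed semantics: concatenate hi:lo into 64 bits, shift right by `shift & 31`, keep the low 32 bits), and `ror64_bitalign()` is the macro above written out as a function.

```c
#include <stdint.h>

/* Host-side stand-in for AMD's amd_bitalign builtin (an assumption about
 * its semantics: concatenate hi:lo, shift right by shift & 31, keep the
 * low 32 bits). */
static uint32_t bitalign(uint32_t hi, uint32_t lo, uint32_t shift)
{
	return (uint32_t)((((uint64_t)hi << 32) | lo) >> (shift & 31));
}

/* Reference 64-bit rotate right (n must be 1..63 here to avoid UB). */
static uint64_t ror64_ref(uint64_t x, unsigned n)
{
	return (x >> n) | (x << (64 - n));
}

/* The ror64() macro above, written out over 32-bit halves. */
static uint64_t ror64_bitalign(uint64_t x, unsigned n)
{
	uint32_t lo = (uint32_t)x, hi = (uint32_t)(x >> 32);

	if (n < 32)
		return bitalign(hi, lo, n) |
		       ((uint64_t)bitalign(lo, hi, n) << 32);
	return bitalign(lo, hi, n - 32) |
	       ((uint64_t)bitalign(hi, lo, n - 32) << 32);
}
```

Both branches agree with the plain 64-bit rotate for all n in 1..63, including the n = 32 boundary.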
the one actually used now is slightly more complicated to work around some AMD bug when n is a multiple of 8
Apparently that workaround is only used in some formats. Since we don't seem to see any bug I'm going to remove it. There are rotates with multiples of 8 in SHA-512.
Hmm, yes - it was probably only this one for AMD. As I recall, we determined that on NVIDIA the funnel shift instructions are used for rotate() by the compiler, so no tricks in our code are needed. (We only had to use inline asm for them on NVIDIA in md5crypt, where the uses are trickier than for a rotate.)
As to removing the workaround for multiples of 8, I'd find john-dev postings where we discussed it first. It might still be relevant. I hear you regarding us not using it consistently anyway, though.
I'm pretty sure there were no john-dev postings, I recall I added it because Atom mentioned it somewhere (and was referring to ancient drivers IIRC). So I actually never saw the bug and since the other formats (eg. Claudio's) haven't had the problem I'm pretty sure we're OK without it.
There were john-dev postings, see this and the rest of that thread:
https://www.openwall.com/lists/john-dev/2015/10/19/2
The thread also mentions PBKDF2-HMAC-SHA512-opencl being unexpectedly slower than sha512crypt-opencl (perhaps in terms of SHA-512s per second).
Also see #1840.
Oh, right. Well, we'll see if we hit a problem somewhere - then perhaps we can re-apply the workaround selectively.
My Macbook's AMD is faster with JtR (not even using bitalign at all) than with hashcat:
Device 2: AMD Radeon Pro 560 Compute Engine
Benchmarking: PBKDF2-HMAC-SHA512-opencl, GRUB2 / OS X 10.8+ [PBKDF2-SHA512 OpenCL]... DONE
Speed for cost 1 (iteration count) of 1000
Raw: 30624 c/s real, 3276K c/s virtual
* Device #3: AMD Radeon Pro 560 Compute Engine, 1024/4096 MB allocatable, 16MCU
Speed.#3.........: 25759 H/s (70.61ms) @ Accel:32 Loops:15 Thr:256 Vec:1
Super's Titan X (with a banned-old driver, so using --force) is much faster with hashcat though:
Hashmode: 12100 - PBKDF2-HMAC-SHA512 (Iterations: 999)
Speed.#7.........: 302.9 kH/s (58.22ms) @ Accel:64 Loops:31 Thr:384 Vec:1
Device 6: GeForce GTX TITAN X
Benchmarking: PBKDF2-HMAC-SHA512-opencl, GRUB2 / OS X 10.8+ [PBKDF2-SHA512 OpenCL]... DONE
Speed for cost 1 (iteration count) of 1000
Raw: 177124 c/s real, 175542 c/s virtual, GPU util: 99%
I can see they do a few things using 32-bit halves but nothing revolutionary or that I think should matter much (perhaps it does though, but that would be extremely poor compiler/optimizer performance by the runtime). The main functions, Maj, Ch and sigmas, are standard stuff. I have no idea why they are so much faster, it almost looks like we do a lot of redundant work somewhere, by mistake. But that should show up on my Macbook's AMD as well.
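For reference, the "standard stuff" mentioned here - Ch, Maj, and the Sigma/sigma functions - looks like this written out on the host. A minimal sketch in plain C per FIPS 180-4; the actual kernels implement these as macros over ulong, so the names and exact form here are illustrative:

```c
#include <stdint.h>

static uint64_t ror(uint64_t x, unsigned n)
{
	return (x >> n) | (x << (64 - n));
}

/* SHA-512 round functions, per FIPS 180-4. Ch is written in the common
 * optimized form z ^ (x & (y ^ z)), one AND and two XORs. */
static uint64_t Ch(uint64_t x, uint64_t y, uint64_t z)
{
	return z ^ (x & (y ^ z));
}

static uint64_t Maj(uint64_t x, uint64_t y, uint64_t z)
{
	return (x & y) | (z & (x | y));
}

static uint64_t Sigma0(uint64_t x) { return ror(x, 28) ^ ror(x, 34) ^ ror(x, 39); }
static uint64_t Sigma1(uint64_t x) { return ror(x, 14) ^ ror(x, 18) ^ ror(x, 41); }
static uint64_t sigma0(uint64_t x) { return ror(x, 1) ^ ror(x, 8) ^ (x >> 7); }
static uint64_t sigma1(uint64_t x) { return ror(x, 19) ^ ror(x, 61) ^ (x >> 6); }
```

Note that all six are built from 64-bit rotates and shifts, which is exactly why the 32-bit-halves question above matters so much for SHA-512 on GPUs.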
Latest hashcat & jumbo
hashcat (v5.1.0-797-g5a1d929) starting in benchmark mode...
* Device #1: gfx900, 4048/8176 MB allocatable, 64MCU
* Device #4: GeForce GTX 1080, 2029/8119 MB allocatable, 20MCU
Hashmode: 12100 - PBKDF2-HMAC-SHA512 (Iterations: 999)
Speed.#1.........: 352.0 kH/s (86.10ms) @ Accel:128 Loops:62 Thr:64 Vec:1
Speed.#4.........: 341.8 kH/s (52.37ms) @ Accel:256 Loops:124 Thr:32 Vec:1
Device 1: gfx900 [Radeon RX Vega]
Benchmarking: PBKDF2-HMAC-SHA512-opencl, GRUB2 / OS X 10.8+ [PBKDF2-SHA512 OpenCL]... DONE
Speed for cost 1 (iteration count) of 1000
Raw: 324435 c/s real, 8192K c/s virtual
Device 4: GeForce GTX 1080
Benchmarking: PBKDF2-HMAC-SHA512-opencl, GRUB2 / OS X 10.8+ [PBKDF2-SHA512 OpenCL]... DONE
Speed for cost 1 (iteration count) of 1000
Raw: 254015 c/s real, 254015 c/s virtual, GPU util: 100%
This still beats me
Looks like the register pressure in pbkdf2_sha512_loop is too high on NVIDIA:
ptxas info : Compiling entry function 'pbkdf2_sha512_loop' for 'sm_61'
ptxas info : Function properties for pbkdf2_sha512_loop
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 162 registers, 336 bytes cmem[0], 8 bytes cmem[2]
Can we reasonably reduce it?
This didn't help (still 162 registers and roughly the same speed):
+++ b/run/opencl/pbkdf2_hmac_sha512_kernel.cl
@@ -123,20 +123,17 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
uint i, rounds = state[idx].rounds;
uint r = MIN(rounds, HASH_LOOPS);
ulong W[16];
- ulong ipad_state[8];
- ulong opad_state[8];
ulong tmp_out[8];
for (i = 0; i < 8; i++) {
W[i] = state[idx].W[i];
- ipad_state[i] = state[idx].ipad[i];
- opad_state[i] = state[idx].opad[i];
tmp_out[i] = state[idx].hash[i];
}
for (i = 0; i < r; i++) {
ulong A, B, C, D, E, F, G, H, t;
+#define ipad_state state[idx].ipad
A = ipad_state[0];
B = ipad_state[1];
C = ipad_state[2];
@@ -159,9 +156,11 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
W[5] = F + ipad_state[5];
W[6] = G + ipad_state[6];
W[7] = H + ipad_state[7];
+#undef ipad_state
W[8] = 0x8000000000000000UL;
W[15] = 0x600;
+#define opad_state state[idx].opad
A = opad_state[0];
B = opad_state[1];
C = opad_state[2];
@@ -181,6 +180,7 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
W[5] = F += opad_state[5];
W[6] = G += opad_state[6];
W[7] = H += opad_state[7];
+#undef opad_state
tmp_out[0] ^= A;
tmp_out[1] ^= B;
Also, getting rid of tmp_out reduces the number of registers from 162 to 158, with almost no change in speed. (I guess we need at most 128.)
+++ b/run/opencl/pbkdf2_hmac_sha512_kernel.cl
@@ -123,20 +123,15 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
uint i, rounds = state[idx].rounds;
uint r = MIN(rounds, HASH_LOOPS);
ulong W[16];
- ulong ipad_state[8];
- ulong opad_state[8];
- ulong tmp_out[8];
for (i = 0; i < 8; i++) {
W[i] = state[idx].W[i];
- ipad_state[i] = state[idx].ipad[i];
- opad_state[i] = state[idx].opad[i];
- tmp_out[i] = state[idx].hash[i];
}
for (i = 0; i < r; i++) {
ulong A, B, C, D, E, F, G, H, t;
+#define ipad_state state[idx].ipad
A = ipad_state[0];
B = ipad_state[1];
C = ipad_state[2];
@@ -159,9 +154,11 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
W[5] = F + ipad_state[5];
W[6] = G + ipad_state[6];
W[7] = H + ipad_state[7];
+#undef ipad_state
W[8] = 0x8000000000000000UL;
W[15] = 0x600;
+#define opad_state state[idx].opad
A = opad_state[0];
B = opad_state[1];
C = opad_state[2];
@@ -181,7 +178,9 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
W[5] = F += opad_state[5];
W[6] = G += opad_state[6];
W[7] = H += opad_state[7];
+#undef opad_state
+#define tmp_out state[idx].hash
tmp_out[0] ^= A;
tmp_out[1] ^= B;
tmp_out[2] ^= C;
@@ -198,7 +197,6 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
if (rounds) { // there is still work to do
for (i = 0; i < 8; i++) {
- state[idx].hash[i] = tmp_out[i];
state[idx].W[i] = W[i];
}
}
@@ -206,6 +204,7 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
for (i = 0; i < 8; i++)
out[idx].hash[i] = tmp_out[i];
}
+#undef tmp_out
}
__kernel void pbkdf2_sha512_kernel(__global const pass_t *inbuffer,
I just confirmed we're still faster than hashcat. I really wanted to see the ptxas info for hashcat but I see no way to do so, short of hacking the code.
I just confirmed we're still faster than hashcat.
I assume you mean slower.
BTW, for when we fix this issue, tezos-opencl uses a revised copy of the same code and would need to have any fixes ported over to it (or be switched to using shared code). Maybe there are more copies for other formats, I didn't check.
Edit: I was wrong, tezos-opencl only has separate uses of SHA-512 in its init (and soon final) kernel(s). It reuses the shared kernel for the main PBKDF2 loop, without code duplication.
I just confirmed we're still faster than hashcat.
I assume you mean slower.
No. The "still" was referring to this older post of yours:
As I recall, our sha512crypt-opencl is actually on par or even slightly faster than hashcat's equivalent as of the last time I tested.
Having tested a bit more, I'd say we're still on par - sometimes faster, sometimes slower. That's with a 2080ti.
A relevant difference between our code in opencl/opencl_sha2.h and opencl/cryptsha512_kernel_GPU.cl appears to be that the former expands W for rounds 17 to 80 on the fly, whereas the latter writes into w before each group of 16 rounds (like compact implementations of SHA-512 typically do). hashcat also uses the latter approach in its OpenCL/inc_hash_sha512.cl. I guess our on-the-fly expansion results in excessive register pressure, where the compiler somehow fails to figure out that it can already discard some past values (even though I find this surprising, as it shouldn't be that hard a task for the compiler here).
@magnumripper Would you like to try reworking the code accordingly soon or should I or someone else try?
That's interesting, I'll bite
I was wrong - I missed a write into wi inside the round macros in opencl/opencl_sha2.h. So that "difference" doesn't actually exist. However, one that does is that we have the code manually fully unrolled in opencl/opencl_sha2.h, whereas in opencl/cryptsha512_kernel_GPU.cl and in hashcat's OpenCL/inc_hash_sha512.cl it's loops that the compiler may or may not unroll.
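To make the two schedule shapes concrete: a full 80-entry W[] versus the rolling 16-word window that the round macros effectively implement via writes into W[i & 15]. This is a hedged host-side illustration of the technique, not the kernel code itself, and it verifies that the two shapes produce identical schedule values:

```c
#include <stdint.h>
#include <string.h>

static uint64_t ror64(uint64_t x, unsigned n)
{
	return (x >> n) | (x << (64 - n));
}

static uint64_t s0(uint64_t x) { return ror64(x, 1) ^ ror64(x, 8) ^ (x >> 7); }
static uint64_t s1(uint64_t x) { return ror64(x, 19) ^ ror64(x, 61) ^ (x >> 6); }

/* Full 80-entry SHA-512 message schedule: one slot per round. */
static void schedule_full(const uint64_t block[16], uint64_t W[80])
{
	memcpy(W, block, 16 * sizeof(uint64_t));
	for (int i = 16; i < 80; i++)
		W[i] = s1(W[i - 2]) + W[i - 7] + s0(W[i - 15]) + W[i - 16];
}

/* Rolling 16-word window: W[i & 15] is updated in place, so only 16
 * ulongs are ever live - the shape the round macros effectively use. */
static int schedules_match(const uint64_t block[16])
{
	uint64_t full[80], W[16];

	schedule_full(block, full);
	memcpy(W, block, sizeof(W));
	for (int i = 16; i < 80; i++) {
		/* (i - 16) & 15 == i & 15, so += folds in the old W[i - 16] */
		W[i & 15] += s1(W[(i - 2) & 15]) + W[(i - 7) & 15] +
		             s0(W[(i - 15) & 15]);
		if (W[i & 15] != full[i])
			return 0;
	}
	return 1;
}
```

The point of the window form is that only 16 schedule words are live at any time, which is what keeps register pressure bounded regardless of whether the round loop itself is unrolled.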
I've just tried partially rolling the loop, and it provides (only) a slight reduction in register allocation (162 to 153), but a significant speedup. Here's the GTX 1080 when it's cold (so max turbo):
[solar@super run]$ ./john -test -form=pbkdf2-hmac-sha512-opencl -dev=4
Device 4: GeForce GTX 1080
Benchmarking: PBKDF2-HMAC-SHA512-opencl, GRUB2 / OS X 10.8+ [PBKDF2-SHA512 OpenCL]... LWS=128 GWS=81920 (640 blocks) DONE
Speed for cost 1 (iteration count) of 1000
Raw: 370678 c/s real, 367354 c/s virtual, Dev#4 util: 99%
A comment from 2019 above says this was 254k with our code (which I reproduced now) and 342k with hashcat. So I like 370k.
+++ b/run/opencl/opencl_sha2.h
@@ -627,6 +627,25 @@ __constant ulong K[] = {
ROUND512_B(D,E,F,G,H,A,B,C,K[29],W[13], W[11],z,z,W[6]) \
ROUND512_B(C,D,E,F,G,H,A,B,K[30],W[14], W[12],W[15],z,W[7]) \
ROUND512_B(B,C,D,E,F,G,H,A,K[31],W[15], W[13],W[0],W[15],W[8]) \
+ for (uint i = 32; i < 80; i += 16) { \
+ ROUND512_B(A,B,C,D,E,F,G,H,K[i],W[0], W[14],W[1],W[0],W[9]) \
+ ROUND512_B(H,A,B,C,D,E,F,G,K[i+1],W[1], W[15],W[2],W[1],W[10]) \
+ ROUND512_B(G,H,A,B,C,D,E,F,K[i+2],W[2], W[0],W[3],W[2],W[11]) \
+ ROUND512_B(F,G,H,A,B,C,D,E,K[i+3],W[3], W[1],W[4],W[3],W[12]) \
+ ROUND512_B(E,F,G,H,A,B,C,D,K[i+4],W[4], W[2],W[5],W[4],W[13]) \
+ ROUND512_B(D,E,F,G,H,A,B,C,K[i+5],W[5], W[3],W[6],W[5],W[14]) \
+ ROUND512_B(C,D,E,F,G,H,A,B,K[i+6],W[6], W[4],W[7],W[6],W[15]) \
+ ROUND512_B(B,C,D,E,F,G,H,A,K[i+7],W[7], W[5],W[8],W[7],W[0]) \
+ ROUND512_B(A,B,C,D,E,F,G,H,K[i+8],W[8], W[6],W[9],W[8],W[1]) \
+ ROUND512_B(H,A,B,C,D,E,F,G,K[i+9],W[9], W[7],W[10],W[9],W[2]) \
+ ROUND512_B(G,H,A,B,C,D,E,F,K[i+10],W[10], W[8],W[11],W[10],W[3]) \
+ ROUND512_B(F,G,H,A,B,C,D,E,K[i+11],W[11], W[9],W[12],W[11],W[4]) \
+ ROUND512_B(E,F,G,H,A,B,C,D,K[i+12],W[12], W[10],W[13],W[12],W[5]) \
+ ROUND512_B(D,E,F,G,H,A,B,C,K[i+13],W[13], W[11],W[14],W[13],W[6]) \
+ ROUND512_B(C,D,E,F,G,H,A,B,K[i+14],W[14], W[12],W[15],W[14],W[7]) \
+ ROUND512_B(B,C,D,E,F,G,H,A,K[i+15],W[15], W[13],W[0],W[15],W[8]) }
+
+#define disab \
ROUND512_B(A,B,C,D,E,F,G,H,K[32],W[0], W[14],W[1],W[0],W[9]) \
ROUND512_B(H,A,B,C,D,E,F,G,K[33],W[1], W[15],W[2],W[1],W[10]) \
ROUND512_B(G,H,A,B,C,D,E,F,K[34],W[2], W[0],W[3],W[2],W[11]) \
This leaves the first 32 rounds unrolled. The first 16 are genuinely different; the second 16 use some zeroes (we can instead read zeroes from W if they are made part of the rolled loop introduced above). This could be worth further experiments. Also, in the non-ZEROS version of the macro we can easily roll from round 16 on - and we probably should - but that needs a benchmark. (I had previously tried replacing usage of the ZEROS macro with the non-ZEROS one in the PBKDF2-HMAC-SHA512 format, and it didn't make a difference - which suggests that the same optimization would be applicable to non-ZEROS, but ideally we need to benchmark it in formats that actually use it.)
An equivalent change in SHA256_ZEROS makes no obvious speed difference, and our speed at PBKDF2-HMAC-SHA256 matches hashcat's.
I don't immediately see any performance-critical uses of non-ZEROS SHA512 in our kernels. All such uses appear to be outside of loops. Besides, there isn't a register allocation this high reported for those other formats. So it looks like this change is only important in SHA512_ZEROS when used from the PBKDF2 loop, but it doesn't hurt other uses of that macro in my tests so far - e.g., bitcoin-opencl performance stays almost unchanged.
Should also test on other devices.
This also speeds up bitcoin-opencl on Vega 64 from 4200-4300 to 4600+. And the change in SHA256_ZEROS speeds up pbkdf2-hmac-sha256-opencl on the Vega 64 from ~1590K to ~1640K, so I'll probably get it in as well.
Since these changes are to a header file, they don't trigger an automatic rebuild of the affected kernels. This is especially bad on NVIDIA, where the caching is in ~/.nv, so even a newly cloned git tree in a new directory may still use old kernels. We might want to start a discussion on addressing this issue.
Tentative commit message:
OpenCL SHA-2: Roll the loops in SHA256_ZEROS and SHA512_ZEROS
Reduces code size and register pressure, and speeds up OpenCL formats using
this code on many devices while not hurting others tested so far. Examples
include +50% for pbkdf2-hmac-sha512-opencl on NVIDIA Maxwell and Pascal GPUs,
+33% for tezos-opencl on GTX 1080, +3% to +10% for pbkdf2-hmac-sha256-opencl,
pbkdf2-hmac-sha512-opencl, and bitcoin-opencl on AMD Vega 64.
Since this change is to an OpenCL header file only, you need to force an OpenCL
kernel rebuild with "make kernel-cache-clean" for the change to take effect.
Since these changes are to a header file, they don't trigger an automatic rebuild of the affected kernels. This is especially bad on NVIDIA, where the caching is in ~/.nv
Even weirder is when you then run benchmarks from an old version's directory and suddenly get the improved speeds. I was puzzled for a moment.
Identified a regression - bitcoin-opencl on Titan Kepler with previously manually tuned LWS=768 has its speed reduced from 1364 to 1200, while its speed is unchanged at 1053 with auto-tuned LWS=128. I might introduce an exception for it, where we'd use the old code.
Testing with the old Catalyst on Tahiti, there's a speedup of maybe 0.5% for both SHA-256 and SHA-512; however, pbkdf2-hmac-sha512-opencl is still very slow, at about 34% of the performance of bitcoin-opencl when I calculate the rate of individual SHA-512 computations. I guess the code size and/or register pressure is still too high there. Well, at least it's not a regression.
on Titan Kepler [...] introduce an exception for it, where we'd use old code.
I think I'll try a conditional #pragma unroll on the newly introduced loop first. There's probably no need to unroll that loop manually like we had it before.
I think I'll try a conditional #pragma unroll on the newly introduced loop
Can't do that - #pragma is not allowed inside a macro.
C99 has _Pragma, maybe OpenCL has that too?
C99 has _Pragma, maybe OpenCL has that too?
I don't know, but I've already proceeded to use and clean up the macros and I like what I'm getting, so no need.
Just for the record, I tried using _Pragma("unroll") \ within a macro on macOS and on Linux/NVIDIA and it seems to work just fine.
Such a conditional thing becomes a tad trickier: does this work?
#if foo
#define UNROLL "unroll 8"
#elif bar
#define UNROLL "unroll 16"
#else
#define UNROLL ""
#endif

#define foobar(...) \
	(...) \
	_Pragma(UNROLL) \
	for (...) \
	(...)
So I tried it, and it works fine! This is good to know. Note the "" for the "no unroll" case - just defining an empty macro UNROLL would result in error: _Pragma takes a parenthesized string literal.
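For the record, here's a compilable host-side analogue of the pattern above. It uses the GCC/Clang "GCC unroll N" pragma spelling as a stand-in for OpenCL's "unroll N", and the macro and function names (SUM_SQUARES, sum_squares) are made up for illustration:

```c
#include <stdint.h>

/* Pick the pragma string at compile time; "" yields a no-op #pragma
 * rather than a syntax error. */
#if defined(__GNUC__)
#define UNROLL "GCC unroll 4"
#else
#define UNROLL ""
#endif

/* _Pragma(UNROLL) splices the chosen pragma into the macro expansion,
 * directly before the loop it should apply to. */
#define SUM_SQUARES(n, out) do { \
		uint64_t acc_ = 0; \
		_Pragma(UNROLL) \
		for (uint64_t i_ = 1; i_ <= (n); i_++) \
			acc_ += i_ * i_; \
		(out) = acc_; \
	} while (0)

static uint64_t sum_squares(uint64_t n)
{
	uint64_t r;
	SUM_SQUARES(n, r);
	return r;
}
```

The key point is that _Pragma takes a string literal after macro expansion, so a macro that expands to "" still satisfies it, while an empty token sequence does not.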
Since these changes are to a header file, they don't trigger an automatic rebuild of the affected kernels. This is especially bad on NVIDIA, where the caching is in ~/.nv
Even weirder is when you then run benchmarks from an old version's directory and suddenly get the improved speeds. I was puzzled for a moment.
Perhaps we should always ignore a cached kernel (of our own) if its timestamp is older than our john binary? We could omit that check for cached DEScrypt salt-specific kernels, it's a separate code path anyway (I think).
It wouldn't solve the problem with NVIDIA though - that's a bug in their cache handling we simply can't get around. It shouldn't be hard for them to fix, if we reported it.
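The timestamp check proposed above could look roughly like this. This is only a sketch: the function name and both paths are illustrative, and the real cache layout and the DEScrypt exception are not shown.

```c
#include <sys/stat.h>

/* Treat a cached kernel binary as stale when it is missing or older
 * than the john executable itself (both paths are illustrative). */
static int kernel_cache_is_stale(const char *cached_kernel,
                                 const char *john_binary)
{
	struct stat k, j;

	if (stat(cached_kernel, &k) || stat(john_binary, &j))
		return 1; /* can't tell: rebuild to be safe */
	return k.st_mtime < j.st_mtime;
}
```

This catches the header-file case because rebuilding john updates the binary's mtime even when the .cl file itself is unchanged; it can't help with NVIDIA's own ~/.nv cache, which is outside our control.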
Perhaps we should always ignore a cached kernel (of our own) if its timestamp is older than our john binary? We could omit that check for cached DEScrypt salt-specific kernels, it's a separate code path anyway (I think).
This (and more) was implemented in #4913
It wouldn't solve the problem with NVIDIA though - that's a bug in their cache handling we simply can't get around. It shouldn't be hard for them to fix, if we reported it.
Per my testing with 465.19.01, NVIDIA no longer has the cache problem. If I break the PUTCHAR macro in opencl_misc.h, the format starts failing. Undo that change and the format passes again. I have absolutely no idea how long ago it was fixed, but I'm pretty sure it HAD the problem at some point.
Hopefully, those problems are things of the past now!
just defining an empty macro UNROLL would result in error: _Pragma takes a parenthesized string literal.
Perhaps a better fix for that would be to include the keyword _Pragma as part of the UNROLL macro, not add it externally?
Testing with the old Catalyst on Tahiti [...] pbkdf2-hmac-sha512-opencl is still very slow, at about 34% of the performance of bitcoin-opencl when I calculate the rate of individual SHA-512 computations.
FWIW, I've just tried using the generic SHA512 macro with the loop fully rolled (rounds 16-79 in the loop) instead of SHA512_ZEROS in pbkdf2_sha512_loop, hoping that the code size reduction could help, but that only hurt performance on that old system by 1%. So I guess the problem is register spills.
Edit: I previously wrote "by 3%", but that was wrong.
Bad news: I got self-test failures on Vega 64 with AMDGPU-Pro when trying to also roll the loops in the non-ZEROS versions of the macros. That's unfortunate - I liked the consistent approach and reuse of sub-macros that it enabled.
Look guys, I will not lie: I have been brute-forcing Solana wallets for a year now. I have two AMD Threadrippers, but we all know that brute-forcing on CPU sucks.
So I want to move PBKDF2-HMAC-SHA512 for seed generation to CUDA or OpenCL.
But for some reason I cannot make it generate the right seed...
Any help will be appreciated.
Our current speed:
Hashcat's speed:
It seems that our OpenCL PBKDF2-HMAC-SHA1 stuff is solid but our PBKDF2-HMAC-SHA512-opencl stuff isn't the best. Hopefully, I am comparing comparable things here.
@magnumripper This will require your help.