kholia closed this issue 2 years ago.
I had a look, it's just as optimized as the other ones. However, SHA-512 is 64-bit while all current GPUs are 32-bit. So the compiler has to make 32-bit code for all 64-bit ops, making it half as fast - best case. I recall Atom mentioned he found a nice trick to make SHA-512 significantly faster but that was before they went open source so I have no idea what it was. Perhaps we should study their kernels!
In pure dictionary mode, hashcat's speed is around 97,000 c/s.
As I recall, our sha512crypt-opencl is actually on par with or even slightly faster than hashcat's equivalent as of the last time I tested. I think the optimizations to 64-bit rotate that I introduced a couple(?) of years ago are on par with whatever alternative hashcat had. We need to redo this comparison of sha512crypt-opencl on both the Titan X Maxwell and @kholia's 1050 Ti, as well as possibly on other GPUs, and also compare speeds of PBKDF2-HMAC-SHA512-opencl at the same time. Also compare actual cracking runs with same-length mask mode for both tools (we can choose length 7 to try to match hashcat's benchmark mode's test vector).
the optimizations to 64-bit rotate that I introduced a couple(?) of years ago
What optimizations were those? I only see this one, specific to AMD (the one actually used now is slightly more complicated, to work around some AMD bug when n is a multiple of 8):
#define ror64(x, n) ((n) < 32 ? \
(amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | \
((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) \
: \
(amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n) - 32) | \
((ulong)amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n) - 32) << 32)))
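For what it's worth, the decomposition behind this macro can be cross-checked on the host. The following is a hedged sketch in plain C: `bitalign()` is my stand-in for AMD's `amd_bitalign` builtin (assumed semantics: concatenate hi:lo into 64 bits, shift right by `shift & 31`, keep the low 32 bits), and `ror64_bitalign()` is the macro above written out as a function.

```c
#include <stdint.h>

/* Host-side stand-in for AMD's amd_bitalign builtin (an assumption about
 * its semantics: concatenate hi:lo, shift right by shift & 31, keep the
 * low 32 bits). */
static uint32_t bitalign(uint32_t hi, uint32_t lo, uint32_t shift)
{
	return (uint32_t)((((uint64_t)hi << 32) | lo) >> (shift & 31));
}

/* Reference 64-bit rotate right (n must be 1..63 here to avoid UB). */
static uint64_t ror64_ref(uint64_t x, unsigned n)
{
	return (x >> n) | (x << (64 - n));
}

/* The ror64() macro above, written out over 32-bit halves. */
static uint64_t ror64_bitalign(uint64_t x, unsigned n)
{
	uint32_t lo = (uint32_t)x, hi = (uint32_t)(x >> 32);

	if (n < 32)
		return bitalign(hi, lo, n) |
		       ((uint64_t)bitalign(lo, hi, n) << 32);
	return bitalign(lo, hi, n - 32) |
	       ((uint64_t)bitalign(hi, lo, n - 32) << 32);
}
```

Both branches agree with the plain 64-bit rotate for all n in 1..63, including the n = 32 boundary.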
the one actually used now is slightly more complicated to work around some AMD bug when n is a multiple of 8
Apparently that workaround is only used in some formats. Since we don't seem to see any bug I'm going to remove it. There are rotates with multiples of 8 in SHA-512.
Hmm, yes - it was probably only this one for AMD. As I recall, we determined that on NVIDIA the funnel shift instructions are used for rotate() by the compiler, so no tricks in our code are needed. (We only had to use inline asm for them on NVIDIA in md5crypt, where the uses are trickier than for a rotate.)
As to removing the workaround for multiples of 8, I'd find john-dev postings where we discussed it first. It might still be relevant. I hear you regarding us not using it consistently anyway, though.
I'm pretty sure there were no john-dev postings, I recall I added it because Atom mentioned it somewhere (and was referring to ancient drivers IIRC). So I actually never saw the bug and since the other formats (eg. Claudio's) haven't had the problem I'm pretty sure we're OK without it.
There were john-dev postings, see this and the rest of that thread:
https://www.openwall.com/lists/john-dev/2015/10/19/2
The thread also mentions PBKDF2-HMAC-SHA512-opencl being unexpectedly slower than sha512crypt-opencl (perhaps in terms of SHA-512s per second).
Also see #1840.
Oh, right. Well, we'll see if we hit a problem somewhere - then perhaps we can re-apply the workaround selectively.
My Macbook's AMD is faster with JtR (not even using bitalign at all) than with hashcat:
Device 2: AMD Radeon Pro 560 Compute Engine
Benchmarking: PBKDF2-HMAC-SHA512-opencl, GRUB2 / OS X 10.8+ [PBKDF2-SHA512 OpenCL]... DONE
Speed for cost 1 (iteration count) of 1000
Raw: 30624 c/s real, 3276K c/s virtual
* Device #3: AMD Radeon Pro 560 Compute Engine, 1024/4096 MB allocatable, 16MCU
Speed.#3.........: 25759 H/s (70.61ms) @ Accel:32 Loops:15 Thr:256 Vec:1
Super's Titan X (with a banned-old driver, so using --force) is much faster with hashcat though:
Hashmode: 12100 - PBKDF2-HMAC-SHA512 (Iterations: 999)
Speed.#7.........: 302.9 kH/s (58.22ms) @ Accel:64 Loops:31 Thr:384 Vec:1
Device 6: GeForce GTX TITAN X
Benchmarking: PBKDF2-HMAC-SHA512-opencl, GRUB2 / OS X 10.8+ [PBKDF2-SHA512 OpenCL]... DONE
Speed for cost 1 (iteration count) of 1000
Raw: 177124 c/s real, 175542 c/s virtual, GPU util: 99%
I can see they do a few things using 32-bit halves but nothing revolutionary or that I think should matter much (perhaps it does though, but that would be extremely poor compiler/optimizer performance by the runtime). The main functions, Maj, Ch and sigmas, are standard stuff. I have no idea why they are so much faster, it almost looks like we do a lot of redundant work somewhere, by mistake. But that should show up on my Macbook's AMD as well.
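For reference, the "standard stuff" mentioned here - Ch, Maj, and the Sigma/sigma functions - looks like this written out on the host. A minimal sketch in plain C per FIPS 180-4; the actual kernels implement these as macros over ulong, so the names and exact form here are illustrative:

```c
#include <stdint.h>

static uint64_t ror(uint64_t x, unsigned n)
{
	return (x >> n) | (x << (64 - n));
}

/* SHA-512 round functions, per FIPS 180-4. Ch is written in the common
 * optimized form z ^ (x & (y ^ z)), one AND and two XORs. */
static uint64_t Ch(uint64_t x, uint64_t y, uint64_t z)
{
	return z ^ (x & (y ^ z));
}

static uint64_t Maj(uint64_t x, uint64_t y, uint64_t z)
{
	return (x & y) | (z & (x | y));
}

static uint64_t Sigma0(uint64_t x) { return ror(x, 28) ^ ror(x, 34) ^ ror(x, 39); }
static uint64_t Sigma1(uint64_t x) { return ror(x, 14) ^ ror(x, 18) ^ ror(x, 41); }
static uint64_t sigma0(uint64_t x) { return ror(x, 1) ^ ror(x, 8) ^ (x >> 7); }
static uint64_t sigma1(uint64_t x) { return ror(x, 19) ^ ror(x, 61) ^ (x >> 6); }
```

Note that all six are built from 64-bit rotates and shifts, which is exactly why the 32-bit-halves question above matters so much for SHA-512 on GPUs.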
Latest hashcat & jumbo
hashcat (v5.1.0-797-g5a1d929) starting in benchmark mode...
* Device #1: gfx900, 4048/8176 MB allocatable, 64MCU
* Device #4: GeForce GTX 1080, 2029/8119 MB allocatable, 20MCU
Hashmode: 12100 - PBKDF2-HMAC-SHA512 (Iterations: 999)
Speed.#1.........: 352.0 kH/s (86.10ms) @ Accel:128 Loops:62 Thr:64 Vec:1
Speed.#4.........: 341.8 kH/s (52.37ms) @ Accel:256 Loops:124 Thr:32 Vec:1
Device 1: gfx900 [Radeon RX Vega]
Benchmarking: PBKDF2-HMAC-SHA512-opencl, GRUB2 / OS X 10.8+ [PBKDF2-SHA512 OpenCL]... DONE
Speed for cost 1 (iteration count) of 1000
Raw: 324435 c/s real, 8192K c/s virtual
Device 4: GeForce GTX 1080
Benchmarking: PBKDF2-HMAC-SHA512-opencl, GRUB2 / OS X 10.8+ [PBKDF2-SHA512 OpenCL]... DONE
Speed for cost 1 (iteration count) of 1000
Raw: 254015 c/s real, 254015 c/s virtual, GPU util: 100%
This still beats me
Looks like the register pressure in pbkdf2_sha512_loop is too high on NVIDIA:
ptxas info : Compiling entry function 'pbkdf2_sha512_loop' for 'sm_61'
ptxas info : Function properties for pbkdf2_sha512_loop
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 162 registers, 336 bytes cmem[0], 8 bytes cmem[2]
Can we reasonably reduce it?
This didn't help (still 162 registers and roughly the same speed):
+++ b/run/opencl/pbkdf2_hmac_sha512_kernel.cl
@@ -123,20 +123,17 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
uint i, rounds = state[idx].rounds;
uint r = MIN(rounds, HASH_LOOPS);
ulong W[16];
- ulong ipad_state[8];
- ulong opad_state[8];
ulong tmp_out[8];
for (i = 0; i < 8; i++) {
W[i] = state[idx].W[i];
- ipad_state[i] = state[idx].ipad[i];
- opad_state[i] = state[idx].opad[i];
tmp_out[i] = state[idx].hash[i];
}
for (i = 0; i < r; i++) {
ulong A, B, C, D, E, F, G, H, t;
+#define ipad_state state[idx].ipad
A = ipad_state[0];
B = ipad_state[1];
C = ipad_state[2];
@@ -159,9 +156,11 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
W[5] = F + ipad_state[5];
W[6] = G + ipad_state[6];
W[7] = H + ipad_state[7];
+#undef ipad_state
W[8] = 0x8000000000000000UL;
W[15] = 0x600;
+#define opad_state state[idx].opad
A = opad_state[0];
B = opad_state[1];
C = opad_state[2];
@@ -181,6 +180,7 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
W[5] = F += opad_state[5];
W[6] = G += opad_state[6];
W[7] = H += opad_state[7];
+#undef opad_state
tmp_out[0] ^= A;
tmp_out[1] ^= B;
Also, getting rid of tmp_out reduces the number of registers from 162 to 158, with almost no change in speed. (I guess we need at most 128.)
+++ b/run/opencl/pbkdf2_hmac_sha512_kernel.cl
@@ -123,20 +123,15 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
uint i, rounds = state[idx].rounds;
uint r = MIN(rounds, HASH_LOOPS);
ulong W[16];
- ulong ipad_state[8];
- ulong opad_state[8];
- ulong tmp_out[8];
for (i = 0; i < 8; i++) {
W[i] = state[idx].W[i];
- ipad_state[i] = state[idx].ipad[i];
- opad_state[i] = state[idx].opad[i];
- tmp_out[i] = state[idx].hash[i];
}
for (i = 0; i < r; i++) {
ulong A, B, C, D, E, F, G, H, t;
+#define ipad_state state[idx].ipad
A = ipad_state[0];
B = ipad_state[1];
C = ipad_state[2];
@@ -159,9 +154,11 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
W[5] = F + ipad_state[5];
W[6] = G + ipad_state[6];
W[7] = H + ipad_state[7];
+#undef ipad_state
W[8] = 0x8000000000000000UL;
W[15] = 0x600;
+#define opad_state state[idx].opad
A = opad_state[0];
B = opad_state[1];
C = opad_state[2];
@@ -181,7 +178,9 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
W[5] = F += opad_state[5];
W[6] = G += opad_state[6];
W[7] = H += opad_state[7];
+#undef opad_state
+#define tmp_out state[idx].hash
tmp_out[0] ^= A;
tmp_out[1] ^= B;
tmp_out[2] ^= C;
@@ -198,7 +197,6 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
if (rounds) { // there is still work to do
for (i = 0; i < 8; i++) {
- state[idx].hash[i] = tmp_out[i];
state[idx].W[i] = W[i];
}
}
@@ -206,6 +204,7 @@ __kernel void pbkdf2_sha512_loop(__global state_t *state,
for (i = 0; i < 8; i++)
out[idx].hash[i] = tmp_out[i];
}
+#undef tmp_out
}
__kernel void pbkdf2_sha512_kernel(__global const pass_t *inbuffer,
I just confirmed we're still faster than hashcat. I really wanted to see the ptxas info for hashcat but I see no way to do so, short of hacking the code.
I just confirmed we're still faster than hashcat.
I assume you mean slower.
BTW, for when we fix this issue, tezos-opencl uses a revised copy of the same code and would need to have any fixes ported over to it (or be switched to using shared code). Maybe there are more copies for other formats, I didn't check.
Edit: I was wrong, tezos-opencl only has separate uses of SHA-512 in its init (and soon final) kernel(s). It reuses the shared kernel for the main PBKDF2 loop, without code duplication.
I just confirmed we're still faster than hashcat.
I assume you mean slower.
No. The "still" was referring to this older post of yours:
As I recall, our sha512crypt-opencl is actually on par or even slightly faster than hashcat's equivalent as of the last time I tested.
Having tested a bit more, I'd say we're still on par - sometimes faster, sometimes slower. That's with a 2080ti.
A relevant difference between our code in opencl/opencl_sha2.h and opencl/cryptsha512_kernel_GPU.cl appears to be that the former expands W for rounds 17 to 80 on the fly, whereas the latter writes into w before each group of 16 rounds (like compact implementations of SHA-512 typically do). hashcat also uses the latter approach in its OpenCL/inc_hash_sha512.cl. I guess our on-the-fly expansion results in excessive register pressure, where the compiler somehow fails to figure out that it can already discard some past values (even though I find this surprising, as it shouldn't be that hard a task for the compiler here).
@magnumripper Would you like to try reworking the code accordingly soon or should I or someone else try?
That's interesting, I'll bite
I was wrong - I missed a write into wi inside the round macros in opencl/opencl_sha2.h. So that "difference" doesn't actually exist. However, one that does is that we have the code manually fully unrolled in opencl/opencl_sha2.h, whereas in opencl/cryptsha512_kernel_GPU.cl and in hashcat's OpenCL/inc_hash_sha512.cl it's loops that the compiler may or may not unroll.
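To make the two schedule shapes concrete: a full 80-entry W[] versus the rolling 16-word window that the round macros effectively implement via writes into W[i & 15]. This is a hedged host-side illustration of the technique, not the kernel code itself, and it verifies that the two shapes produce identical schedule values:

```c
#include <stdint.h>
#include <string.h>

static uint64_t ror64(uint64_t x, unsigned n)
{
	return (x >> n) | (x << (64 - n));
}

static uint64_t s0(uint64_t x) { return ror64(x, 1) ^ ror64(x, 8) ^ (x >> 7); }
static uint64_t s1(uint64_t x) { return ror64(x, 19) ^ ror64(x, 61) ^ (x >> 6); }

/* Full 80-entry SHA-512 message schedule: one slot per round. */
static void schedule_full(const uint64_t block[16], uint64_t W[80])
{
	memcpy(W, block, 16 * sizeof(uint64_t));
	for (int i = 16; i < 80; i++)
		W[i] = s1(W[i - 2]) + W[i - 7] + s0(W[i - 15]) + W[i - 16];
}

/* Rolling 16-word window: W[i & 15] is updated in place, so only 16
 * ulongs are ever live - the shape the round macros effectively use. */
static int schedules_match(const uint64_t block[16])
{
	uint64_t full[80], W[16];

	schedule_full(block, full);
	memcpy(W, block, sizeof(W));
	for (int i = 16; i < 80; i++) {
		/* (i - 16) & 15 == i & 15, so += folds in the old W[i - 16] */
		W[i & 15] += s1(W[(i - 2) & 15]) + W[(i - 7) & 15] +
		             s0(W[(i - 15) & 15]);
		if (W[i & 15] != full[i])
			return 0;
	}
	return 1;
}
```

The point of the window form is that only 16 schedule words are live at any time, which is what keeps register pressure bounded regardless of whether the round loop itself is unrolled.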
I've just tried partially rolling the loop, and it provides (only) a slight reduction in register allocation (162 to 153), but a significant speedup. Here's the GTX 1080 when it's cold (so max turbo):
[solar@super run]$ ./john -test -form=pbkdf2-hmac-sha512-opencl -dev=4
Device 4: GeForce GTX 1080
Benchmarking: PBKDF2-HMAC-SHA512-opencl, GRUB2 / OS X 10.8+ [PBKDF2-SHA512 OpenCL]... LWS=128 GWS=81920 (640 blocks) DONE
Speed for cost 1 (iteration count) of 1000
Raw: 370678 c/s real, 367354 c/s virtual, Dev#4 util: 99%
A comment from 2019 above says this was 254k with our code (which I reproduced now) and 342k with hashcat. So I like 370k.
+++ b/run/opencl/opencl_sha2.h
@@ -627,6 +627,25 @@ __constant ulong K[] = {
ROUND512_B(D,E,F,G,H,A,B,C,K[29],W[13], W[11],z,z,W[6]) \
ROUND512_B(C,D,E,F,G,H,A,B,K[30],W[14], W[12],W[15],z,W[7]) \
ROUND512_B(B,C,D,E,F,G,H,A,K[31],W[15], W[13],W[0],W[15],W[8]) \
+ for (uint i = 32; i < 80; i += 16) { \
+ ROUND512_B(A,B,C,D,E,F,G,H,K[i],W[0], W[14],W[1],W[0],W[9]) \
+ ROUND512_B(H,A,B,C,D,E,F,G,K[i+1],W[1], W[15],W[2],W[1],W[10]) \
+ ROUND512_B(G,H,A,B,C,D,E,F,K[i+2],W[2], W[0],W[3],W[2],W[11]) \
+ ROUND512_B(F,G,H,A,B,C,D,E,K[i+3],W[3], W[1],W[4],W[3],W[12]) \
+ ROUND512_B(E,F,G,H,A,B,C,D,K[i+4],W[4], W[2],W[5],W[4],W[13]) \
+ ROUND512_B(D,E,F,G,H,A,B,C,K[i+5],W[5], W[3],W[6],W[5],W[14]) \
+ ROUND512_B(C,D,E,F,G,H,A,B,K[i+6],W[6], W[4],W[7],W[6],W[15]) \
+ ROUND512_B(B,C,D,E,F,G,H,A,K[i+7],W[7], W[5],W[8],W[7],W[0]) \
+ ROUND512_B(A,B,C,D,E,F,G,H,K[i+8],W[8], W[6],W[9],W[8],W[1]) \
+ ROUND512_B(H,A,B,C,D,E,F,G,K[i+9],W[9], W[7],W[10],W[9],W[2]) \
+ ROUND512_B(G,H,A,B,C,D,E,F,K[i+10],W[10], W[8],W[11],W[10],W[3]) \
+ ROUND512_B(F,G,H,A,B,C,D,E,K[i+11],W[11], W[9],W[12],W[11],W[4]) \
+ ROUND512_B(E,F,G,H,A,B,C,D,K[i+12],W[12], W[10],W[13],W[12],W[5]) \
+ ROUND512_B(D,E,F,G,H,A,B,C,K[i+13],W[13], W[11],W[14],W[13],W[6]) \
+ ROUND512_B(C,D,E,F,G,H,A,B,K[i+14],W[14], W[12],W[15],W[14],W[7]) \
+ ROUND512_B(B,C,D,E,F,G,H,A,K[i+15],W[15], W[13],W[0],W[15],W[8]) }
+
+#define disab \
ROUND512_B(A,B,C,D,E,F,G,H,K[32],W[0], W[14],W[1],W[0],W[9]) \
ROUND512_B(H,A,B,C,D,E,F,G,K[33],W[1], W[15],W[2],W[1],W[10]) \
ROUND512_B(G,H,A,B,C,D,E,F,K[34],W[2], W[0],W[3],W[2],W[11]) \
This leaves the first 32 rounds unrolled. The first 16 are genuinely different; the second 16 use some zeroes (we can instead read zeroes from W if they are made part of the rolled loop introduced above). This could be worth further experiments. Also, in the non-ZEROS version of the macro we can easily roll from round 16 on - and we probably should - but that needs a benchmark. (I had previously tried replacing usage of the ZEROS macro with the non-ZEROS one in the PBKDF2-HMAC-SHA512 format, and it didn't make a difference - which suggests that the same optimization would be applicable to non-ZEROS, but ideally we need to benchmark it in formats that actually use it.)
An equivalent change in SHA256_ZEROS makes no obvious speed difference, and our speed at PBKDF2-HMAC-SHA256 matches hashcat's.
I don't immediately see any performance-critical uses of non-ZEROS SHA512 in our kernels. All such uses appear to be outside of loops. Besides, there isn't a register allocation this high reported for those other formats. So it looks like this change is only important in SHA512_ZEROS when used from the PBKDF2 loop, but it doesn't hurt other uses of that macro in my tests so far - e.g., bitcoin-opencl performance stays almost unchanged.
Should also test on other devices.
This also speeds up bitcoin-opencl on Vega 64 from 4200-4300 to 4600+. And the change in SHA256_ZEROS speeds up pbkdf2-hmac-sha256-opencl on the Vega 64 from ~1590K to ~1640K, so I'll probably get it in as well.
Since these changes are to a header file, they don't trigger an automatic rebuild of the affected kernels. This is especially bad on NVIDIA, where the caching is in ~/.nv, so even a newly cloned git tree in a new directory may still use old kernels. We might want to start a discussion on addressing this issue.
Tentative commit message:
OpenCL SHA-2: Roll the loops in SHA256_ZEROS and SHA512_ZEROS
Reduces code size and register pressure, and speeds up OpenCL formats using
this code on many devices while not hurting others tested so far. Examples
include +50% for pbkdf2-hmac-sha512-opencl on NVIDIA Maxwell and Pascal GPUs,
+33% for tezos-opencl on GTX 1080, +3% to +10% for pbkdf2-hmac-sha256-opencl,
pbkdf2-hmac-sha512-opencl, and bitcoin-opencl on AMD Vega 64.
Since this change is to an OpenCL header file only, you need to force an OpenCL
kernel rebuild with "make kernel-cache-clean" for the change to take effect.
Since these changes are to a header file, they don't trigger an automatic rebuild of the affected kernels. This is especially bad on NVIDIA, where the caching is in ~/.nv
Even weirder is when you then run benchmarks from an old version's directory and suddenly get the improved speeds. I was puzzled for a moment.
Identified a regression - bitcoin-opencl on Titan Kepler with previously manually tuned LWS=768 has its speed reduced from 1364 to 1200, while its speed is unchanged at 1053 with auto-tuned LWS=128. I might introduce an exception for it, where we'd use the old code.
Testing with the old Catalyst on Tahiti, there's a speedup of maybe 0.5% for both SHA-256 and SHA-512; however, pbkdf2-hmac-sha512-opencl is still very slow, at about 34% of the performance of bitcoin-opencl when I calculate the rate of individual SHA-512 computations. I guess the code size and/or register pressure is still too high there. Well, at least it's not a regression.
on Titan Kepler [...] introduce an exception for it, where we'd use old code.
I think I'll try a conditional #pragma unroll on the newly introduced loop first. There's probably no need to unroll that loop manually like we had it before.
I think I'll try a conditional #pragma unroll on the newly introduced loop
Can't do that - #pragma is not allowed inside a macro.
C99 has _Pragma, maybe OpenCL has that too?
C99 has _Pragma, maybe OpenCL has that too?
I don't know, but I've already proceeded to use and clean up the macros and I like what I'm getting, so no need.
Just for the record, I tried using _Pragma("unroll") \ within a macro on macOS and on Linux/NVIDIA and it seems to work just fine.
Such a conditional thing becomes a tad trickier: does this work?
#if foo
#define UNROLL "unroll 8"
#elif bar
#define UNROLL "unroll 16"
#else
#define UNROLL ""
#endif

#define foobar(...) \
	(...) \
	_Pragma(UNROLL) \
	for (...) \
	(...)
So I tried it, and it works fine! This is good to know. Note the "" for the "no unroll" case - just defining an empty macro UNROLL would result in error: _Pragma takes a parenthesized string literal.
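For the record, here's a compilable host-side analogue of the pattern above. It uses the GCC/Clang "GCC unroll N" pragma spelling as a stand-in for OpenCL's "unroll N", and the macro and function names (SUM_SQUARES, sum_squares) are made up for illustration:

```c
#include <stdint.h>

/* Pick the pragma string at compile time; "" yields a no-op #pragma
 * rather than a syntax error. */
#if defined(__GNUC__)
#define UNROLL "GCC unroll 4"
#else
#define UNROLL ""
#endif

/* _Pragma(UNROLL) splices the chosen pragma into the macro expansion,
 * directly before the loop it should apply to. */
#define SUM_SQUARES(n, out) do { \
		uint64_t acc_ = 0; \
		_Pragma(UNROLL) \
		for (uint64_t i_ = 1; i_ <= (n); i_++) \
			acc_ += i_ * i_; \
		(out) = acc_; \
	} while (0)

static uint64_t sum_squares(uint64_t n)
{
	uint64_t r;
	SUM_SQUARES(n, r);
	return r;
}
```

The key point is that _Pragma takes a string literal after macro expansion, so a macro that expands to "" still satisfies it, while an empty token sequence does not.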
Since these changes are to a header file, they don't trigger an automatic rebuild of the affected kernels. This is especially bad on NVIDIA, where the caching is in ~/.nv
Even weirder is when you then run benchmarks from an old version's directory and suddenly get the improved speeds. I was puzzled for a moment.
Perhaps we should always ignore a cached kernel (of our own) if its timestamp is older than our john binary? We could omit that check for cached DEScrypt salt-specific kernels, it's a separate code path anyway (I think).
It wouldn't solve the problem with NVIDIA though - that's a bug in their cache handling we simply can't get around. It shouldn't be hard for them to fix, if we reported it.
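The timestamp check proposed above could look roughly like this. This is only a sketch: the function name and both paths are illustrative, and the real cache layout and the DEScrypt exception are not shown.

```c
#include <sys/stat.h>

/* Treat a cached kernel binary as stale when it is missing or older
 * than the john executable itself (both paths are illustrative). */
static int kernel_cache_is_stale(const char *cached_kernel,
                                 const char *john_binary)
{
	struct stat k, j;

	if (stat(cached_kernel, &k) || stat(john_binary, &j))
		return 1; /* can't tell: rebuild to be safe */
	return k.st_mtime < j.st_mtime;
}
```

This catches the header-file case because rebuilding john updates the binary's mtime even when the .cl file itself is unchanged; it can't help with NVIDIA's own ~/.nv cache, which is outside our control.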
Perhaps we should always ignore a cached kernel (of our own) if its timestamp is older than our john binary? We could omit that check for cached DEScrypt salt-specific kernels, it's a separate code path anyway (I think).
This (and more) was implemented in #4913
It wouldn't solve the problem with NVIDIA though - that's a bug in their cache handling we simply can't get around. It shouldn't be hard for them to fix, if we reported it.
Per my testing with 465.19.01, NVIDIA no longer has the cache problem. If I break the PUTCHAR macro in opencl_misc.h, the format starts failing. Undo that change and the format passes again. I have absolutely no idea how long ago it was fixed, but I'm pretty sure it HAD the problem at some point.
Hopefully, those problems are things of the past now!
just defining an empty macro UNROLL would result in error: _Pragma takes a parenthesized string literal.
Perhaps a better fix for that would be to include the keyword _Pragma as part of the UNROLL macro, not add it externally?
Testing with the old Catalyst on Tahiti [...] pbkdf2-hmac-sha512-opencl is still very slow, at about 34% of the performance of bitcoin-opencl when I calculate the rate of individual SHA-512 computations.
FWIW, I've just tried using the generic SHA512 macro with the loop fully rolled (rounds 16-79 in the loop) instead of SHA512_ZEROS in pbkdf2_sha512_loop, hoping that the code size reduction could help, but that only hurt performance on that old system by 1%. So I guess the problem is register spills.
Edit: I previously wrote "by 3%", but that was wrong.
Bad news: I got self-test failures on Vega 64 with AMDGPU-Pro when trying to also roll the loops in the non-ZEROS versions of the macros. That's unfortunate - I liked the consistent approach and reuse of sub-macros that it enabled.
Look guys, I will not lie: I have been brute-forcing Solana wallets for a year now. I have two AMD Threadrippers, but we all know that brute-forcing on CPU sucks.
So I want to move PBKDF2-HMAC-SHA512 for seed generation to CUDA or OpenCL.
But for some reason I cannot make it generate the right seed...
Any help will be appreciated.
Our current speed:
Hashcat's speed:
It seems that our OpenCL PBKDF2-HMAC-SHA1 stuff is solid but our PBKDF2-HMAC-SHA512-opencl stuff isn't the best. Hopefully, I am comparing comparable things here.
@magnumripper This will require your help.