tezos-opencl kernel build takes longer than 10 hours with recent nvidia drivers

magnumripper commented 3 weeks ago

This was during a --test=0 --format=opencl (edit: and a 2080ti) with driver 550.107.02. I killed it after 45 minutes with 100% CPU. Did we see this before?

Also tried -test=0 -form:tezos-opencl by itself, same thing but I interrupted sooner. Will try other drivers (as long as they are readily available as Ubuntu packages).

magnumripper commented 3 weeks ago

With 418.39 (edit: and GTX 1080), a complete john -test=0 -form:tezos-opencl takes ~20 seconds.

magnumripper commented 3 weeks ago

Same problem with 535.183.01. I think I'll try to let it finish and see if it ever does.

magnumripper commented 3 weeks ago

I think I'll try to let it finish and see if it ever does.

I gave up and aborted after 10 hours. Meanwhile I tried an Apple M1 laptop and with nothing cached it completes a time john -test=0 -form:tezos-opencl in seven seconds wall clock (and it does pass the test). On the intel macbook (three devices) the CPU runs it in 12s, the UHD 630 in 35s and the AMD Vega is very slow (aborted at 25m).

BTW while doing this I discovered that with the M1, macOS has binary caching that we don't include in "make kernel-cache-clean" (subsequent tests complete in half a second) and I have no idea yet where it is. I tried touching the files, or even dropping the whole run/opencl directory and git reset but it still used the cache. Finally I edited the tezos kernel very slightly and that did invalidate the cache. After reverting to the unmodfied tezos kernel, that one was still also cached.

A huge problem with this is that it (just like all others I know of) doesn't detect changes in included source files so until I find out how to clear the cache, OpenCL coding or even testing on that laptop is unfeasible. On the intel macbook, our kernel-cache-clean does the trick.

solardiz commented 3 weeks ago

Thank you for noticing and working on these issues, @magnumripper! All of this is new to me - I've never seen builds of this kernel take anywhere that long (I think it was always under a minute), and I was unaware of this new cache on macOS.

Is this the only kernel we have for which the build (presumably) gets stuck with those drivers? As I understood from your comments, this happens with recent NVIDIA and AMD drivers - that's especially weird.

magnumripper commented 3 weeks ago

Is this the only kernel we have for which the build (presumably) gets stuck

On Linux, yes. I disabled that format and a -test=0 for all OpenCL formats took reasonably short time.

I now downgraded to 470.256.02 and this problem went away. Note that the AMD problem I had was macOS only and that driver is so bad it's nothing to care about IMO.

magnumripper commented 3 weeks ago

FWIW I found out how to drop the build cache from the M1. Unfortunately it needs privileges: sudo killall CVMServer. This works most of the time, sometimes I need to do it again (even though confirming no CVMServer were left running).

Not sure if/how to add that to our kernel-cache-clean target. Anyhow this means the caches shouldn't survive a reboot so the problem isn't (quite) as bad as I thought.

magnumripper commented 3 weeks ago

Anyhow this means the caches shouldn't survive a reboot

Yet they do :-( and I'm giving up here for now.

solardiz commented 2 weeks ago

@magnumripper Are you going to try and find a workaround for tezos-opencl not freezing the build with new NVIDIA drivers? That would help.

claudioandre-br commented 2 weeks ago

Let's imagine that it's impossible to fix the problem, so we need to disable the format using something like this:

diff --git a/src/opencl_tezos_fmt_plug.c b/src/opencl_tezos_fmt_plug.c
index b16ca268c..b991e44c7 100644
--- a/src/opencl_tezos_fmt_plug.c
+++ b/src/opencl_tezos_fmt_plug.c
@@ -166,8 +166,18 @@ static void create_clobj(size_t kpc, struct fmt_main *self)

 static void init(struct fmt_main *_self)
 {
+   int major, minor;
+
    self = _self;
    opencl_prepare_dev(gpu_id);
+
+   opencl_driver_value(gpu_id, &major, &minor);
+
+   if (major == 5 && minor == 0) { // && macOS && ...
+       fprintf(stderr,
+               "The OpenCL driver in use cannot run this kernel. Please, change your driver!\n");
+       error();
+   }
 }

 static void reset(struct db_main *db)

The problem that arises from this solution is that I'm not just disabling a format, I'm making it impossible from now on to run a --test=0 --format=opencl, i.e. the below will fail in an unwanted way:

$ ../run/john --test --format=tezos,tezos-opencl,raw-sha256
Will run 8 OpenMP threads
Benchmarking: tezos, Tezos Key [PBKDF2-SHA512 256/256 AVX2 4x]... (8xOMP) DONE
Raw:    2154 c/s real, 344 c/s virtual

Device 1: cpu-haswell-AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx
The OpenCL driver in use cannot run this kernel. Please, update your driver!

Is it feasible to disable a format after init() through a small/localized change?

I need the driver version during runtime and I want to silently disable (turn off) only one format.

magnumripper commented 2 weeks ago

@magnumripper Are you going to try and find a workaround for tezos-opencl not freezing the build with new NVIDIA drivers? That would help.

Not sure where to start really, except the usual "throw things at it and see what sticks". It's obviously a compiler bug and with some luck it will go away with later drivers.

Also, I'm "struggling" with final details before contributing pdf-opencl (will open a ticket with a RFC on how to proceed) and after that I should try merging my Argon2 support for Keepass and Keepass-opencl that stalled a year ago. I had it 99% finished when Alain suddenly contributed argon2-opencl, making my keepass-argon2 branch impossible to rebase without non-trivial conflicts. I'm not complaining, I didn't give anyone any heads-up myself, lol. I should probably rewrite Keepass-opencl to use his Argon2 stuff but I have a hard time following his code.

solardiz commented 2 weeks ago

Is it feasible to disable a format after init() through a small/localized change?

A dirty hack that comes to mind is to check (only the first time this code runs) in valid() just before it'd return non-zero, print a message, and return zero. This wouldn't be strictly "after init()" - in fact, for actual cracking runs it'd be "before init()".

Not sure where to start really, except the usual "throw things at it and see what sticks". It's obviously a compiler bug

Yes. Something like bisecting our code to see what triggers the bug. It's no problem if during such bisecting the computation would be wrong (failing self-test) - the point is to see what gets past the compiler and what does not.

I should try merging my Argon2 support for Keepass and Keepass-opencl that stalled a year ago. I had it 99% finished

Please open an issue for that, so that we'd remember to ask you e.g. before making a release. Thank you!

magnumripper commented 1 week ago

First clue (this compiles in 6s)

diff --git a/run/opencl/tezos_kernel.cl b/run/opencl/tezos_kernel.cl
index 2625f0d9e..8ad99ed02 100644
--- a/run/opencl/tezos_kernel.cl
+++ b/run/opencl/tezos_kernel.cl
@@ -207,7 +207,7 @@ __kernel void pbkdf2_sha512_tezos_final(__global const crack_t *in, __constant t
        memcpy_macro(sk.u64, in[idx].hash, 4);
        for (int i = 0; i < 4; i++)
                sk.u64[i] = SWAP64(sk.u64[i]);
-       ed25519_publickey(sk.uc, pk);
+       //ed25519_publickey(sk.uc, pk);
        blake2b(pk, 20, NULL, 0, pk, sizeof(pk)); /* Replace pk with pkh */
        if (!memcmp_pc(pk, gsalt->pkh, 20)) {
                atomic_inc(out);

magnumripper commented 1 week ago

This compiles in 6s

diff --git a/run/opencl/ed25519-donna/ed25519-donna.c b/run/opencl/ed25519-donna/ed25519-donna.c
index 09ef8f9f5..a05432b88 100644
--- a/run/opencl/ed25519-donna/ed25519-donna.c
+++ b/run/opencl/ed25519-donna/ed25519-donna.c
@@ -29,6 +29,6 @@ ed25519_publickey(const ed25519_secret_key sk, ed25519_public_key pk) {
        /* A = aB */
        ed25519_extsk(extsk, sk);
        expand256_modm(a, extsk, 32);
-       ge25519_scalarmult_base_niels(&A, a);
+       //ge25519_scalarmult_base_niels(&A, a);
        ge25519_pack(pk, &A);
 }

magnumripper commented 1 week ago

This compiles in 10.5s

diff --git a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
index 89838c542..c5b2b2043 100644
--- a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
+++ b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
@@ -129,6 +129,7 @@ ge25519_scalarmult_base_niels(ge25519 *r, const bignum256modm s) {
        contract256_window4_modm(b, s);

        ge25519_scalarmult_base_choose_niels(&t, 0, b[1]);
+#if 0
        curve25519_sub_reduce(r->x, t.xaddy, t.ysubx);
        curve25519_add_reduce(r->y, t.xaddy, t.ysubx);
        memset_p(r->z, 0, sizeof(bignum25519));
@@ -138,6 +139,7 @@ ge25519_scalarmult_base_niels(ge25519 *r, const bignum256modm s) {
                ge25519_scalarmult_base_choose_niels(&t, i / 2, b[i]);
                ge25519_nielsadd2(r, &t);
        }
+#endif
        ge25519_double_partial(r, r);
        ge25519_double_partial(r, r);
        ge25519_double_partial(r, r);

If I move the start or end of the #ifdef to narrow it down, problem comes back. I only allowed 12 seconds for building though, maybe I need to bump it a little. Edit: Tried 15s, no dice. Giving up here for now.

magnumripper commented 1 week ago

Giving up here for now.

Just kidding. The above was a red herring (stuff were probably optimized away). This compiles in 12.5s:

diff --git a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
index 89838c542..bb7713d2f 100644
--- a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
+++ b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
@@ -144,7 +144,7 @@ ge25519_scalarmult_base_niels(ge25519 *r, const bignum256modm s) {
        ge25519_double(r, r);
        ge25519_scalarmult_base_choose_niels(&t, 0, b[0]);
        curve25519_mul_const(t.t2d, t.t2d, ge25519_ecd);
-       ge25519_nielsadd2(r, &t);
+       //ge25519_nielsadd2(r, &t);
        for(i = 2; i < 64; i += 2) {
                ge25519_scalarmult_base_choose_niels(&t, i / 2, b[i]);
                ge25519_nielsadd2(r, &t);

Doesn't make sense: There are other calls to ge25519_nielsadd2() above and below it that doesn't have a problem.

magnumripper commented 1 week ago

Tried this reordering (that I think would have been non destructive) but it didn't help.

diff --git a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
index 89838c542..aa4034fd2 100644
--- a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
+++ b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
@@ -144,9 +144,9 @@ ge25519_scalarmult_base_niels(ge25519 *r, const bignum256modm s) {
        ge25519_double(r, r);
        ge25519_scalarmult_base_choose_niels(&t, 0, b[0]);
        curve25519_mul_const(t.t2d, t.t2d, ge25519_ecd);
-       ge25519_nielsadd2(r, &t);
        for(i = 2; i < 64; i += 2) {
-               ge25519_scalarmult_base_choose_niels(&t, i / 2, b[i]);
                ge25519_nielsadd2(r, &t);
+               ge25519_scalarmult_base_choose_niels(&t, i / 2, b[i]);
        }
+       ge25519_nielsadd2(r, &t);
 }

magnumripper commented 1 week ago

These alternatives also didn't help. I'm out of ideas. Will grab a pizza to boost my randomness-fu.

diff --git a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
index 89838c542..bc44fb0f2 100644
--- a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
+++ b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
@@ -145,8 +145,8 @@ ge25519_scalarmult_base_niels(ge25519 *r, const bignum256modm s) {
        ge25519_scalarmult_base_choose_niels(&t, 0, b[0]);
        curve25519_mul_const(t.t2d, t.t2d, ge25519_ecd);
        ge25519_nielsadd2(r, &t);
-       for(i = 2; i < 64; i += 2) {
-               ge25519_scalarmult_base_choose_niels(&t, i / 2, b[i]);
+       for(i = 1; i < 32; i++) {
+               ge25519_scalarmult_base_choose_niels(&t, i, b[i * 2]);
                ge25519_nielsadd2(r, &t);
        }
 }

diff --git a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
index 89838c542..9cae2a55c 100644
--- a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
+++ b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
@@ -144,9 +144,9 @@ ge25519_scalarmult_base_niels(ge25519 *r, const bignum256modm s) {
        ge25519_double(r, r);
        ge25519_scalarmult_base_choose_niels(&t, 0, b[0]);
        curve25519_mul_const(t.t2d, t.t2d, ge25519_ecd);
-       ge25519_nielsadd2(r, &t);
-       for(i = 2; i < 64; i += 2) {
-               ge25519_scalarmult_base_choose_niels(&t, i / 2, b[i]);
+       for(i = 1; i < 32; i++) {
                ge25519_nielsadd2(r, &t);
+               ge25519_scalarmult_base_choose_niels(&t, i, b[i * 2]);
        }
+       ge25519_nielsadd2(r, &t);
 }

solardiz commented 1 week ago

Maybe do something to prevent inlining of ge25519_nielsadd2, so it'd be compiled on its own. I wonder if e.g. __attribute__(noinline) (or some other syntax?) maybe just happens to work with recent NVIDIA compiler? (Never tried that. But it'd be useful for us to have a way to selectively prevent function inlining in OpenCL in general.)

claudioandre-br commented 1 week ago

BTW: not long ago, there was a discussion about inline hell in OpenCL. See, for example, https://gitlab.itp.uni-frankfurt.de/lattice-qcd/ag-philipsen/cl2qcd/-/issues/37

The most important aspects are:

In C99, inline has a meaning that applies to OpenCL;
inline keyword "alone" seems to be basically wrong in OpenCL;
compilers always inlined or they wrongly implemented (or simply ignored) the inline keyword;
but this is changing.

Support for a wide variety of compilers will be complicated or even impossible. Well, people with hardware can try.

magnumripper commented 1 week ago

Maybe do something to prevent inlining of ge25519_nielsadd2, so it'd be compiled on its own. I wonder if e.g. __attribute__(noinline) (or some other syntax?) maybe just happens to work with recent NVIDIA compiler? (Never tried that. But it'd be useful for us to have a way to selectively prevent function inlining in OpenCL in general.)

Lo and behold, that worked. Builds in 10.7 seconds and passes self-test. Speed for 2080ti is 405623 c/s.

diff --git a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
index 89838c542..3d72f50e5 100644
--- a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
+++ b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
@@ -51,6 +51,9 @@ ge25519_double(ge25519 *r, const ge25519 *p) {
        ge25519_p1p1_to_full(r, &t);
 }

+#if gpu_nvidia(DEVICE_INFO)
+__attribute__((noinline))
+#endif
 static void
 ge25519_nielsadd2(ge25519 *r, const ge25519_niels *q) {
        bignum25519 a,b,c,e,f,g,h;

solardiz commented 1 week ago

That's cool. I wonder if noinline'ing ge25519_scalarmult_base_niels instead would also do the trick? I think doing it for ge25519_nielsadd2 is more likely to have performance impact since some calls are in loops.

If there's almost no performance impact, we can keep it this way for all NVIDIA, otherwise we could limit to known-affected driver versions. Edit: we could also have to limit to driver versions that do support the attribute, as I guess older drivers didn't and could have compile issues for that reason?

magnumripper commented 1 week ago

Yeah speed on super's GTX 1080 (418.39) went from 176551 to 160627 c/s as I posted it. Will try moving it.

magnumripper commented 1 week ago

Noinlining ge25519_scalarmult_base_niels instead boosted it a little on 2080ti but still only 161418 c/s on super. I could go with DEV_VER_MAJOR > 500 for a bit of safety or just DEV_VER_MAJOR >= 535 for what we actually know.

diff --git a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
index 89838c542..5635ab9a6 100644
--- a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
+++ b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
@@ -121,6 +121,9 @@ ge25519_scalarmult_base_choose_niels(ge25519_niels *t, uint32_t pos, signed char

 /* computes [s]basepoint */
 static void
+#if gpu_nvidia(DEVICE_INFO) && DEV_VER_MAJOR > 500
+__attribute__((noinline))
+#endif
 ge25519_scalarmult_base_niels(ge25519 *r, const bignum256modm s) {
        signed char b[64];
        uint32_t i;

solardiz commented 1 week ago

speed on super's GTX 1080 (418.39) went from 176551 to 160627 c/s

It's weird this had so much effect - this format spends most time elsewhere. Are you able to reliably reproduce the original speed of 176k? Maybe it's unstable or something else changed?

magnumripper commented 1 week ago

I tested 2-3 times each and figures were pretty stable.

Here's an unstable figure for you though: I once, just now, got 507029 c/s on the 2080ti with newer drivers (no noinline). I still have it my scroll buffer. But repeated subsequent tests give only 405623-415713 c/s - even when manually selecting the LWS/GWS autotune picked for that single time faster run. Really weird. And no I am not forgetting the kernel cache clean.

magnumripper commented 1 week ago

As for caching, it appears we no longer need kernel-cache-clean after changing included files on the nvidias (Linux). I'm using 565.57.01 but I've had a feeling this happened a couple of versions ago. That's great, although I don't trust it yet.

openwall / john

tezos-opencl kernel build takes longer than 10 hours with recent nvidia drivers #5546