Closed magnumripper closed 1 week ago
With 418.39 (edit: and GTX 1080), a complete john -test=0 -form:tezos-opencl
takes ~20 seconds.
Same problem with 535.183.01. I think I'll try to let it finish and see if it ever does.
I think I'll try to let it finish and see if it ever does.
I gave up and aborted after 10 hours. Meanwhile I tried an Apple M1 laptop and with nothing cached it completes a time john -test=0 -form:tezos-opencl
in seven seconds wall clock (and it does pass the test). On the intel macbook (three devices) the CPU runs it in 12s, the UHD 630 in 35s and the AMD Vega is very slow (aborted at 25m).
BTW while doing this I discovered that with the M1, macOS has binary caching that we don't include in "make kernel-cache-clean" (subsequent tests complete in half a second) and I have no idea yet where it is. I tried touching the files, or even dropping the whole run/opencl directory and git reset
but it still used the cache. Finally I edited the tezos kernel very slightly and that did invalidate the cache. After reverting to the unmodfied tezos kernel, that one was still also cached.
A huge problem with this is that it (just like all others I know of) doesn't detect changes in included source files so until I find out how to clear the cache, OpenCL coding or even testing on that laptop is unfeasible. On the intel macbook, our kernel-cache-clean does the trick.
Thank you for noticing and working on these issues, @magnumripper! All of this is new to me - I've never seen builds of this kernel take anywhere that long (I think it was always under a minute), and I was unaware of this new cache on macOS.
Is this the only kernel we have for which the build (presumably) gets stuck with those drivers? As I understood from your comments, this happens with recent NVIDIA and AMD drivers - that's especially weird.
Is this the only kernel we have for which the build (presumably) gets stuck
On Linux, yes. I disabled that format and a -test=0 for all OpenCL formats took reasonably short time.
I now downgraded to 470.256.02 and this problem went away. Note that the AMD problem I had was macOS only and that driver is so bad it's nothing to care about IMO.
FWIW I found out how to drop the build cache from the M1. Unfortunately it needs privileges: sudo killall CVMServer
. This works most of the time, sometimes I need to do it again (even though confirming no CVMServer were left running).
Not sure if/how to add that to our kernel-cache-clean target. Anyhow this means the caches shouldn't survive a reboot so the problem isn't (quite) as bad as I thought.
Anyhow this means the caches shouldn't survive a reboot
Yet they do :-( and I'm giving up here for now.
@magnumripper Are you going to try and find a workaround for tezos-opencl not freezing the build with new NVIDIA drivers? That would help.
Let's imagine that it's impossible to fix the problem, so we need to disable the format using something like this:
diff --git a/src/opencl_tezos_fmt_plug.c b/src/opencl_tezos_fmt_plug.c
index b16ca268c..b991e44c7 100644
--- a/src/opencl_tezos_fmt_plug.c
+++ b/src/opencl_tezos_fmt_plug.c
@@ -166,8 +166,18 @@ static void create_clobj(size_t kpc, struct fmt_main *self)
static void init(struct fmt_main *_self)
{
+ int major, minor;
+
self = _self;
opencl_prepare_dev(gpu_id);
+
+ opencl_driver_value(gpu_id, &major, &minor);
+
+ if (major == 5 && minor == 0) { // && macOS && ...
+ fprintf(stderr,
+ "The OpenCL driver in use cannot run this kernel. Please, change your driver!\n");
+ error();
+ }
}
static void reset(struct db_main *db)
The problem that arises from this solution is that I'm not just disabling a format, I'm making it impossible from now on to run a --test=0 --format=opencl
, i.e. the below will fail in an unwanted way:
$ ../run/john --test --format=tezos,tezos-opencl,raw-sha256
Will run 8 OpenMP threads
Benchmarking: tezos, Tezos Key [PBKDF2-SHA512 256/256 AVX2 4x]... (8xOMP) DONE
Raw: 2154 c/s real, 344 c/s virtual
Device 1: cpu-haswell-AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx
The OpenCL driver in use cannot run this kernel. Please, update your driver!
Is it feasible to disable a format after init() through a small/localized change?
I need the driver version during runtime and I want to silently disable (turn off) only one format.
@magnumripper Are you going to try and find a workaround for tezos-opencl not freezing the build with new NVIDIA drivers? That would help.
Not sure where to start really, except the usual "throw things at it and see what sticks". It's obviously a compiler bug and with some luck it will go away with later drivers.
Also, I'm "struggling" with final details before contributing pdf-opencl (will open a ticket with a RFC on how to proceed) and after that I should try merging my Argon2 support for Keepass and Keepass-opencl that stalled a year ago. I had it 99% finished when Alain suddenly contributed argon2-opencl, making my keepass-argon2 branch impossible to rebase without non-trivial conflicts. I'm not complaining, I didn't give anyone any heads-up myself, lol. I should probably rewrite Keepass-opencl to use his Argon2 stuff but I have a hard time following his code.
Is it feasible to disable a format after init() through a small/localized change?
A dirty hack that comes to mind is to check (only the first time this code runs) in valid() just before it'd return non-zero, print a message, and return zero. This wouldn't be strictly "after init()" - in fact, for actual cracking runs it'd be "before init()".
Not sure where to start really, except the usual "throw things at it and see what sticks". It's obviously a compiler bug
Yes. Something like bisecting our code to see what triggers the bug. It's no problem if during such bisecting the computation would be wrong (failing self-test) - the point is to see what gets past the compiler and what does not.
I should try merging my Argon2 support for Keepass and Keepass-opencl that stalled a year ago. I had it 99% finished
Please open an issue for that, so that we'd remember to ask you e.g. before making a release. Thank you!
First clue (this compiles in 6s)
diff --git a/run/opencl/tezos_kernel.cl b/run/opencl/tezos_kernel.cl
index 2625f0d9e..8ad99ed02 100644
--- a/run/opencl/tezos_kernel.cl
+++ b/run/opencl/tezos_kernel.cl
@@ -207,7 +207,7 @@ __kernel void pbkdf2_sha512_tezos_final(__global const crack_t *in, __constant t
memcpy_macro(sk.u64, in[idx].hash, 4);
for (int i = 0; i < 4; i++)
sk.u64[i] = SWAP64(sk.u64[i]);
- ed25519_publickey(sk.uc, pk);
+ //ed25519_publickey(sk.uc, pk);
blake2b(pk, 20, NULL, 0, pk, sizeof(pk)); /* Replace pk with pkh */
if (!memcmp_pc(pk, gsalt->pkh, 20)) {
atomic_inc(out);
This compiles in 6s
diff --git a/run/opencl/ed25519-donna/ed25519-donna.c b/run/opencl/ed25519-donna/ed25519-donna.c
index 09ef8f9f5..a05432b88 100644
--- a/run/opencl/ed25519-donna/ed25519-donna.c
+++ b/run/opencl/ed25519-donna/ed25519-donna.c
@@ -29,6 +29,6 @@ ed25519_publickey(const ed25519_secret_key sk, ed25519_public_key pk) {
/* A = aB */
ed25519_extsk(extsk, sk);
expand256_modm(a, extsk, 32);
- ge25519_scalarmult_base_niels(&A, a);
+ //ge25519_scalarmult_base_niels(&A, a);
ge25519_pack(pk, &A);
}
This compiles in 10.5s
diff --git a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
index 89838c542..c5b2b2043 100644
--- a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
+++ b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
@@ -129,6 +129,7 @@ ge25519_scalarmult_base_niels(ge25519 *r, const bignum256modm s) {
contract256_window4_modm(b, s);
ge25519_scalarmult_base_choose_niels(&t, 0, b[1]);
+#if 0
curve25519_sub_reduce(r->x, t.xaddy, t.ysubx);
curve25519_add_reduce(r->y, t.xaddy, t.ysubx);
memset_p(r->z, 0, sizeof(bignum25519));
@@ -138,6 +139,7 @@ ge25519_scalarmult_base_niels(ge25519 *r, const bignum256modm s) {
ge25519_scalarmult_base_choose_niels(&t, i / 2, b[i]);
ge25519_nielsadd2(r, &t);
}
+#endif
ge25519_double_partial(r, r);
ge25519_double_partial(r, r);
ge25519_double_partial(r, r);
If I move the start or end of the #ifdef
to narrow it down, problem comes back. I only allowed 12 seconds for building though, maybe I need to bump it a little. Edit: Tried 15s, no dice. Giving up here for now.
Giving up here for now.
Just kidding. The above was a red herring (stuff were probably optimized away). This compiles in 12.5s:
diff --git a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
index 89838c542..bb7713d2f 100644
--- a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
+++ b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
@@ -144,7 +144,7 @@ ge25519_scalarmult_base_niels(ge25519 *r, const bignum256modm s) {
ge25519_double(r, r);
ge25519_scalarmult_base_choose_niels(&t, 0, b[0]);
curve25519_mul_const(t.t2d, t.t2d, ge25519_ecd);
- ge25519_nielsadd2(r, &t);
+ //ge25519_nielsadd2(r, &t);
for(i = 2; i < 64; i += 2) {
ge25519_scalarmult_base_choose_niels(&t, i / 2, b[i]);
ge25519_nielsadd2(r, &t);
Doesn't make sense: There are other calls to ge25519_nielsadd2()
above and below it that doesn't have a problem.
Tried this reordering (that I think would have been non destructive) but it didn't help.
diff --git a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
index 89838c542..aa4034fd2 100644
--- a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
+++ b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
@@ -144,9 +144,9 @@ ge25519_scalarmult_base_niels(ge25519 *r, const bignum256modm s) {
ge25519_double(r, r);
ge25519_scalarmult_base_choose_niels(&t, 0, b[0]);
curve25519_mul_const(t.t2d, t.t2d, ge25519_ecd);
- ge25519_nielsadd2(r, &t);
for(i = 2; i < 64; i += 2) {
- ge25519_scalarmult_base_choose_niels(&t, i / 2, b[i]);
ge25519_nielsadd2(r, &t);
+ ge25519_scalarmult_base_choose_niels(&t, i / 2, b[i]);
}
+ ge25519_nielsadd2(r, &t);
}
These alternatives also didn't help. I'm out of ideas. Will grab a pizza to boost my randomness-fu.
diff --git a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
index 89838c542..bc44fb0f2 100644
--- a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
+++ b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
@@ -145,8 +145,8 @@ ge25519_scalarmult_base_niels(ge25519 *r, const bignum256modm s) {
ge25519_scalarmult_base_choose_niels(&t, 0, b[0]);
curve25519_mul_const(t.t2d, t.t2d, ge25519_ecd);
ge25519_nielsadd2(r, &t);
- for(i = 2; i < 64; i += 2) {
- ge25519_scalarmult_base_choose_niels(&t, i / 2, b[i]);
+ for(i = 1; i < 32; i++) {
+ ge25519_scalarmult_base_choose_niels(&t, i, b[i * 2]);
ge25519_nielsadd2(r, &t);
}
}
diff --git a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
index 89838c542..9cae2a55c 100644
--- a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
+++ b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
@@ -144,9 +144,9 @@ ge25519_scalarmult_base_niels(ge25519 *r, const bignum256modm s) {
ge25519_double(r, r);
ge25519_scalarmult_base_choose_niels(&t, 0, b[0]);
curve25519_mul_const(t.t2d, t.t2d, ge25519_ecd);
- ge25519_nielsadd2(r, &t);
- for(i = 2; i < 64; i += 2) {
- ge25519_scalarmult_base_choose_niels(&t, i / 2, b[i]);
+ for(i = 1; i < 32; i++) {
ge25519_nielsadd2(r, &t);
+ ge25519_scalarmult_base_choose_niels(&t, i, b[i * 2]);
}
+ ge25519_nielsadd2(r, &t);
}
Maybe do something to prevent inlining of ge25519_nielsadd2
, so it'd be compiled on its own. I wonder if e.g. __attribute__(noinline)
(or some other syntax?) maybe just happens to work with recent NVIDIA compiler? (Never tried that. But it'd be useful for us to have a way to selectively prevent function inlining in OpenCL in general.)
BTW: not long ago, there was a discussion about inline
hell in OpenCL. See, for example, https://gitlab.itp.uni-frankfurt.de/lattice-qcd/ag-philipsen/cl2qcd/-/issues/37
The most important aspects are:
inline
has a meaning that applies to OpenCL;inline
keyword "alone" seems to be basically wrong in OpenCL;inline
keyword;Support for a wide variety of compilers will be complicated or even impossible. Well, people with hardware can try.
Maybe do something to prevent inlining of
ge25519_nielsadd2
, so it'd be compiled on its own. I wonder if e.g.__attribute__(noinline)
(or some other syntax?) maybe just happens to work with recent NVIDIA compiler? (Never tried that. But it'd be useful for us to have a way to selectively prevent function inlining in OpenCL in general.)
Lo and behold, that worked. Builds in 10.7 seconds and passes self-test. Speed for 2080ti is 405623 c/s.
diff --git a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
index 89838c542..3d72f50e5 100644
--- a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
+++ b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
@@ -51,6 +51,9 @@ ge25519_double(ge25519 *r, const ge25519 *p) {
ge25519_p1p1_to_full(r, &t);
}
+#if gpu_nvidia(DEVICE_INFO)
+__attribute__((noinline))
+#endif
static void
ge25519_nielsadd2(ge25519 *r, const ge25519_niels *q) {
bignum25519 a,b,c,e,f,g,h;
That's cool. I wonder if noinline'ing ge25519_scalarmult_base_niels
instead would also do the trick? I think doing it for ge25519_nielsadd2
is more likely to have performance impact since some calls are in loops.
If there's almost no performance impact, we can keep it this way for all NVIDIA, otherwise we could limit to known-affected driver versions. Edit: we could also have to limit to driver versions that do support the attribute, as I guess older drivers didn't and could have compile issues for that reason?
Yeah speed on super's GTX 1080 (418.39) went from 176551 to 160627 c/s as I posted it. Will try moving it.
Noinlining ge25519_scalarmult_base_niels
instead boosted it a little on 2080ti but still only 161418 c/s on super. I could go with DEV_VER_MAJOR > 500
for a bit of safety or just DEV_VER_MAJOR >= 535
for what we actually know.
diff --git a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
index 89838c542..5635ab9a6 100644
--- a/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
+++ b/run/opencl/ed25519-donna/ed25519-donna-impl-base.h
@@ -121,6 +121,9 @@ ge25519_scalarmult_base_choose_niels(ge25519_niels *t, uint32_t pos, signed char
/* computes [s]basepoint */
static void
+#if gpu_nvidia(DEVICE_INFO) && DEV_VER_MAJOR > 500
+__attribute__((noinline))
+#endif
ge25519_scalarmult_base_niels(ge25519 *r, const bignum256modm s) {
signed char b[64];
uint32_t i;
speed on super's GTX 1080 (418.39) went from 176551 to 160627 c/s
It's weird this had so much effect - this format spends most time elsewhere. Are you able to reliably reproduce the original speed of 176k? Maybe it's unstable or something else changed?
I tested 2-3 times each and figures were pretty stable.
Here's an unstable figure for you though: I once, just now, got 507029 c/s on the 2080ti with newer drivers (no noinline). I still have it my scroll buffer. But repeated subsequent tests give only 405623-415713 c/s - even when manually selecting the LWS/GWS autotune picked for that single time faster run. Really weird. And no I am not forgetting the kernel cache clean.
As for caching, it appears we no longer need kernel-cache-clean
after changing included files on the nvidias (Linux). I'm using 565.57.01 but I've had a feeling this happened a couple of versions ago. That's great, although I don't trust it yet.
This was during a
--test=0 --format=opencl
(edit: and a 2080ti) with driver 550.107.02. I killed it after 45 minutes with 100% CPU. Did we see this before?Also tried
-test=0 -form:tezos-opencl
by itself, same thing but I interrupted sooner. Will try other drivers (as long as they are readily available as Ubuntu packages).