Closed quangIO closed 1 year ago
Thanks for reporting.
I didn't find anyone else complaining about this format (with NVIDIA). Could you tell us your version of the CUDA drivers?
Anyway, it looks like something in your version of CUDA is behaving badly. We'll investigate.
I didn't find anyone else complaining about this format (with NVIDIA).
Not on NVIDIA, but a similar-looking failure (also at cmp_all(5)) was seen on AMD R9 390 with amdgpu-pro-20.10-1084971-ubuntu-18.04 or amdgpu-pro-20.40-1147286-ubuntu-20.04:
https://github.com/openwall/john/issues/3610#issuecomment-982652301
I am using the nvidia-open driver with CUDA 12.2 on Arch Linux. Trying to use the nvidia driver didn't fix the issue either.
❯ nvidia-smi
Thu Jul 13 23:40:04 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
535.54.03 sounds very recent, probably more recent than anything we've tested so far. It could very well have introduced a new bug or revealed a "new" bug of ours. Can you please try the command below?
./john --test --device=1 --format=raw-sha512-opencl,bitcoin-opencl,pbkdf2-hmac-sha512,sha512crypt-opencl
Test vector 5 of pfx-opencl appears to be based on SHA-512, so I'm wondering if we possibly have a miscompile of SHA-512 that would also occur in other kernels.
Device 1@linux.local: NVIDIA GeForce RTX 4070 Laptop GPU
Benchmarking: raw-SHA512-opencl [SHA512 OpenCL/mask accel]... LWS=256 GWS=9216 (36 blocks) x9500 DONE
Raw: 1244M c/s real, 1244M c/s virtual, Dev#1 util: 100%
Benchmarking: Bitcoin-opencl, Bitcoin Core [SHA512 AES OpenCL]... LWS=256 GWS=36864 (144 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw: 6467 c/s real, 35108 c/s virtual, Dev#1 util: 100%
Benchmarking: PBKDF2-HMAC-SHA512, GRUB2 / OS X 10.8+ [PBKDF2-SHA512 128/128 AVX 2x]... (20xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw: 17600 c/s real, 915 c/s virtual
Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]... LWS=256 GWS=9216 (36 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw: 220083 c/s real, 220083 c/s virtual, Dev#1 util: 99%
4 formats benchmarked.
It seems they are working just fine (?)
It seems they are working just fine (?)
Yes. Can you please also test geli-opencl?
Separately, please try editing opencl/opencl_misc.h line 324, which says:
#define ALLOW_ALIASING_VIOLATIONS 1
Change 1 to 0 on that line. Then rm -rf ~/.nv/ComputeCache and re-test pfx-opencl.
Reproduced what looks similar with old Intel OpenCL. In that case, the misbehavior is in hmac_sha512. Found and fixed several bugs in hmac_sha512 (including a potential out-of-bounds write!), but that didn't make a difference for me. Since I don't know if the issue seen by @quangIO is actually the same bug or not (maybe it just manifests itself similarly, but is a different bug), those fixes I already have might help here.
diff --git a/run/opencl/opencl_hmac_sha512.h b/run/opencl/opencl_hmac_sha512.h
index 4ad08ef..37b69b1 100644
--- a/run/opencl/opencl_hmac_sha512.h
+++ b/run/opencl/opencl_hmac_sha512.h
@@ -31,8 +31,10 @@ inline void hmac_sha512(HMAC_KEY_TYPE void *_key, uint key_len,
HMAC_KEY_TYPE uchar *key = _key;
HMAC_MSG_TYPE uchar *data = _data;
HMAC_OUT_TYPE uchar *digest = _digest;
- ulong pW[16];
- uchar *buf = (uchar*)pW;
+ union {
+ ulong pW[16];
+ uchar buf[128];
+ } u;
uchar local_digest[64];
SHA512_CTX ctx;
uint i;
@@ -53,40 +55,40 @@ inline void hmac_sha512(HMAC_KEY_TYPE void *_key, uint key_len,
#else
SHA512_Update(&ctx, key, key_len);
#endif
- SHA512_Final(buf, &ctx);
- pW[0] ^= 0x3636363636363636UL;
- pW[1] ^= 0x3636363636363636UL;
- pW[2] ^= 0x3636363636363636UL;
- pW[3] ^= 0x3636363636363636UL;
- pW[4] ^= 0x3636363636363636UL;
- pW[5] ^= 0x3636363636363636UL;
- pW[6] ^= 0x3636363636363636UL;
- pW[7] ^= 0x3636363636363636UL;
- memset_p(&buf[64], 0x36, 128 - 64);
+ SHA512_Final(u.buf, &ctx);
+ u.pW[0] ^= 0x3636363636363636UL;
+ u.pW[1] ^= 0x3636363636363636UL;
+ u.pW[2] ^= 0x3636363636363636UL;
+ u.pW[3] ^= 0x3636363636363636UL;
+ u.pW[4] ^= 0x3636363636363636UL;
+ u.pW[5] ^= 0x3636363636363636UL;
+ u.pW[6] ^= 0x3636363636363636UL;
+ u.pW[7] ^= 0x3636363636363636UL;
+ memset_p(&u.buf[64], 0x36, 128 - 64);
} else
#endif
{
- memcpy_macro(buf, key, key_len);
- memset_p(&buf[key_len], 0, 128 - key_len);
+ memcpy_macro(u.buf, key, key_len);
+ memset_p(&u.buf[key_len], 0, 128 - key_len);
for (i = 0; i < 16; i++)
- pW[i] ^= 0x3636363636363636UL;
+ u.pW[i] ^= 0x3636363636363636UL;
}
SHA512_Init(&ctx);
- SHA512_Update(&ctx, buf, 128);
+ SHA512_Update(&ctx, u.buf, 128);
#ifdef USE_DATA_BUF
- HMAC_MSG_TYPE ulong *data32 = (HMAC_MSG_TYPE ulong*)_data;
- ulong blocks = data_len / 128;
+ HMAC_MSG_TYPE ulong *data64 = (HMAC_MSG_TYPE ulong*)_data;
+ uint blocks = data_len / 128;
data_len -= 128 * blocks;
data += 128 * blocks;
ctx.total += 128 * blocks;
while (blocks--) {
ulong W[16];
for (i = 0; i < 16; i++)
- W[i] = SWAP64(data32[i]);
+ W[i] = SWAP64(data64[i]);
sha512_block(W, ctx.state);
- data32 += 16;
+ data64 += 16;
}
- uchar pbuf[64];
+ uchar pbuf[128];
memcpy_macro(pbuf, data, data_len);
SHA512_Update(&ctx, pbuf, data_len);
#else
@@ -94,9 +96,9 @@ inline void hmac_sha512(HMAC_KEY_TYPE void *_key, uint key_len,
#endif
SHA512_Final(local_digest, &ctx);
for (i = 0; i < 16; i++)
- pW[i] ^= (0x3636363636363636UL ^ 0x5c5c5c5c5c5c5c5cUL);
+ u.pW[i] ^= (0x3636363636363636UL ^ 0x5c5c5c5c5c5c5c5cUL);
SHA512_Init(&ctx);
- SHA512_Update(&ctx, buf, 128);
+ SHA512_Update(&ctx, u.buf, 128);
SHA512_Update(&ctx, local_digest, 64);
SHA512_Final(local_digest, &ctx);
diff --git a/run/opencl/pfx_kernel.cl b/run/opencl/pfx_kernel.cl
index 0ca0182..da2fb4a 100644
--- a/run/opencl/pfx_kernel.cl
+++ b/run/opencl/pfx_kernel.cl
@@ -42,7 +42,10 @@ typedef struct {
uint32_t saltlen;
uint32_t salt[20 / 4];
uint32_t datalen;
- uint32_t data[MAX_DATA_LENGTH / 4];
+ union {
+ uint u32[MAX_DATA_LENGTH / 4]; /* Same type as hmac_sha1() and hmac_sha256() use */
+ ulong u64[MAX_DATA_LENGTH / 8]; /* Same type as hmac_sha512() uses */
+ } data;
} pfx_salt;
inline void pfx_crypt(__global const uint *password, uint32_t password_length,
@@ -51,7 +54,7 @@ inline void pfx_crypt(__global const uint *password, uint32_t password_length,
uint i;
uint32_t ckey[64 / 4];
uint32_t csalt[20 / 4];
- uint32_t cpassword[(PLAINTEXT_LENGTH + 1 + 3) / 4];
+ uint32_t cpassword[(PLAINTEXT_LENGTH + 3) / 4];
for (i = 0; i < (password_length + 3) / 4; i++)
cpassword[i] = password[i];
@@ -63,19 +66,23 @@ inline void pfx_crypt(__global const uint *password, uint32_t password_length,
case 1:
pkcs12_pbe_derive_key(salt->iterations, 3, cpassword, password_length,
csalt, salt->saltlen, ckey, salt->keylen);
- hmac_sha1(ckey, salt->keylen, salt->data, salt->datalen, out, 20);
+ hmac_sha1(ckey, salt->keylen, salt->data.u32, salt->datalen, out, 20);
break;
case 256:
pkcs12_pbe_derive_key_sha256(salt->iterations, 3, cpassword,
password_length, csalt, salt->saltlen,
ckey, salt->keylen);
- hmac_sha256(ckey, salt->keylen, salt->data, salt->datalen, out, 20);
+ hmac_sha256(ckey, salt->keylen, salt->data.u32, salt->datalen, out, 20);
break;
case 512:
pkcs12_pbe_derive_key_sha512(salt->iterations, 3, cpassword,
password_length, csalt, salt->saltlen,
ckey, salt->keylen);
- hmac_sha512(ckey, salt->keylen, salt->data, salt->datalen, out, 20);
+printf("ckey = %08x\n", ckey[0]);
+ hmac_sha512(ckey, salt->keylen, salt->data.u64, salt->datalen, out, 20);
+printf("out = %08x\n", out[0]);
+ hmac_sha512(ckey, salt->keylen, salt->data.u64, salt->datalen, out, 20);
+printf("out = %08x\n", out[0]);
break;
}
}
Here, the change in size of pbuf is the important fix (64 was apparently leftover from copy-paste from the corresponding SHA-1 and SHA-256 code). Most of the remaining changes are to avoid aliasing violations (but SHA512_Final would still have some on NVIDIA by default). We'll probably want to make similar changes to avoid aliasing violations in the SHA-1 and SHA-256 HMACs as well.
Then there's debugging output and a duplicate invocation of hmac_sha512 for testing. Somehow, for me with the old Intel OpenCL, repeated calls to that function with the same inputs produce different results (could well be a bug in that OpenCL backend). On NVIDIA, this just passes tests for me, both before and after all of these changes.
Somehow copying salt->saltlen into a local scope variable made this run correctly on the old Intel OpenCL. Probably a bug in the OpenCL backend, after all. But that doesn't mean we didn't have more bugs here.
Reproduced what looks similar with old Intel OpenCL.
That is interesting.
It passes self-test on CPU using the [intel-site]/13793/l_opencl_p_18.1.0.013.tgz driver. The latest, as far as I know.
Version: 1.9.0-jumbo-1+bleeding-71bbe16d64 2023-07-13 09:49:41 -0300
Build: linux-gnu 64-bit x86_64 AVX512BW AC OMP OPENCL
SIMD: AVX512BW, interleaving: MD4:3 MD5:3 SHA1:1 SHA256:1 SHA512:1
[...]
[...]
Testing: pfx-opencl, (.pfx, .p12) [PKCS#12 PBE (SHA1/SHA-256/512) OpenCL]... Build log: Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <pfx> was successfully vectorized (16)
Done.
PASS
@quangIO Can you please try the fixes from #5339?
git clone https://github.com/solardiz/john
cd john
git checkout fixes-20230713
Thanks @solardiz for the quick response. Your patch fixed the issue for me!
@quangIO Thank you for testing. I've just merged the patch. We'd appreciate it if you also run an all-formats test with the latest code and report any issues, e.g. ./john --test=0 --format=opencl --device=1 or even ./john --test --device=1 (would also include CPU formats and benchmarks).
I'll close this issue now, but please feel free to add relevant comments. If further testing reveals any issues with other formats, please open a separate issue (or several, if they look unrelated). Thank you!
When trying to use pfx-opencl to crack p12 files, I always see a self-test cmp_all(5) error. Running:
john --test=0 --format=pfx-opencl --device=1
returns