pfx-opencl self test failed with (cmp_all(5))

quangIO commented 1 year ago

When trying to use pfx-opencl to crack p12 files, I always see self-test cmp_all(5) error.

Running: john --test=0 --format=pfx-opencl --device=1 returns

Device 1: NVIDIA GeForce RTX XXXX Laptop GPU
Testing: pfx-opencl, (.pfx, .p12) [PKCS#12 PBE (SHA1/SHA-256/512) OpenCL]... FAILED (cmp_all(5))

❯ john --list=build-info
Version: 1.9.0-jumbo-1+bleeding-b548cd514b 2023-07-11 19:45:38 +0200
Build: linux-gnu 64-bit x86_64 AVX AC MPI + OMP OPENCL
SIMD: AVX, interleaving: MD4:3 MD5:3 SHA1:1 SHA256:1 SHA512:1
System-wide exec: /usr/bin
System-wide home: /usr/share/john
Private home: ~/.john
CPU tests: AVX
CPU fallback binary: john-non-avx
$JOHN is /usr/share/john/
Format interface version: 14
Max. number of reported tunable costs: 4
Rec file version: REC4
Charset file version: CHR3
CHARSET_MIN: 1 (0x01)
CHARSET_MAX: 255 (0xff)
CHARSET_LENGTH: 24
SALT_HASH_SIZE: 1048576
SINGLE_IDX_MAX: 2147483648
SINGLE_BUF_MAX: 4294967295
Effective limit: Number of salts vs. SingleMaxBufferSize
Max. Markov mode level: 400
Max. Markov mode password length: 30
gcc version: 13.1.1
GNU libc version: 2.37 (loaded: 2.37)
OpenCL headers version: 1.2
Crypto library: OpenSSL
OpenSSL library version: 030100010
OpenSSL 3.1.1 30 May 2023
GMP library version: 6.2.1
File locking: fcntl()
fseek(): fseek
ftell(): ftell
fopen(): fopen
memmem(): System's
times(2) sysconf(_SC_CLK_TCK) is 100
Using times(2) for timers, resolution 10 ms
HR timer: clock_gettime(), latency 59 ns
Total physical host memory: 31709 MiB
Available physical host memory: 27026 MiB
Terminal locale string: en_US.UTF-8
Parsed terminal locale: UTF-8

claudioandre-br commented 1 year ago

Thanks for reporting.

I didn't find anyone else complaining about this format (with NVIDIA). Could you tell us your version of the CUDA drivers?

Anyway, it looks like something in your version of CUDA is behaving badly. We'll investigate.

solardiz commented 1 year ago

I didn't find anyone else complaining about this format (with NVIDIA).

Not on NVIDIA, but a similar-looking failure (also at cmp_all(5)) was seen on AMD R9 390 with amdgpu-pro-20.10-1084971-ubuntu-18.04 or amdgpu-pro-20.40-1147286-ubuntu-20.04: https://github.com/openwall/john/issues/3610#issuecomment-982652301

quangIO commented 1 year ago

I am using nvidia-open driver with CUDA 12.2 on Arch Linux. Trying to use the nvidia driver didn't fix the issue either.

❯ nvidia-smi
Thu Jul 13 23:40:04 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |

solardiz commented 1 year ago

535.54.03 sounds very recent, probably more recent than anything we have tested so far. It could very well have introduced a new bug or revealed a "new" bug of ours. Can you please try the command below?

./john --test --device=1 --format=raw-sha512-opencl,bitcoin-opencl,pbkdf2-hmac-sha512,sha512crypt-opencl

Test vector 5 of pfx-opencl appears to be based on SHA-512, so I'm wondering if we possibly have a miscompile of SHA-512 that would also occur in other kernels.

quangIO commented 1 year ago

Device 1@linux.local: NVIDIA GeForce RTX 4070 Laptop GPU
Benchmarking: raw-SHA512-opencl [SHA512 OpenCL/mask accel]... LWS=256 GWS=9216 (36 blocks) x9500 DONE
Raw:    1244M c/s real, 1244M c/s virtual, Dev#1 util: 100%

Benchmarking: Bitcoin-opencl, Bitcoin Core [SHA512 AES OpenCL]... LWS=256 GWS=36864 (144 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    6467 c/s real, 35108 c/s virtual, Dev#1 util: 100%

Benchmarking: PBKDF2-HMAC-SHA512, GRUB2 / OS X 10.8+ [PBKDF2-SHA512 128/128 AVX 2x]... (20xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw:    17600 c/s real, 915 c/s virtual

Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]... LWS=256 GWS=9216 (36 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    220083 c/s real, 220083 c/s virtual, Dev#1 util: 99%

4 formats benchmarked.

It seems they are working just fine (?)

solardiz commented 1 year ago

It seems they are working just fine (?)

Yes. Can you please also test geli-opencl?

Separately, please try editing opencl/opencl_misc.h line 324, which says:

#define ALLOW_ALIASING_VIOLATIONS       1

Change 1 to 0 on that line. Then rm -rf ~/.nv/ComputeCache and re-test pfx-opencl.

solardiz commented 1 year ago

Reproduced what looks similar with old Intel OpenCL. In that case, the misbehavior is in hmac_sha512. Found and fixed several bugs in hmac_sha512 (including a potential out of bounds write!), but that didn't make a difference for me. Since I don't know if the issue seen by @quangIO is actually the same bug or not (maybe it just manifests itself similarly, but is a different bug), those fixes I already have might help here.

diff --git a/run/opencl/opencl_hmac_sha512.h b/run/opencl/opencl_hmac_sha512.h
index 4ad08ef..37b69b1 100644
--- a/run/opencl/opencl_hmac_sha512.h
+++ b/run/opencl/opencl_hmac_sha512.h
@@ -31,8 +31,10 @@ inline void hmac_sha512(HMAC_KEY_TYPE void *_key, uint key_len,
        HMAC_KEY_TYPE uchar *key = _key;
        HMAC_MSG_TYPE uchar *data = _data;
        HMAC_OUT_TYPE uchar *digest = _digest;
-       ulong pW[16];
-       uchar *buf = (uchar*)pW;
+       union {
+               ulong pW[16];
+               uchar buf[128];
+       } u;
        uchar local_digest[64];
        SHA512_CTX ctx;
        uint i;
@@ -53,40 +55,40 @@ inline void hmac_sha512(HMAC_KEY_TYPE void *_key, uint key_len,
 #else
                SHA512_Update(&ctx, key, key_len);
 #endif
-               SHA512_Final(buf, &ctx);
-               pW[0] ^= 0x3636363636363636UL;
-               pW[1] ^= 0x3636363636363636UL;
-               pW[2] ^= 0x3636363636363636UL;
-               pW[3] ^= 0x3636363636363636UL;
-               pW[4] ^= 0x3636363636363636UL;
-               pW[5] ^= 0x3636363636363636UL;
-               pW[6] ^= 0x3636363636363636UL;
-               pW[7] ^= 0x3636363636363636UL;
-               memset_p(&buf[64], 0x36, 128 - 64);
+               SHA512_Final(u.buf, &ctx);
+               u.pW[0] ^= 0x3636363636363636UL;
+               u.pW[1] ^= 0x3636363636363636UL;
+               u.pW[2] ^= 0x3636363636363636UL;
+               u.pW[3] ^= 0x3636363636363636UL;
+               u.pW[4] ^= 0x3636363636363636UL;
+               u.pW[5] ^= 0x3636363636363636UL;
+               u.pW[6] ^= 0x3636363636363636UL;
+               u.pW[7] ^= 0x3636363636363636UL;
+               memset_p(&u.buf[64], 0x36, 128 - 64);
        } else
 #endif
        {
-               memcpy_macro(buf, key, key_len);
-               memset_p(&buf[key_len], 0, 128 - key_len);
+               memcpy_macro(u.buf, key, key_len);
+               memset_p(&u.buf[key_len], 0, 128 - key_len);
                for (i = 0; i < 16; i++)
-                       pW[i] ^= 0x3636363636363636UL;
+                       u.pW[i] ^= 0x3636363636363636UL;
        }
        SHA512_Init(&ctx);
-       SHA512_Update(&ctx, buf, 128);
+       SHA512_Update(&ctx, u.buf, 128);
 #ifdef USE_DATA_BUF
-       HMAC_MSG_TYPE ulong *data32 = (HMAC_MSG_TYPE ulong*)_data;
-       ulong blocks = data_len / 128;
+       HMAC_MSG_TYPE ulong *data64 = (HMAC_MSG_TYPE ulong*)_data;
+       uint blocks = data_len / 128;
        data_len -= 128 * blocks;
        data += 128 * blocks;
        ctx.total += 128 * blocks;
        while (blocks--) {
                ulong W[16];
                for (i = 0; i < 16; i++)
-                       W[i] = SWAP64(data32[i]);
+                       W[i] = SWAP64(data64[i]);
                sha512_block(W, ctx.state);
-               data32 += 16;
+               data64 += 16;
        }
-       uchar pbuf[64];
+       uchar pbuf[128];
        memcpy_macro(pbuf, data, data_len);
        SHA512_Update(&ctx, pbuf, data_len);
 #else
@@ -94,9 +96,9 @@ inline void hmac_sha512(HMAC_KEY_TYPE void *_key, uint key_len,
 #endif
        SHA512_Final(local_digest, &ctx);
        for (i = 0; i < 16; i++)
-               pW[i] ^= (0x3636363636363636UL ^ 0x5c5c5c5c5c5c5c5cUL);
+               u.pW[i] ^= (0x3636363636363636UL ^ 0x5c5c5c5c5c5c5c5cUL);
        SHA512_Init(&ctx);
-       SHA512_Update(&ctx, buf, 128);
+       SHA512_Update(&ctx, u.buf, 128);
        SHA512_Update(&ctx, local_digest, 64);
        SHA512_Final(local_digest, &ctx);

diff --git a/run/opencl/pfx_kernel.cl b/run/opencl/pfx_kernel.cl
index 0ca0182..da2fb4a 100644
--- a/run/opencl/pfx_kernel.cl
+++ b/run/opencl/pfx_kernel.cl
@@ -42,7 +42,10 @@ typedef struct {
        uint32_t saltlen;
        uint32_t salt[20 / 4];
        uint32_t datalen;
-       uint32_t data[MAX_DATA_LENGTH / 4];
+       union {
+               uint u32[MAX_DATA_LENGTH / 4]; /* Same type as hmac_sha1() and hmac_sha256() use */
+               ulong u64[MAX_DATA_LENGTH / 8]; /* Same type as hmac_sha512() uses */
+       } data;
 } pfx_salt;

 inline void pfx_crypt(__global const uint *password, uint32_t password_length,
@@ -51,7 +54,7 @@ inline void pfx_crypt(__global const uint *password, uint32_t password_length,
        uint i;
        uint32_t ckey[64 / 4];
        uint32_t csalt[20 / 4];
-       uint32_t cpassword[(PLAINTEXT_LENGTH + 1 + 3) / 4];
+       uint32_t cpassword[(PLAINTEXT_LENGTH + 3) / 4];

        for (i = 0; i < (password_length + 3) / 4; i++)
                cpassword[i] = password[i];
@@ -63,19 +66,23 @@ inline void pfx_crypt(__global const uint *password, uint32_t password_length,
        case 1:
                pkcs12_pbe_derive_key(salt->iterations, 3, cpassword, password_length,
                                      csalt, salt->saltlen, ckey, salt->keylen);
-               hmac_sha1(ckey, salt->keylen, salt->data, salt->datalen, out, 20);
+               hmac_sha1(ckey, salt->keylen, salt->data.u32, salt->datalen, out, 20);
                break;
        case 256:
                pkcs12_pbe_derive_key_sha256(salt->iterations, 3, cpassword,
                                             password_length, csalt, salt->saltlen,
                                             ckey, salt->keylen);
-               hmac_sha256(ckey, salt->keylen, salt->data, salt->datalen, out, 20);
+               hmac_sha256(ckey, salt->keylen, salt->data.u32, salt->datalen, out, 20);
                break;
        case 512:
                pkcs12_pbe_derive_key_sha512(salt->iterations, 3, cpassword,
                                             password_length, csalt, salt->saltlen,
                                             ckey, salt->keylen);
-               hmac_sha512(ckey, salt->keylen, salt->data, salt->datalen, out, 20);
+printf("ckey = %08x\n", ckey[0]);
+               hmac_sha512(ckey, salt->keylen, salt->data.u64, salt->datalen, out, 20);
+printf("out = %08x\n", out[0]);
+               hmac_sha512(ckey, salt->keylen, salt->data.u64, salt->datalen, out, 20);
+printf("out = %08x\n", out[0]);
                break;
        }
 }

Here, the change in size of pbuf is the important fix (64 was apparently a leftover from copy-pasting the corresponding SHA-1 and SHA-256 code). Most of the remaining changes are to avoid aliasing violations (but SHA512_Final would still have some on NVIDIA by default). We'll probably want to make similar changes to avoid aliasing violations in the SHA-1 and SHA-256 HMACs as well.

Then there's debugging output and a duplicate invocation of hmac_sha512 for testing. Somehow, for me with the old Intel OpenCL, repeated calls to that function with the same inputs produce different results (could well be a bug in that OpenCL backend). On NVIDIA, this just passes tests for me, both before and after all of these changes.

solardiz commented 1 year ago

Somehow copying salt->saltlen into a local scope variable made this run correctly on the old Intel OpenCL. Probably a bug in the OpenCL backend, after all. But that doesn't mean we didn't have more bugs here.

claudioandre-br commented 1 year ago

Reproduced what looks similar with old Intel OpenCL.

That is interesting.

It passes self-test on CPU using [intel-site]/13793/l_opencl_p_18.1.0.013.tgz driver. The latest, as far as I know.

Version: 1.9.0-jumbo-1+bleeding-71bbe16d64 2023-07-13 09:49:41 -0300
Build: linux-gnu 64-bit x86_64 AVX512BW AC OMP OPENCL
SIMD: AVX512BW, interleaving: MD4:3 MD5:3 SHA1:1 SHA256:1 SHA512:1
[...]

[...]

Testing: pfx-opencl, (.pfx, .p12) [PKCS#12 PBE (SHA1/SHA-256/512) OpenCL]... Build log: Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <pfx> was successfully vectorized (16)
Done.
PASS

solardiz commented 1 year ago

@quangIO Can you please try the fixes from #5339?

git clone https://github.com/solardiz/john
cd john
git checkout fixes-20230713

quangIO commented 1 year ago

Thanks @solardiz for the quick response. Your patch fixed the issue for me!

solardiz commented 1 year ago

@quangIO Thank you for testing. I've just merged the patch. We'd appreciate it if you also run an all-formats test with the latest code and report any issues, e.g. ./john --test=0 --format=opencl --device=1 or even ./john --test --device=1 (would also include CPU formats and benchmarks).

solardiz commented 1 year ago

I'll close this issue now, but please feel free to add relevant comments. If further testing reveals any issues with other formats, please open a separate issue (or several, if they look unrelated). Thank you!

openwall / john

pfx-opencl self test failed with (cmp_all(5)) #5337