Closed: solardiz closed this issue 4 years ago.
Just thinking out loud: When a format test (out of several) fails, such as at the first CL_OUT_OF_HOST_MEMORY, are we tearing that format down fully? Perhaps we're never calling the format's done() when it bails?
Perhaps we're never calling the format's done() when it bails?
This seems to be the problem. I'm not quite sure where to put it (it might be needed in several places) nor if there may be dragons (if format is already in limbo, trying to tear it down may well go downhill).
Here's the most obvious one:
diff --git a/src/formats.c b/src/formats.c
index ec66c6b74..9fde6af4f 100644
--- a/src/formats.c
+++ b/src/formats.c
@@ -1646,6 +1646,9 @@ char *fmt_self_test(struct fmt_main *format, struct db_main *db)
 	retval = fmt_self_test_body(format, binary_copy, salt_copy, db, benchmark_level);
+	if (retval && format->methods.done)
+		format->methods.done();
+
 	self_test_running = 0;
 	MEM_FREE(salt_alloc);
Thanks, magnum. While this is probably something we need to fix, I doubt it's what caused the specific failure I reported. In that run, the very first failing format was gpg-opencl, and it failed with CL_OUT_OF_HOST_MEMORY, so memory must have leaked before that point, which means the leak wasn't connected to any format failing a test (which I guess is what you meant by "it bails").
I sort of reproduced this on "super" by testing all OpenCL formats on the Titan X. This started to fail very close to the end:
Benchmarking: XSHA512-opencl, Mac OS X 10.7 salted [SHA512 OpenCL/mask accel]... LWS=128 GWS=3072 (24 blocks) x9500 0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:677 - failed in clEnqueueNDRangeKernel I
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:677 - failed in clEnqueueNDRangeKernel I
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:677 - failed in clEnqueueNDRangeKernel I
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:677 - failed in clEnqueueNDRangeKernel I
[...]
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:677 - failed in clEnqueueNDRangeKernel I
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:677 - failed in clEnqueueNDRangeKernel I
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:677 - failed in clEnqueueNDRangeKernel I
DONE
Warning: "Many salts" test limited: 205/256
Many salts: 314880 c/s real, 310226 c/s virtual
Only one salt: 311785 c/s real, 311785 c/s virtual
[New Thread 0x7ffae21fc700 (LWP 20632)]
[New Thread 0x7ffae17fb700 (LWP 20633)]
[New Thread 0x7ffae0dfa700 (LWP 20634)]
[New Thread 0x7ffad7fff700 (LWP 20635)]
[New Thread 0x7ffad75fe700 (LWP 20636)]
[New Thread 0x7ffad6bfd700 (LWP 20637)]
[New Thread 0x7ffad61fc700 (LWP 20638)]
Benchmarking: zed-opencl, Prim'X Zed! encrypted archives [PKCS#12 PBE (SHA1/SHA256) OpenCL]... LWS=32 GWS=6144 (192 blocks) DONE
Speed for cost 1 (iteration count) of 200000, cost 2 (hash-func [21:SHA1 22:SHA256]) of 22
Raw: 7554 c/s real, 7462 c/s virtual, Dev#5 util: 100%
[New Thread 0x7ffad57fb700 (LWP 20657)]
[New Thread 0x7ffad4dfa700 (LWP 20658)]
[Thread 0x7ffad4dfa700 (LWP 20658) exited]
[Thread 0x7ffad57fb700 (LWP 20657) exited]
0: Error creating context for device 4 (2:1): CL_OUT_OF_HOST_MEMORY (-6), giving up
The above was run under gdb, but I think this didn't affect anything (except for the extra output about threads).
I managed to notice and run nvidia-smi when I saw the many CL_MEM_OBJECT_ALLOCATION_FAILURE messages above. It said:
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 00000000:84:00.0 Off | N/A |
| 30% 63C P2 91W / 250W | 12196MiB / 12212MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 19946 C ...olar/j/bleeding-jumbo-20200826/run/john 12165MiB |
+-----------------------------------------------------------------------------+
I also ran free almost right away, which said:
[solar@super src]$ free
total used free shared buffers cached
Mem: 132264592 53486488 78778104 410100 420736 43574132
-/+ buffers/cache: 9491620 122772972
Swap: 0 0 0
So it looks like we were actually running out of GPU memory, not host memory.
I guess the failure happened sooner on V100 because of much higher GWS figures.
BTW, "Error creating context for device 4 (2:1)" probably means we forgot to update this (and a nearby) message to 1-based device numbers, right? We probably should. Also, the "0: " is node number, which we'd normally not report when there's only one node.
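The 1-based fix suggested here could look like the following minimal sketch. All identifiers are illustrative, not the actual john source (which formats this message elsewhere), and whether the platform index should also become 1-based is my guess:

```c
#include <stdio.h>

/* Hypothetical helper: format the context-creation error with 1-based
 * device/platform numbers, matching what users see in device listings.
 * The function name and parameters are made up for illustration. */
static int format_context_error(char *buf, size_t len, int node,
                                int seq_id, int platform, int dev)
{
	/* +1 throughout: the internal indices are 0-based */
	return snprintf(buf, len,
	    "%d: Error creating context for device %d (%d:%d): "
	    "CL_OUT_OF_HOST_MEMORY (-6), giving up",
	    node, seq_id + 1, platform + 1, dev + 1);
}
```

With the 0-based values from the log above (device 4, platform 2, device-in-platform 1), this sketch would report "device 5 (3:2)" instead.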
Reproduced the same crash at the same format when benchmarking all formats (CPU and OpenCL). So CPU format benchmarks don't play a role here. Also re-confirmed that only GPU memory was exhausted: 12 GB on GPU vs. ~9.5 GB on host (~7.5 GB by the john process), again. Both GPU and host memory usage grew almost steadily with each OpenCL format benchmarked, so the leak doesn't appear to be specific to just a handful of formats.
Both GPU and host memory usage grew almost steadily with each OpenCL format benchmarked, so the leak doesn't appear to be specific to just a handful of formats.
Interesting. This should be a relatively new bug - I'm pretty sure I've checked for things like this in the past.
The memory leak is present even when running with fixed LWS/GWS (so not in auto-tune?) and even with tiny GWS:
$ LWS=64 GWS=64 ./john -test -form=opencl
[...]
Device 1: GeForce GTX 570
Benchmarking: sha1crypt-opencl, (NetBSD) [PBKDF1-SHA1 OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 64000 and 40000
Raw: 141 c/s real, 140 c/s virtual
Benchmarking: KeePass-opencl [SHA256 AES/Twofish/ChaCha OpenCL]... 0: OpenCL CL_OUT_OF_RESOURCES (-5) error in opencl_keepass_fmt_plug.c:275 - Copy result back
FAILED (cmp_all(-1))
Benchmarking: oldoffice-opencl, MS Office <= 2003 [MD5/SHA1 RC4 OpenCL/mask accel]... LWS=64 GWS=64 (1 blocks) x95 DONE
Speed for cost 1 (hash type) of 1 and 0
Raw: 547502 c/s real, 547502 c/s virtual
Benchmarking: PBKDF2-HMAC-MD4-opencl [PBKDF2-MD4 OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iterations) of 1000
Raw: 10252 c/s real, 10252 c/s virtual
Benchmarking: PBKDF2-HMAC-MD5-opencl [PBKDF2-MD5 OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iterations) of 1000
Raw: 6176 c/s real, 6176 c/s virtual
Benchmarking: PBKDF2-HMAC-SHA1-opencl [PBKDF2-SHA1 OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iterations) of 1000
Raw: 6941 c/s real, 6906 c/s virtual
Benchmarking: rar-opencl, RAR3 (length 5) [SHA1 OpenCL AES]... (8xOMP) LWS=64 GWS=64 (1 blocks) DONE
Raw: 119 c/s real, 108 c/s virtual
Benchmarking: RAR5-opencl [PBKDF2-SHA256 OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 32768
Raw: 64.6 c/s real, 64.6 c/s virtual
Benchmarking: TrueCrypt-opencl [RIPEMD160 AES256_XTS OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Raw: 604 c/s real, 601 c/s virtual
Benchmarking: lotus5-opencl, Lotus Notes/Domino 5 [OpenCL]... 0: OpenCL CL_OUT_OF_RESOURCES (-5) error in opencl_lotus5_fmt_plug.c:257 - Failed to read buffer cl_tx_binary.
FAILED (cmp_all(-1))
Benchmarking: AndroidBackup-opencl [PBKDF2-SHA1 AES OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 10000
Raw: 352 c/s real, 350 c/s virtual
Benchmarking: agilekeychain-opencl, 1Password Agile Keychain [PBKDF2-SHA1 AES OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 1000
Raw: 7068 c/s real, 7033 c/s virtual
Benchmarking: ansible-opencl, Ansible Vault [PBKDF2-SHA256 HMAC-SHA256 OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 10000
Raw: 212 c/s real, 211 c/s virtual
Benchmarking: axcrypt-opencl [SHA1 AES OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 1337 and 60000
Raw: 51.8 c/s real, 51.8 c/s virtual
Benchmarking: axcrypt2-opencl, AxCrypt 2.x [PBKDF2-SHA512 AES OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 28200 and 23652
Raw: 13.2 c/s real, 13.1 c/s virtual
LWS=8 GWS=64 Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw: 686 c/s real, 686 c/s virtual
Benchmarking: BitLocker-opencl, BitLocker [SHA256 AES OpenCL]... 0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_bitlocker_fmt_plug.c:395 - Run kernel
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_bitlocker_fmt_plug.c:479 - clEnqueueWriteBuffer
FAILED (cmp_all(-1))
Benchmarking: bitwarden-opencl, Bitwarden Password Manager [PBKDF2-SHA256 AES OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw: 422 c/s real, 420 c/s virtual
Benchmarking: blockchain-opencl, blockchain My Wallet (v2 x5000) [PBKDF2-SHA1 AES OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Raw: 718 c/s real, 711 c/s virtual
0: Error creating context for device 0 (0:0): CL_OUT_OF_HOST_MEMORY (-6), giving up
(this was run on the ancient GPU so it would exhaust its memory sooner)
I was watching nvidia-smi as the above ran, and GPU memory usage was steadily growing and was very close to the maximum available at the time of the final failure.
Separately, also puzzling are the failures of KeePass-opencl and lotus5-opencl. These occur even when those formats are benchmarked individually, so are unrelated to the memory leak.
Edit: when running on this device with auto-tune enabled (and thus much higher GWS and higher speeds), the failures look the same - even the memory is finally exhausted at the same format. Benchmarking BitLocker-opencl on its own takes ages, but eventually succeeds, so the failure above is because of prior memory leaks.
The problem is fully seen at this commit:
commit bd7b8267764bfda5bce9df40cf1a6dfcf89e0329
Author: magnum <john.magnum@hushmail.com>
AuthorDate: Wed Jul 17 14:57:05 2019 +0200
Commit: magnum <john.magnum@hushmail.com>
CommitDate: Sun Sep 1 23:58:50 2019 +0200
Unify the reset() functions in all OpenCL formats that use shared
auto-tune, moving any heuristics to shared code. Closes #4012
It is mostly gone when I revert further, to the immediately preceding commit:
commit 30b1ff74dbea9d8fac0b2f95c3800b277cbc1b81
Author: magnum <john.magnum@hushmail.com>
AuthorDate: Fri Apr 5 12:04:39 2019 +0200
Commit: magnum <john.magnum@hushmail.com>
CommitDate: Sun Sep 1 23:58:50 2019 +0200
Benchmark using mask mode (and possibly GPU-side acceleration), unless
new option --no-mask is given along with --test. Closes #3697
bd7b8267764bfda5bce9df40cf1a6dfcf89e0329 changes the implementations of done() to only free things if (program[gpu_id]), but I see nothing in the corresponding implementations of reset(), which allocate things, that would guarantee that program[gpu_id] is set. Prior to this commit, the variable autotuned was used instead, and that one is obviously set by autotune_run(), which is invoked from reset().
What's the rationale for changing from autotuned to program[gpu_id] in reset()? Is that part correct, or maybe also wrong?
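To make the guard change concrete, here is a hedged, stand-alone model of the two variants being compared. Real formats free kernels, buffers, and the program; here simple ints stand in for the OpenCL handles:

```c
/* Simplified model of the done() guard before and after bd7b826.
 * program[] and autotuned are stand-ins for the real OpenCL state. */
static int program[1];  /* models program[gpu_id] (a cl_program handle) */
static int gpu_id = 0;
static int autotuned;   /* pre-bd7b826 flag, set by autotune_run() */

static void release_clobj(void) { /* frees device buffers in real code */ }

/* Before bd7b826: guarded by the autotuned flag */
static void done_old(void)
{
	if (autotuned) {
		release_clobj();
		autotuned = 0;
	}
}

/* After bd7b826: guarded by program[gpu_id], which is only set once
 * opencl_init() has built the program */
static void done_new(void)
{
	if (program[gpu_id]) {
		release_clobj();
		program[gpu_id] = 0;
	}
}
```

The question above boils down to this: if reset() can allocate anything before the program is built, done_new() would skip the release and leak, whereas done_old() would have caught it as long as autotune_run() had run.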
@magnumripper Please comment on this.
bd7b826 changes implementations of done() to only free things if (program[gpu_id]), but I see nothing in the corresponding implementations of reset(), which allocate things, that would guarantee that program[gpu_id] is set.
Which format were you looking at? With 80 files "unified", it's possible I screwed up some.
The first file shown for that commit is opencl_7z_fmt_plug.c, and picking that as an example, the logic seems fine to me. Am I missing something?
Hmm, it's kinda messy to follow, but opencl_init() calls opencl_build_kernel() and so on, ultimately calling clCreateProgramWithSource() and setting program[gpu_id] to the result.
Which format were you looking at?
They looked pretty much the same to me, so all or most. I'm fine using opencl_7z_fmt_plug.c as an example. During testing, there's a significant increase in device memory usage e.g. during the RAR formats' benchmark; after this commit, that increase persists into the TrueCrypt benchmark, resulting in even higher memory usage there.
I just can't see why it would happen, if it wasn't happening before that commit. Is my logic flawed? The way I see it, we can't create/release resources without having the "program" initialized. And if we do have a program, things should be deallocated and the program torn down in done().
After fixing #4351 (PR coming), this is easily reproduced with any single format and the --stress-test option. Should be easy to throw in some debug prints and nail it. I can take it from here if you wish.
I can take it from here if you wish.
Please do. Thanks!
For now only looking at the RAR format. There was a minor leak where the very first reset() would cause allocations (of GWS 49) that were never released (due to reset() being called one extra time, which we should allow). That is not the main problem, but it should be fixed in all formats:
diff --git a/src/opencl_rar_fmt_plug.c b/src/opencl_rar_fmt_plug.c
index 820e1fcff..0a2ddacf9 100644
--- a/src/opencl_rar_fmt_plug.c
+++ b/src/opencl_rar_fmt_plug.c
@@ -278,7 +278,9 @@ static void init(struct fmt_main *_self)
 static void reset(struct db_main *db)
 {
-	if (!program[gpu_id]) {
+	if (program[gpu_id])
+		release_clobj();
+	else {
 		char build_opts[64];
 		snprintf(build_opts, sizeof(build_opts), "-DPLAINTEXT_LENGTH=%u -DHASH_LOOPS=0x%x", PLAINTEXT_LENGTH, HASH_LOOPS);
After fixing that, we never allocate new buffers without releasing any existing ones, and we never create program/kernels/etc. without having torn down any previous ones. I threw in debug counters and pointer checks to assert that. Still, there's this major memory leak.
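The effect of that fix can be modeled with a tiny counter-based sketch. Assumptions: create_clobj() allocates one generation of buffers and release_clobj() frees one; none of this is the real format code:

```c
/* Counter model of the reset() fix above: without the release on
 * re-entry, each extra reset() leaks one generation of buffers. */
static int program[1];   /* models program[gpu_id] */
static int gpu_id = 0;
static int live_buffers; /* how many buffer generations currently exist */

static void opencl_init(void)   { program[gpu_id] = 1; }
static void create_clobj(void)  { live_buffers++; }
static void release_clobj(void) { live_buffers--; }

/* Pre-fix: only initializes on the first call, never releases old buffers */
static void reset_leaky(void)
{
	if (!program[gpu_id])
		opencl_init();
	create_clobj();
}

/* Post-fix, as in the diff: release existing buffers on re-entry */
static void reset_fixed(void)
{
	if (program[gpu_id])
		release_clobj();
	else
		opencl_init();
	create_clobj();
}
```

Calling reset_leaky() twice leaves live_buffers at 2, while two calls to reset_fixed() leave it at 1, matching the "called one extra time" scenario described above.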
Need to think. Here's what we create:
- ... (opencl_init, from the format's reset)
- ... (reset)
- ... (create_clobj)
I'm pretty darn sure all of the above is correctly torn down, so is there anything more? @claudioandre-br ?
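The debug-counter approach mentioned earlier can be sketched like this. Hedged: the real code would wrap the actual clCreate*/clRelease* calls; here plain counters stand in, and all names are made up:

```c
#include <assert.h>

/* Balance counters per resource class; a leak shows up as a nonzero
 * count at teardown time. */
static int programs_live, kernels_live, buffers_live;

static void dbg_create_program(void)  { programs_live++; }
static void dbg_release_program(void) { programs_live--; }
static void dbg_create_kernel(void)   { kernels_live++; }
static void dbg_release_kernel(void)  { kernels_live--; }
static void dbg_create_buffer(void)   { buffers_live++; }
static void dbg_release_buffer(void)  { buffers_live--; }

/* Call at the end of done(): everything created must have been released */
static void dbg_assert_balanced(void)
{
	assert(programs_live == 0);
	assert(kernels_live == 0);
	assert(buffers_live == 0);
}
```

This only proves that creates and releases pair up inside the format's own code; it wouldn't catch allocations made behind the scenes by the OpenCL runtime itself.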
Note, BTW, that I'm not yet sure the diff is correct regarding the things #4012 addressed in case of mask mode. For my current bug hunting, it's fine.
I'm pretty darn sure all of the above is correctly torn down, so is there anything more? @claudioandre-br ?
I will debug it today (using --stress-test and sha256, fast and slow formats) to see if I find anything.
I found a leak. An important one. I would say it is related to how/when reset() is called (a relatively recent change?).
But there is something more.
Okay, I can see that the author of the format (me) did something wrong. And that's it! Everything fine now.
FTR, I'm testing using the following (dies at approximately iteration 30):
make -sj4 && LWS=1024 GWS=163840 $JtR --format=sha256crypt-opencl -dev=4 -stress-test --no-mask
I will:
Then, we will know if all formats need a fix.
Iteration 200 (edited: I actually ran up to 400) uses the same amount of memory as iteration 1 does.
| 0 GeForce GTX 1080 Off | 00000000:01:00.0 Off | N/A |
| 0% 93C P2 70W / 180W | 315MiB / 8119MiB | 100% Default |
I'm still interested in discussing reset() and how it works (it is a question; there are some blanks in my mind):
reset()
  autotune() // memory allocated.
  -> what happens here? [1]
reset() (the real one, I guess)
  autotune() // memory allocated again.
magnum?
There is one leak left because I don't know what I can (or should) do in [1] above.
LOL, I randomly picked RAR-opencl, which was a very bad choice for finding problems in shared OpenCL code because it has a leak of its own, unrelated to this issue. I'll fix that and carry on with this.
I will continue my testing based on having #4354 merged in my tree.
I'm still interested in discussing reset() and how it works (it is a question; there are some blanks in my mind):
reset()
  autotune() // memory allocated.
  -> what happens here? [1]
reset() (the real one, I guess)
  autotune() // memory allocated again.
Here are all situations for calling reset() that I can think of right now:
FWIW, this is fixed. One format (perhaps a few) still has its own open issue.
"Error creating context for device 4 (2:1)" probably means we forgot to update this (and a nearby) message to 1-based device numbers, right?
Yes.
"Error creating context for device 4 (2:1)" probably means we forgot to update this (and a nearby) message to 1-based device numbers, right?
Yes.
Will you take care of this, @claudioandre-br? If we don't fix this right away, then perhaps we should create an issue for it, or we'd forget again.
When --test --format=opencl on a p3.2xlarge AWS instance doesn't crash on #4294, it crashes later with CL_OUT_OF_HOST_MEMORY:
Notice how everything was OK until the gpg-opencl test failed with CL_OUT_OF_HOST_MEMORY, and a bit later the entire run ended with that same error, occurring apparently just before another format would be benchmarked.
Benchmarking gpg-opencl on its own works fine:
Maybe we have an OpenCL host memory leak across tests?
This VM has ~60 GB of host memory, and the GPU has 16 GB. I tried monitoring the usage of both while rerunning the tests, but bumped into #4294, so my tests never reached as far as the above while I was monitoring. The highest memory usage I saw was ~3 GB on host and ~8 GB on device (by the way, why so much?)