Closed: solardiz closed this issue 4 years ago.
Just thinking out loud: When a format test (out of several) fails, such as at the first CL_OUT_OF_HOST_MEMORY, are we tearing that format down fully? Perhaps we're never calling the format's done() when it bails?
Perhaps we're never calling the format's done() when it bails?
This seems to be the problem. I'm not quite sure where to put it (it might be needed in several places) nor if there may be dragons (if format is already in limbo, trying to tear it down may well go downhill).
Here's the most obvious one:
diff --git a/src/formats.c b/src/formats.c
index ec66c6b74..9fde6af4f 100644
--- a/src/formats.c
+++ b/src/formats.c
@@ -1646,6 +1646,9 @@ char *fmt_self_test(struct fmt_main *format, struct db_main *db)
 	retval = fmt_self_test_body(format, binary_copy, salt_copy, db, benchmark_level);
+	if (retval && format->methods.done)
+		format->methods.done();
+
 	self_test_running = 0;
 	MEM_FREE(salt_alloc);
Thanks, magnum. While this is probably something we need to fix, I doubt it's what caused the specific failure I reported. In that run, the very first failing format was gpg-opencl, and it failed with CL_OUT_OF_HOST_MEMORY, so memory must have leaked before that point, which means the leak wasn't connected to any format failing a test (which I guess is what you meant by "it bails").
I sort of reproduced this on "super" by testing all OpenCL formats on the Titan X. This started to fail very close to the end:
Benchmarking: XSHA512-opencl, Mac OS X 10.7 salted [SHA512 OpenCL/mask accel]... LWS=128 GWS=3072 (24 blocks) x9500 0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:677 - failed in clEnqueueNDRangeKernel I
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:677 - failed in clEnqueueNDRangeKernel I
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:677 - failed in clEnqueueNDRangeKernel I
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:677 - failed in clEnqueueNDRangeKernel I
[...]
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:677 - failed in clEnqueueNDRangeKernel I
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:677 - failed in clEnqueueNDRangeKernel I
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:677 - failed in clEnqueueNDRangeKernel I
DONE
Warning: "Many salts" test limited: 205/256
Many salts: 314880 c/s real, 310226 c/s virtual
Only one salt: 311785 c/s real, 311785 c/s virtual
[New Thread 0x7ffae21fc700 (LWP 20632)]
[New Thread 0x7ffae17fb700 (LWP 20633)]
[New Thread 0x7ffae0dfa700 (LWP 20634)]
[New Thread 0x7ffad7fff700 (LWP 20635)]
[New Thread 0x7ffad75fe700 (LWP 20636)]
[New Thread 0x7ffad6bfd700 (LWP 20637)]
[New Thread 0x7ffad61fc700 (LWP 20638)]
Benchmarking: zed-opencl, Prim'X Zed! encrypted archives [PKCS#12 PBE (SHA1/SHA256) OpenCL]... LWS=32 GWS=6144 (192 blocks) DONE
Speed for cost 1 (iteration count) of 200000, cost 2 (hash-func [21:SHA1 22:SHA256]) of 22
Raw: 7554 c/s real, 7462 c/s virtual, Dev#5 util: 100%
[New Thread 0x7ffad57fb700 (LWP 20657)]
[New Thread 0x7ffad4dfa700 (LWP 20658)]
[Thread 0x7ffad4dfa700 (LWP 20658) exited]
[Thread 0x7ffad57fb700 (LWP 20657) exited]
0: Error creating context for device 4 (2:1): CL_OUT_OF_HOST_MEMORY (-6), giving up
The above was run under gdb, but I think this didn't affect anything (except for the extra output about threads).
I managed to notice and run nvidia-smi when I saw the many CL_MEM_OBJECT_ALLOCATION_FAILURE messages above. It said:
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 00000000:84:00.0 Off | N/A |
| 30% 63C P2 91W / 250W | 12196MiB / 12212MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 19946 C ...olar/j/bleeding-jumbo-20200826/run/john 12165MiB |
+-----------------------------------------------------------------------------+
I also ran free almost right away, which said:
[solar@super src]$ free
total used free shared buffers cached
Mem: 132264592 53486488 78778104 410100 420736 43574132
-/+ buffers/cache: 9491620 122772972
Swap: 0 0 0
So it looks like we were actually running out of GPU memory, not host memory.
I guess the failure happened sooner on V100 because of much higher GWS figures.
BTW, "Error creating context for device 4 (2:1)" probably means we forgot to update this (and a nearby) message to 1-based device numbers, right? We probably should. Also, the "0: " is node number, which we'd normally not report when there's only one node.
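The 1-based fix suggested here could look like the following minimal sketch. All identifiers are illustrative, not the actual john source (which formats this message elsewhere), and whether the platform index should also become 1-based is my guess:

```c
#include <stdio.h>

/* Hypothetical helper: format the context-creation error with 1-based
 * device/platform numbers, matching what users see in device listings.
 * The function name and parameters are made up for illustration. */
static int format_context_error(char *buf, size_t len, int node,
                                int seq_id, int platform, int dev)
{
	/* +1 throughout: the internal indices are 0-based */
	return snprintf(buf, len,
	    "%d: Error creating context for device %d (%d:%d): "
	    "CL_OUT_OF_HOST_MEMORY (-6), giving up",
	    node, seq_id + 1, platform + 1, dev + 1);
}
```

With the 0-based values from the log above (device 4, platform 2, device-in-platform 1), this sketch would report "device 5 (3:2)" instead.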
Reproduced the same crash at the same format when benchmarking all formats (CPU and OpenCL). So CPU format benchmarks don't play a role here. Also re-confirmed that only GPU memory was exhausted: 12 GB on GPU vs. ~9.5 GB on host (~7.5 GB by the john process), again. Both GPU and host memory usage grew almost steadily with each OpenCL format benchmarked, so the leak doesn't appear to be specific to just a handful of formats.
Both GPU and host memory usage grew almost steadily with each OpenCL format benchmarked, so the leak doesn't appear to be specific to just a handful of formats.
Interesting. This should be a relatively new bug - I'm pretty sure I've checked for things like this in the past.
The memory leak is present even when running with fixed LWS/GWS (so not in auto-tune?) and even with tiny GWS:
$ LWS=64 GWS=64 ./john -test -form=opencl
[...]
Device 1: GeForce GTX 570
Benchmarking: sha1crypt-opencl, (NetBSD) [PBKDF1-SHA1 OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 64000 and 40000
Raw: 141 c/s real, 140 c/s virtual
Benchmarking: KeePass-opencl [SHA256 AES/Twofish/ChaCha OpenCL]... 0: OpenCL CL_OUT_OF_RESOURCES (-5) error in opencl_keepass_fmt_plug.c:275 - Copy result back
FAILED (cmp_all(-1))
Benchmarking: oldoffice-opencl, MS Office <= 2003 [MD5/SHA1 RC4 OpenCL/mask accel]... LWS=64 GWS=64 (1 blocks) x95 DONE
Speed for cost 1 (hash type) of 1 and 0
Raw: 547502 c/s real, 547502 c/s virtual
Benchmarking: PBKDF2-HMAC-MD4-opencl [PBKDF2-MD4 OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iterations) of 1000
Raw: 10252 c/s real, 10252 c/s virtual
Benchmarking: PBKDF2-HMAC-MD5-opencl [PBKDF2-MD5 OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iterations) of 1000
Raw: 6176 c/s real, 6176 c/s virtual
Benchmarking: PBKDF2-HMAC-SHA1-opencl [PBKDF2-SHA1 OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iterations) of 1000
Raw: 6941 c/s real, 6906 c/s virtual
Benchmarking: rar-opencl, RAR3 (length 5) [SHA1 OpenCL AES]... (8xOMP) LWS=64 GWS=64 (1 blocks) DONE
Raw: 119 c/s real, 108 c/s virtual
Benchmarking: RAR5-opencl [PBKDF2-SHA256 OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 32768
Raw: 64.6 c/s real, 64.6 c/s virtual
Benchmarking: TrueCrypt-opencl [RIPEMD160 AES256_XTS OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Raw: 604 c/s real, 601 c/s virtual
Benchmarking: lotus5-opencl, Lotus Notes/Domino 5 [OpenCL]... 0: OpenCL CL_OUT_OF_RESOURCES (-5) error in opencl_lotus5_fmt_plug.c:257 - Failed to read buffer cl_tx_binary.
FAILED (cmp_all(-1))
Benchmarking: AndroidBackup-opencl [PBKDF2-SHA1 AES OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 10000
Raw: 352 c/s real, 350 c/s virtual
Benchmarking: agilekeychain-opencl, 1Password Agile Keychain [PBKDF2-SHA1 AES OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 1000
Raw: 7068 c/s real, 7033 c/s virtual
Benchmarking: ansible-opencl, Ansible Vault [PBKDF2-SHA256 HMAC-SHA256 OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 10000
Raw: 212 c/s real, 211 c/s virtual
Benchmarking: axcrypt-opencl [SHA1 AES OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 1337 and 60000
Raw: 51.8 c/s real, 51.8 c/s virtual
Benchmarking: axcrypt2-opencl, AxCrypt 2.x [PBKDF2-SHA512 AES OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 28200 and 23652
Raw: 13.2 c/s real, 13.1 c/s virtual
LWS=8 GWS=64 Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw: 686 c/s real, 686 c/s virtual
Benchmarking: BitLocker-opencl, BitLocker [SHA256 AES OpenCL]... 0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_bitlocker_fmt_plug.c:395 - Run kernel
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_bitlocker_fmt_plug.c:479 - clEnqueueWriteBuffer
FAILED (cmp_all(-1))
Benchmarking: bitwarden-opencl, Bitwarden Password Manager [PBKDF2-SHA256 AES OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw: 422 c/s real, 420 c/s virtual
Benchmarking: blockchain-opencl, blockchain My Wallet (v2 x5000) [PBKDF2-SHA1 AES OpenCL]... LWS=64 GWS=64 (1 blocks) DONE
Raw: 718 c/s real, 711 c/s virtual
0: Error creating context for device 0 (0:0): CL_OUT_OF_HOST_MEMORY (-6), giving up
(this was run on the ancient GPU so it would exhaust its memory sooner)
I was watching nvidia-smi as the above ran, and GPU memory usage was steadily growing and was very close to the maximum available at the time of the final failure.
Separately, also puzzling are the failures of KeePass-opencl and lotus5-opencl. These occur even when those formats are benchmarked individually, so are unrelated to the memory leak.
Edit: when running on this device with auto-tune enabled (and thus much higher GWS and higher speeds), the failures look the same - even the memory is finally exhausted at the same format. Benchmarking BitLocker-opencl on its own takes ages, but eventually succeeds, so the failure above is because of prior memory leaks.
The problem is fully seen at this commit:
commit bd7b8267764bfda5bce9df40cf1a6dfcf89e0329
Author: magnum <john.magnum@hushmail.com>
AuthorDate: Wed Jul 17 14:57:05 2019 +0200
Commit: magnum <john.magnum@hushmail.com>
CommitDate: Sun Sep 1 23:58:50 2019 +0200
Unify the reset() functions in all OpenCL formats that use shared
auto-tune, moving any heuristics to shared code. Closes #4012
It is mostly gone when I revert further, to the immediately preceding commit:
commit 30b1ff74dbea9d8fac0b2f95c3800b277cbc1b81
Author: magnum <john.magnum@hushmail.com>
AuthorDate: Fri Apr 5 12:04:39 2019 +0200
Commit: magnum <john.magnum@hushmail.com>
CommitDate: Sun Sep 1 23:58:50 2019 +0200
Benchmark using mask mode (and possibly GPU-side acceleration), unless
new option --no-mask is given along with --test. Closes #3697
bd7b8267764bfda5bce9df40cf1a6dfcf89e0329 changes the implementations of done() to only free things if (program[gpu_id]), but I see nothing in the corresponding implementations of reset(), which allocate things, that would guarantee that program[gpu_id] is set. Prior to this commit, the variable autotuned was used instead, and that one is obviously set by autotune_run(), which is invoked from reset().
What's the rationale for changing from autotuned to program[gpu_id] in reset()? Is that part correct, or maybe also wrong?
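To make the guard change concrete, here is a hedged, stand-alone model of the two variants being compared. Real formats free kernels, buffers, and the program; here simple ints stand in for the OpenCL handles:

```c
/* Simplified model of the done() guard before and after bd7b826.
 * program[] and autotuned are stand-ins for the real OpenCL state. */
static int program[1];  /* models program[gpu_id] (a cl_program handle) */
static int gpu_id = 0;
static int autotuned;   /* pre-bd7b826 flag, set by autotune_run() */

static void release_clobj(void) { /* frees device buffers in real code */ }

/* Before bd7b826: guarded by the autotuned flag */
static void done_old(void)
{
	if (autotuned) {
		release_clobj();
		autotuned = 0;
	}
}

/* After bd7b826: guarded by program[gpu_id], which is only set once
 * opencl_init() has built the program */
static void done_new(void)
{
	if (program[gpu_id]) {
		release_clobj();
		program[gpu_id] = 0;
	}
}
```

The question above boils down to this: if reset() can allocate anything before the program is built, done_new() would skip the release and leak, whereas done_old() would have caught it as long as autotune_run() had run.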
@magnumripper Please comment on this.
bd7b826 changes implementations of done() to only free things if (program[gpu_id]), but I see nothing in the corresponding implementations of reset(), which allocate things, that would guarantee that program[gpu_id] is set.
Which format were you looking at? With 80 files "unified", it's possible I screwed up some.
The first file shown for that commit is opencl_7z_fmt_plug.c, and picking that as an example, the logic seems fine to me. Am I missing something?
Hmm, it's kinda messy to follow, but opencl_init() calls opencl_build_kernel() and so on, ultimately calling clCreateProgramWithSource() and setting program[gpu_id] to the result.
Which format were you looking at?
They looked pretty much the same to me, so all or most. I'm fine using opencl_7z_fmt_plug.c as an example. During testing, there's a significant increase in device memory usage e.g. during the RAR formats' benchmark; after this commit, that increase persists into the TrueCrypt benchmark, resulting in even higher memory usage there.
I just can't see why it would happen, if it wasn't happening before that commit. Is my logic flawed? The way I see it, we can't create/release resources without having the "program" initialized. And if we do have a program, things should be deallocated and the program torn down in done().
After fixing #4351 (PR coming), this is easily reproduced with any single format and the --stress-test option. Should be easy to throw in some debug prints and nail it. I can take it from here if you wish.
I can take it from here if you wish.
Please do. Thanks!
For now only looking at the RAR format. There was a minor leak where the very first reset() would cause allocations (of GWS 49) that were never released (due to reset() being called one extra time, which we should allow). That is not the main problem, but it should be fixed in all formats:
diff --git a/src/opencl_rar_fmt_plug.c b/src/opencl_rar_fmt_plug.c
index 820e1fcff..0a2ddacf9 100644
--- a/src/opencl_rar_fmt_plug.c
+++ b/src/opencl_rar_fmt_plug.c
@@ -278,7 +278,9 @@ static void init(struct fmt_main *_self)
 static void reset(struct db_main *db)
 {
-	if (!program[gpu_id]) {
+	if (program[gpu_id])
+		release_clobj();
+	else {
 		char build_opts[64];
 		snprintf(build_opts, sizeof(build_opts), "-DPLAINTEXT_LENGTH=%u -DHASH_LOOPS=0x%x", PLAINTEXT_LENGTH, HASH_LOOPS);
After fixing that, we never allocate new buffers without releasing any existing ones, and we never create program/kernels/etc. without having torn down any previous ones. I threw in debug counters and pointer checks to assert that. Still, there's this major memory leak.
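The effect of that fix can be modeled with a tiny counter-based sketch. Assumptions: create_clobj() allocates one generation of buffers and release_clobj() frees one; none of this is the real format code:

```c
/* Counter model of the reset() fix above: without the release on
 * re-entry, each extra reset() leaks one generation of buffers. */
static int program[1];   /* models program[gpu_id] */
static int gpu_id = 0;
static int live_buffers; /* how many buffer generations currently exist */

static void opencl_init(void)   { program[gpu_id] = 1; }
static void create_clobj(void)  { live_buffers++; }
static void release_clobj(void) { live_buffers--; }

/* Pre-fix: only initializes on the first call, never releases old buffers */
static void reset_leaky(void)
{
	if (!program[gpu_id])
		opencl_init();
	create_clobj();
}

/* Post-fix, as in the diff: release existing buffers on re-entry */
static void reset_fixed(void)
{
	if (program[gpu_id])
		release_clobj();
	else
		opencl_init();
	create_clobj();
}
```

Calling reset_leaky() twice leaves live_buffers at 2, while two calls to reset_fixed() leave it at 1, matching the "called one extra time" scenario described above.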
Need to think. Here's what we create:
- ... (opencl_init, from the format's reset)
- ... (reset)
- ... (create_clobj)
I'm pretty darn sure all of the above is correctly torn down, so is there anything more? @claudioandre-br ?
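The debug-counter approach mentioned earlier can be sketched like this. Hedged: the real code would wrap the actual clCreate*/clRelease* calls; here plain counters stand in, and all names are made up:

```c
#include <assert.h>

/* Balance counters per resource class; a leak shows up as a nonzero
 * count at teardown time. */
static int programs_live, kernels_live, buffers_live;

static void dbg_create_program(void)  { programs_live++; }
static void dbg_release_program(void) { programs_live--; }
static void dbg_create_kernel(void)   { kernels_live++; }
static void dbg_release_kernel(void)  { kernels_live--; }
static void dbg_create_buffer(void)   { buffers_live++; }
static void dbg_release_buffer(void)  { buffers_live--; }

/* Call at the end of done(): everything created must have been released */
static void dbg_assert_balanced(void)
{
	assert(programs_live == 0);
	assert(kernels_live == 0);
	assert(buffers_live == 0);
}
```

This only proves that creates and releases pair up inside the format's own code; it wouldn't catch allocations made behind the scenes by the OpenCL runtime itself.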
Note, BTW, that I'm not yet sure the diff is correct regarding the things #4012 addressed in case of mask mode. For my current bug hunting, it's fine.
I'm pretty darn sure all of the above is correctly torn down, so is there anything more? @claudioandre-br ?
I will debug it today (using --stress-test and sha256, fast and slow formats) to see if I find anything.
I found a leak. An important one. I would say it is related to how/when reset() is called (a relatively recent change?).
But there is something more.
Okay, I can see that the author of the format (me) did something wrong. And that's it! Everything fine now.
FTR, I'm testing using the following (dies at approximately iteration 30):
make -sj4 && LWS=1024 GWS=163840 $JtR --format=sha256crypt-opencl -dev=4 -stress-test --no-mask
I will:
Then, we will know if all formats need a fix.
Iteration 200 (edited: I actually ran up to 400) uses the same amount of memory as iteration 1 does.
| 0 GeForce GTX 1080 Off | 00000000:01:00.0 Off | N/A |
| 0% 93C P2 70W / 180W | 315MiB / 8119MiB | 100% Default |
I'm still interested in discussing reset() and how it works (it is a question; there are some blanks in my mind):
reset()
  autotune() // memory allocated.
  -> what happens here? [1]
reset() (the real one, I guess)
  autotune() // memory allocated again.
magnum?
There is one leak left because I don't know what I can (or should) do in [1] above.
LOL, I randomly picked RAR-opencl, which was a very bad choice for finding problems in shared OpenCL code because it has a leak of its own, unrelated to this issue. I'll fix that and carry on with this.
I will continue my testing based on having #4354 merged in my tree.
I'm still interested in discussing reset() and how it works (it is a question; there are some blanks in my mind):
reset()
  autotune() // memory allocated.
  -> what happens here? [1]
reset() (the real one, I guess)
  autotune() // memory allocated again.
Here are all situations for calling reset() that I can think of right now:
FWIW, this is fixed. One format (perhaps a few) still has its own open issue.
"Error creating context for device 4 (2:1)" probably means we forgot to update this (and a nearby) message to 1-based device numbers, right?
Yes.
"Error creating context for device 4 (2:1)" probably means we forgot to update this (and a nearby) message to 1-based device numbers, right?
Yes.
Will you take care of this, @claudioandre-br? If we don't fix this right away, then perhaps we should create an issue for it, or we'd forget again.
When --test --format=opencl on a p3.2xlarge AWS instance doesn't crash on #4294, it crashes later with CL_OUT_OF_HOST_MEMORY:
Notice how everything was OK until the gpg-opencl test failed with CL_OUT_OF_HOST_MEMORY, and a bit later the entire run ended with that same error, occurring apparently just before another format would be benchmarked.
Benchmarking gpg-opencl on its own works fine:
Maybe we have an OpenCL host memory leak across tests?
This VM has ~60 GB of host memory, and the GPU has 16 GB. I tried monitoring the usage of both while rerunning the tests, but bumped into #4294, so my tests never reached as far as the above while I was monitoring. The highest memory usage I saw was ~3 GB on host and ~8 GB on device (by the way, why so much?)