openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

Repair support for no-byte-addressable OpenCL devices #4254

Open lw3eov opened 4 years ago

lw3eov commented 4 years ago

Hello, I am posting this here rather than on the mailing list because I think this may be a bug. I used the command suggested by @solardiz, john 343.in --format=wpapsk-opencl, and got the following (I am copying as much from my cmd window as it allows; there are earlier lines that no longer show, probably because there are too many lines in total):

"kernels\opencl_sha2_ctx.h", line 179: error: write to < 32 bits via pointer
          not allowed unless cl_khr_byte_addressable_store is enabled
                PUT_UINT32BE(ctx->state[5], output, 20);
                ^

Edit by @solardiz: many occurrences of the above dropped.

Error limit reached.
100 errors detected in the compilation of ".\OCL9868.tmp.cl".
Compilation terminated.

Internal error: clc compiler invocation failed.

Error building kernel kernels/wpapsk_kernel.cl. DEVICE_INFO=4194826
0: OpenCL CL_BUILD_PROGRAM_FAILURE (-11) error in opencl_common.c:1386 - clBuildProgram

C:\john\run>

magnumripper commented 4 years ago

This looks like a problem with your GPU and/or its driver/runtime. You blatantly ignored the following template when you opened the issue:

IMPORTANT

This is not a support forum, it's a bug tracker. If you don't understand the difference, please do NOT open an issue. For questions and support, review recent postings on the john-users mailing list at https://www.openwall.com/lists/john-users/ and then subscribe to the list at https://www.openwall.com/john/#lists and post your message to the list by e-mailing john-users at lists.openwall.com.

We are not interested in bugs in your distro-supplied and/or several years old Jumbo. Please do not open an issue unless you tried the latest version from HERE first.

Steps to reproduce

Try to be clear about your environment and what you are doing. If possible, share a sample hash or file that can be used to reproduce.

System configuration

Attach details about your OS and about JtR, including:

  • $ ./john --list=build-info.
  • $ ./john --list=opencl-devices (if applicable).

Since you didn't read that but just erased it, we're left to guess what hardware and driver you have a problem with. I won't try that.

magnumripper commented 4 years ago

Also, instead of posting one hundred instances of

"kernels\opencl_sha2_ctx.h", line 347: error: write to < 32 bits via pointer
not allowed unless cl_khr_byte_addressable_store is enabled
PUT_UINT64BE(ctx->state[3], output, 24);
^

Didn't it occur to you that you could have shortened it a bit? Your behaviour doesn't encourage anyone to help you. In fact, I'd have blocked you already if it weren't for @solardiz.

Do NOT open a new issue (if you do I WILL block you for life). You can amend the existing issues even though they are closed, and if you supply information that makes them valid, they might be re-opened.

lw3eov commented 4 years ago

I am sorry, @magnumripper. It was not my intention to make anyone angry or to do something wrong. I posted many instances of that error because I thought they might be needed to diagnose the problem. I didn't mention the software and hardware I am using because they appear in the image in my previous post with the same name (https://github.com/magnumripper/JohnTheRipper/issues/4253). However, here they are: Windows 7, John 1.9.0 jumbo (downloaded yesterday, so the latest version to date), ATI RV770, ATI Radeon HD 4800 SERIES, Advanced Micro Devices.

solardiz commented 4 years ago

I took the liberty to edit this issue for it to make sense for us as a bug that we can work on fixing, or maybe deliberately decide not to.

A detail is that @lw3eov's problem report is apparently for the 1.9.0-jumbo-1 release, whereas our current tree has changed since that release. However, this difference doesn't appear to be important: only the location of the kernels has changed, and the same problem was probably already present in 1.9.0-jumbo-1 and has remained in our current tree so far.

The issue is as follows:

OpenCL doesn't guarantee that individual bytes in memory can be directly addressed. This is considered an OpenCL extension. While most GPUs support the functionality and thus the OpenCL extension, the ancient AMD HD 4800 series are known to lack this support. While few people have those ancient GPUs these days, a past JtR developer @sayan1an had one of those and put some effort into letting the OpenCL kernels that he worked on run even on those poor GPUs. Reviewing the commits mentioning byte addressing, I now see that @claudioandre-br and @magnumripper also put some effort into that:

$ git log | grep -B3 'byte.*addressable' | head -19
Author: Sayantan Datta <std2048@gmail.com>
Date:   Sat Mar 21 18:14:15 2015 +0530

    Avoid byte addressable store on AMD gpus.
--
Author: magnum <john.magnum@hushmail.com>
Date:   Sat Oct 12 02:00:14 2013 +0200

    krb5pa-md5-opencl: Support devices that can't do byte addressable store.
--
Author: magnum <john.magnum@hushmail.com>
Date:   Tue Mar 26 21:04:23 2013 +0100

    NTLMv2 kernel: Bugfix for no-byte-addressable code path.
--
Author: Claudio André <claudio.andre@correios.net.br>
Date:   Fri Feb 8 21:15:11 2013 -0200

    Allow sha512crypt to be used on no_byte_addressable hardware.

and so on.

We might want to fix our code so that it's compatible with no-byte-addressable devices again.
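
For illustration only (this is not code from the JtR tree), here is a minimal C sketch of the usual workaround on a device that can only write 32 bits at a time: read the containing word, replace one byte in it, and write the whole word back. The helper names are hypothetical, and the sketch assumes a little-endian layout where byte 0 is the least significant byte of word 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the JtR tree): store one byte at byte
 * offset `index` of a buffer that may only be written 32 bits at a time.
 * Assumes little-endian layout: byte 0 is the low byte of word 0.
 */
static void put_byte_via_word(uint32_t *words, size_t index, uint8_t value)
{
    size_t word;
    unsigned shift;
    uint32_t old;

    assert(words != NULL);
    word = index >> 2;                        /* which 32-bit word */
    shift = (unsigned)(index & 3) * 8;        /* bit position inside it */
    old = words[word];                        /* read the whole word */
    old &= ~((uint32_t)0xFF << shift);        /* clear the target byte */
    old |= (uint32_t)value << shift;          /* merge in the new byte */
    words[word] = old;                        /* single 32-bit store */
}

/* Matching read, so results can be inspected without byte pointers. */
static uint8_t get_byte_via_word(const uint32_t *words, size_t index)
{
    return (uint8_t)(words[index >> 2] >> ((index & 3) * 8));
}
```

Note that the read-modify-write is not atomic: if two work-items updated different bytes of the same word concurrently, one update could be lost, so a real kernel would need each word to be owned by a single work-item.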

lw3eov commented 4 years ago

Hi guys, will my card work with some older version of JtR? Thanks.

solardiz commented 4 years ago

@lw3eov Yes, your GPU will possibly work with an older version of JtR (but we don't readily know which version, if at all, and you're likely to bump into other issues when trying to go down that route), or (more practically) with other formats (not WPA-PSK and not anything involving SHA-2) in case you ever need to crack anything else. Trying an older version is further complicated by the fact that we started supporting GPUs in binary builds for Windows only recently (if you were on Linux, this would be easier). For now, your best bet is to use CPU only, or get a newer GPU, or wait for us to possibly fix this issue, or use another tool. Trying older versions isn't something I'd recommend, especially considering that I don't want this issue's comments to proceed into discussing old versions, which isn't helping us fix the issue.

If and when we have a new build for you to test (where we'd think the issue is possibly fixed), we'll let you know. Unless and until this happens, the above is the only advice we can give you on this issue.

lw3eov commented 4 years ago

OK, thanks a lot @solardiz

magnumripper commented 4 years ago

> OpenCL doesn't guarantee that individual bytes in memory can be directly addressed. This is considered an OpenCL extension.

If memory serves me, what you say only applies to OpenCL 1.0. We agreed somewhere on requiring at least OpenCL 1.1 in Jumbo, although right now I can't see it actually being documented (we should fix that). We often try to handle 32 bits at a time when at all possible, but that's for performance reasons.

solardiz commented 4 years ago

There's still some attempt at supporting no-byte-addressable devices in the tree, including for SHA-2:

./opencl_DES_kernel_params.h:#if no_byte_addressable(DEVICE_INFO)
./opencl_lm_kernel_params.h:#if no_byte_addressable(DEVICE_INFO)
./opencl_sha2_common.h:#if no_byte_addressable(DEVICE_INFO) || (gpu_amd(DEVICE_INFO) && defined(AMD_PUTCHAR_NOCAST))
./opencl_device_info.h:#define no_byte_addressable(n)      ((n & DEV_NO_BYTE_ADDRESSABLE))
./opencl_misc.h:#if no_byte_addressable(DEVICE_INFO) || !SCALAR || (gpu_amd(DEVICE_INFO) && defined(AMD_PUTCHAR_NOCAST))

OTOH, dropping that incomplete support for SHA-2 isn't going to simplify it much because we'd keep the `|| (gpu_amd(DEVICE_INFO) && defined(AMD_PUTCHAR_NOCAST))` condition and thus both implementations.

So do we try to repair it and have complete no-byte-addressable support in the tree, or declare it best-effort (some formats might work, some won't)?

If HD 4800 series are practically the only devices affected, we might indeed choose not to waste time on this, especially given that we wouldn't have such old hardware for testing of new versions and thus this will keep breaking. (I suppose we could test-build kernels even for hardware that we don't have, but we'd probably not bother.)
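
As a rough sketch of how such a condition selects a code path: the flag values and the `ex_` names below are hypothetical (the real definitions are in opencl_device_info.h and are not reproduced here), and the check is written as a runtime function so it can be exercised on a host, whereas the kernels evaluate it in `#if` at build time.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. The real values
 * are defined in opencl_device_info.h and may differ. */
#define EX_DEV_GPU                  (1u << 1)
#define EX_DEV_AMD                  (1u << 3)
#define EX_DEV_NO_BYTE_ADDRESSABLE  (1u << 20)

/* Same shape as the macros quoted above, with the argument parenthesized. */
#define ex_gpu_amd(n)              (((n) & EX_DEV_AMD) && ((n) & EX_DEV_GPU))
#define ex_no_byte_addressable(n)  (((n) & EX_DEV_NO_BYTE_ADDRESSABLE) != 0)

/* Returns 1 when the word-only (no byte store) code path should be used.
 * `amd_putchar_nocast` stands in for defined(AMD_PUTCHAR_NOCAST). */
static int ex_use_word_stores(uint32_t device_info, int amd_putchar_nocast)
{
    return ex_no_byte_addressable(device_info) ||
           (ex_gpu_amd(device_info) && amd_putchar_nocast);
}
```

The point of the sketch is only that the AMD clause keeps the word-only implementation alive even if the no-byte-addressable clause were dropped.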

solardiz commented 4 years ago

> If memory serves me, what you say only applies to OpenCL 1.0.

Confirmed: "This extension was promoted to OpenCL 1.1 core."

So we can say that we require OpenCL 1.1+ and don't fully support devices that don't fully support OpenCL 1.1.
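
If we wanted to enforce that at run time, one possible approach (a sketch, not existing JtR code) is to parse the version string a device reports. The OpenCL specification defines the CL_DEVICE_VERSION string as "OpenCL <major>.<minor> <vendor-specific information>"; the helper name below is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 if a CL_DEVICE_VERSION-style string
 * reports at least the requested version, and 0 if it reports less or
 * cannot be parsed.
 */
static int opencl_version_at_least(const char *version,
                                   int want_major, int want_minor)
{
    int major = 0, minor = 0;

    if (version == NULL ||
        sscanf(version, "OpenCL %d.%d", &major, &minor) != 2)
        return 0;
    return major > want_major ||
           (major == want_major && minor >= want_minor);
}
```

A 1.0 device may still offer byte stores through the extension, so a version check like this would be stricter than checking for the extension itself.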

magnumripper commented 4 years ago

> So we can say that we require OpenCL 1.1+ and don't fully support devices that don't fully support OpenCL 1.1.

Right. We should mention that in doc/README-OPENCL. I believe some formats would be very easy to fix but I'm not sure how to test it. I assume no vendor supports using the cl_khr_byte_addressable_store extension backwards for disabling it 😆

magnumripper commented 4 years ago

@lw3eov try opening run/kernels/opencl_sha2_ctx.h in some text editor and change the code in SHA256_Final at line 161 saying:

#if gpu_nvidia(DEVICE_INFO)
    if (!((size_t)output & 0x03)) {
        PUT_UINT32BE_ALIGNED(ctx->state[0], output,  0);
        PUT_UINT32BE_ALIGNED(ctx->state[1], output,  4);
        PUT_UINT32BE_ALIGNED(ctx->state[2], output,  8);
        PUT_UINT32BE_ALIGNED(ctx->state[3], output, 12);
        PUT_UINT32BE_ALIGNED(ctx->state[4], output, 16);
        PUT_UINT32BE_ALIGNED(ctx->state[5], output, 20);
        PUT_UINT32BE_ALIGNED(ctx->state[6], output, 24);
        PUT_UINT32BE_ALIGNED(ctx->state[7], output, 28);
    } else
#endif
    {
        PUT_UINT32BE(ctx->state[0], output,  0);
        PUT_UINT32BE(ctx->state[1], output,  4);
        PUT_UINT32BE(ctx->state[2], output,  8);
        PUT_UINT32BE(ctx->state[3], output, 12);
        PUT_UINT32BE(ctx->state[4], output, 16);
        PUT_UINT32BE(ctx->state[5], output, 20);
        PUT_UINT32BE(ctx->state[6], output, 24);
        PUT_UINT32BE(ctx->state[7], output, 28);
    }

Replace all the above with just this:

    PUT_UINT32BE_ALIGNED(ctx->state[0], output,  0);
    PUT_UINT32BE_ALIGNED(ctx->state[1], output,  4);
    PUT_UINT32BE_ALIGNED(ctx->state[2], output,  8);
    PUT_UINT32BE_ALIGNED(ctx->state[3], output, 12);
    PUT_UINT32BE_ALIGNED(ctx->state[4], output, 16);
    PUT_UINT32BE_ALIGNED(ctx->state[5], output, 20);
    PUT_UINT32BE_ALIGNED(ctx->state[6], output, 24);
    PUT_UINT32BE_ALIGNED(ctx->state[7], output, 28);

(You can do a similar change for the PUT_UINT64BE's in SHA512_Final starting at line 331)

Does that help? Perhaps there are other similar problems surfacing then, that are harder to fix.
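
To show why the edit above might avoid the error, here is a host-side C sketch of the two kinds of store. The function names are hypothetical stand-ins and are not the actual PUT_UINT32BE and PUT_UINT32BE_ALIGNED macros: the first writes four separate bytes, the second does one 32-bit write of a byte-swapped value. Note that the word-wise form assumes the output is 4-byte aligned, which the original code only checked at run time on NVIDIA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-wise big-endian store: four separate 8-bit writes, the kind of
 * store the compiler rejected in the log above. */
static void put_u32be_bytes(uint32_t n, unsigned char *out, size_t offset)
{
    out[offset + 0] = (unsigned char)(n >> 24);
    out[offset + 1] = (unsigned char)(n >> 16);
    out[offset + 2] = (unsigned char)(n >> 8);
    out[offset + 3] = (unsigned char)(n);
}

static uint32_t swap32(uint32_t n)
{
    return (n >> 24) | ((n >> 8) & 0x0000FF00u) |
           ((n << 8) & 0x00FF0000u) | (n << 24);
}

static int host_is_little_endian(void)
{
    const uint32_t one = 1;
    return *(const unsigned char *)&one == 1;
}

/* Word-wise big-endian store: a single 32-bit write. `out` is a 32-bit
 * buffer and `offset` is a byte offset that must be a multiple of 4. */
static void put_u32be_aligned(uint32_t n, uint32_t *out, size_t offset)
{
    assert((offset & 3) == 0);
    out[offset >> 2] = host_is_little_endian() ? swap32(n) : n;
}
```

Both functions leave the same bytes in memory; only the number and width of the stores differ.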

solardiz commented 4 years ago

magnum, maybe we should avoid having the basic wpapsk-opencl format's kernel depend on SHA-2 at all? The current all-in-one kernel is unnecessarily fragile, which makes even the basic WPA-PSK functionality unavailable on systems that have issues with any aspect of the all-in-one kernel.

magnumripper commented 4 years ago

I'm not sure, but I believe the 802.11w stuff [that is still WPA-PSK but needs SHA-2] is nearly as "basic" as the older WPA/WPA2 algos and the user will normally not see the difference. But I see what you mean. We could refrain from building it when none of that type is loaded. Even better would be to soft-fail even after having tried to build it, and "unload" that type (assuming we have other types loaded as well). But that would add more complexity and maybe some core hacks.

magnumripper commented 4 years ago

Actually we could even fall back to CPU post-processing for that when needed. Fairly trivial although perhaps unreasonable work for a very small user-base.

lw3eov commented 4 years ago

@magnumripper Now I get this (I am only copying the last lines as the previous ones are similar; if you need them, maybe I can upload a text file somewhere so that I don't make this too long):

"kernels\opencl_cmac.h", line 117: error: write to < 32 bits via pointer
          not allowed unless cl_khr_byte_addressable_store is enabled
        LSHIFT(K, K);
        ^

Error limit reached.
100 errors detected in the compilation of ".\OCL9E61.tmp.cl".
Compilation terminated.

Internal error: clc compiler invocation failed.

Error building kernel kernels/wpapsk_kernel.cl. DEVICE_INFO=4194826
0: OpenCL CL_BUILD_PROGRAM_FAILURE (-11) error in opencl_common.c:1386 - clBuildProgram

C:\john\run>

solardiz commented 4 years ago

FWIW, hashcat simply skips such devices regardless of requested hash type:

        if (strstr (device_extensions, "byte_addressable_store") == 0)
        {
          event_log_error (hashcat_ctx, "* Device #%u: This device does not support byte-addressable store.", device_id + 1);

          device_param->skipped = true;
        }
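
A slightly stricter variant of the same idea, as a sketch: match the extension name as a whole space-separated token instead of a substring. The helper name is hypothetical; on a real host the string would come from clGetDeviceInfo with CL_DEVICE_EXTENSIONS.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if `name` appears as a complete
 * space-separated token in a CL_DEVICE_EXTENSIONS-style string.
 */
static int has_extension(const char *extensions, const char *name)
{
    size_t len = strlen(name);
    const char *p = extensions;

    if (len == 0)
        return 0;
    while ((p = strstr(p, name)) != NULL) {
        int starts_token = (p == extensions) || (p[-1] == ' ');
        int ends_token = (p[len] == '\0') || (p[len] == ' ');

        if (starts_token && ends_token)
            return 1;
        p++;    /* keep searching after a partial match */
    }
    return 0;
}
```

A caller would pass "cl_khr_byte_addressable_store" as `name` and skip, or fall back for, any device where the result is 0.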

magnumripper commented 4 years ago

The "unless cl_khr_byte_addressable_store is enabled" part of the error message is confusing - sounds like it would build just fine if we explicitly enabled the extension. I guess it's just poor wording though.

I'm actually considering trying to get hold of some 1.0 device just to be able to test this, mostly for finding code we could make faster, but support for old devices will come as a spin-off. Perhaps one could trick POCL into treating any device as not being able to store bytes...

claudioandre-br commented 4 years ago

> I'm actually considering trying to get hold of some 1.0 device

Worth it? I mean, how many people have one and would like to use it for cracking?

magnumripper commented 4 years ago

Like I said, it will help us find things we're still doing a byte at a time. Anything is doable in 32 bits at a time if you really want to - I had my doubts when implementing RC4 but even that can be unrolled. And once you get it right, it's FAST! 😎 The devices that can do byte-size writes are surely not doing it quickly.
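
As a small illustration of the "32 bits at a time" idea (a hypothetical helper, not JtR code): accumulate bytes in a register and issue one store per completed word, rather than one store per byte. The sketch assumes little-endian packing and zero-pads the last partial word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy `len` bytes from `src` into `dst` using only
 * whole 32-bit stores. Bytes are packed little-endian. If `len` is not a
 * multiple of 4, the final word is zero-padded, so `dst` must have room
 * for (len + 3) / 4 words.
 */
static void copy_bytes_as_words(uint32_t *dst, const uint8_t *src, size_t len)
{
    uint32_t acc = 0;
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < len; i++) {
        acc |= (uint32_t)src[i] << ((i & 3) * 8);  /* pack into register */
        if ((i & 3) == 3) {
            dst[i >> 2] = acc;                     /* one 32-bit store */
            acc = 0;
        }
    }
    if (len & 3)
        dst[len >> 2] = acc;                       /* zero-padded tail */
}
```

For 6 input bytes this issues two stores instead of six, and none of them is narrower than 32 bits.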