openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

Recent changes in the autotune limit impair performance #3613

Closed claudioandre-br closed 5 years ago

claudioandre-br commented 5 years ago

Blame https://github.com/magnumripper/JohnTheRipper/commit/d7a8a228b8bc80eee5f9680b779679da6b3e595d#diff-136a558de24bf7376d97f8a2b7dd6ea3

$ "$JtR" -test=5 --format=raw-sha512-opencl -dev=4 --mask=?l?l?l?l?l?l?l?l
Device 4: GeForce GTX TITAN X
Benchmarking: raw-SHA512-opencl [SHA512 OpenCL/mask accel]... DONE
Raw:    241605K c/s real, 240668K c/s virtual, GPU util: 100%

$ _GPU_AUTOTUNE_LIMIT=500 "$JtR" -test=5 --format=raw-sha512-opencl -dev=4 --mask=?l?l?l?l?l?l?l?l
Device 4: GeForce GTX TITAN X
Benchmarking: raw-SHA512-opencl [SHA512 OpenCL/mask accel]... DONE
Raw:    557075K c/s real, 555972K c/s virtual, GPU util: 100%

I really have no idea how, but I would like to keep the source code as is and advertise the auto-tune "option".

solardiz commented 5 years ago

FWIW, on our GTX 1080 I'm getting 670M with default auto-tune (uses LWS=32 GWS=5120), 823M with _GPU_AUTOTUNE_LIMIT=500 (uses LWS=128 GWS=10240), and 854M with manual GWS=15360 (tunes to LWS=256). This is with ./john -test -form=raw-sha512-opencl -v=5 -mask='?a?a?a?a?a?a?a' -dev=3. Apparently, hashcat does 1078M.

claudioandre-br commented 5 years ago

Our maximum performance is 900M. Anyway:

But how does a real cracking session behave? I can't do better than:

0g 0:00:05:38 0,00% (ETA: 2019-05-10 04:41) 0g/s 703509Kp/s 703509Kc/s 13366MC/s GPU:92°C ##vIXvaa..##0Y$vaa
claudioandre-br commented 5 years ago

There is another problem: when running GPU mask mode, we use too much memory. Consider the (unlikely) leak and my cracking session:

for i in `seq 1000000000 9999999999`; do echo -n $i | sha512sum | sed 's/-/ /g'; done > ~/test512.in
_GPU_AUTOTUNE_LIMIT=500 "$JtR" ~/test512.in --format=raw-sha512-opencl -dev=3 --mask=?d?d?d?d?d?d?d?d?d?d

If mask_cand = 10000 and GWS = 10000, it means 100 million keys per crypt_all() call. Note:

  • In the worst case, I will have 100 million keys that might be a crack;
  • So, in the worst case, I need to handle hash_ids = malloc(100M * 3 * sizeof(int)).

I can't ignore this situation. And that is too much memory.

That said, there are a lot of formats misbehaving, IMO. Example:

  • I loaded only one hash;

    • num_loaded_hashes = 1
  • mask_cand = 10000 and GWS = 10000
  • The GPU can't discard all keys (unless you create a cmp_exact on the GPU);
  • So, one will see 3, 5, 10, 50, ... keys that seem to be a crack;

    • all these 3, 5, 10, 50, ... keys have to be copied to the CPU;
  • But formats do something like this:

    • hash_ids = (cl_uint*) mem_alloc((3 * num_loaded_hashes + 1) * sizeof(cl_uint));

You malloc'ed room for 1 item but, in fact, you need 3, 5, 10, 50, ... items.
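
For illustration, here is the worst-case arithmetic as a standalone sketch (using the hypothetical numbers above; this is not the actual format code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Hypothetical numbers from the example above */
    uint64_t gws = 10000;        /* base keys set via set_key() */
    uint64_t mask_cand = 10000;  /* internal mask candidates per base key */

    uint64_t keys_per_crypt = gws * mask_cand;  /* 100 million keys per crypt_all() */

    /* Worst case: every candidate looks like a crack, and each one needs
       3 entries in hash_ids, plus 1 entry for the counter. */
    uint64_t worst_case_bytes = (3 * keys_per_crypt + 1) * sizeof(uint32_t);

    printf("keys per crypt_all():  %llu\n", (unsigned long long)keys_per_crypt);
    printf("worst-case hash_ids: ~%.1f GB\n", worst_case_bytes / 1e9);
    return 0;
}

That is where the 1.2 GB figure discussed below comes from.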

magnumripper commented 5 years ago
  • In the worst case, I will have 100 million keys that might be a crack;
  • So, in the worst case, I need to handle hash_ids = malloc(100M * 3 * sizeof(int)).

I can't ignore this situation. And that is too much memory.

That's just 1.2 GB, shouldn't be much of a problem, or do I miss something?

That said, there are a lot of formats misbehaving, IMO. Example:

  • I loaded only one hash;

    • num_loaded_hashes = 1
  • mask_cand = 10000 and GWS=10000
  • The GPU can't discard all keys (unless you create a cmp_exact on the GPU);
  • So, one will see 3, 5, 10, 50, ... keys that seem to be a crack;

    • all these 3, 5, 10, 50, ... keys have to be copied to the CPU;
  • But formats do something like this:

    • hash_ids = (cl_uint*) mem_alloc((3 * num_loaded_hashes + 1) * sizeof(cl_uint));

You malloc'ed room for 1 item but, in fact, you need 3, 5, 10, 50, ... items.

Interesting, and a bit confusing at first. Assuming you are right, we need to change it to something like mem_alloc((3 * count * mask_int_cand.num_int_cand + 1) * sizeof(cl_uint)); is that right?

magnumripper commented 5 years ago

Please also note that we're talking virtual memory here. Even if mem_alloc((3 * count * mask_int_cand.num_int_cand + 1) * sizeof(cl_uint)) is a huge figure, as long as we only use a few pages, that's all that is actually consumed. We just need to refrain from using memset et al. on it.
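
A minimal host-side sketch of that point, assuming Linux-style overcommit (illustrative only, not JtR code):

#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Reserve 1.2 GB of address space; with overcommit this typically
       succeeds without consuming physical memory. */
    size_t huge = (size_t)1200 * 1024 * 1024;
    unsigned char *hash_ids = malloc(huge);

    if (!hash_ids)
        return 1;

    /* Touching only a few entries keeps resident memory down to a few pages. */
    hash_ids[0] = 0;

    /* By contrast, memset(hash_ids, 0, huge) would fault in every page and
       really consume the whole 1.2 GB. */
    free(hash_ids);
    return 0;
}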

solardiz commented 5 years ago

Why 3 * and why + 1?

claudioandre-br commented 5 years ago

That's just 400 MB, shouldn't be much of a problem, or do I miss something?

It is 400 * 3 (I missed the 3 x in the example).

Why 3 * and why + 1?

  1. We are talking about a GPU buffer (limited by the Max memory alloc. size property).
  2. 3 * -> GPU mask mode uses 3 "indexes" per crack.
  3. + 1 -> hash_id[0] is the number of cracks.
key                  mask    hash
set_key("min#e#")    a - z   8846f713
set_key("you#e#")    0 - 9   31d6cfe0
-                    -       etc.

So, hash_id[] ends up like:

hash_id[1 + 3 * index] = 0 (min#e#);
hash_id[2 + 3 * index] = 3 (d - 0);
hash_id[3 + 3 * index] = 31d6cfe0;
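
In host-code terms, reading the buffer back looks roughly like this (a sketch with illustrative names, not the actual format code):

#include <CL/cl.h>

/* Sketch: walk the hash_ids buffer laid out as above. */
static void report_cracks(const cl_uint *hash_ids)
{
    cl_uint num_cracked = hash_ids[0];          /* hash_id[0]: number of cracks */

    for (cl_uint i = 0; i < num_cracked; i++) {
        cl_uint key_idx  = hash_ids[1 + 3 * i]; /* which set_key() base key, e.g. 0 for "min#e#" */
        cl_uint mask_idx = hash_ids[2 + 3 * i]; /* which internal mask candidate, e.g. 3 */
        cl_uint hash_ref = hash_ids[3 + 3 * i]; /* matched hash reference, e.g. 0x31d6cfe0 */

        /* The CPU side then rebuilds the full candidate from (key_idx, mask_idx)
           and verifies it, since the GPU cannot discard every false positive. */
        (void)key_idx; (void)mask_idx; (void)hash_ref;
    }
}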
solardiz commented 5 years ago

Thanks, Claudio!

I think we previously determined (when loading the 320M HIBP hashes at once) that "Max memory alloc. size" is not actually enforced, at least on NVIDIA GPUs lately, even though it is still reported as a certain fraction of total GPU memory. So we shouldn't blindly give up if this limit would be exceeded: we should try anyway, and only give up if the allocation attempt fails.
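
In other words, something along these lines (a sketch using plain OpenCL host calls, not the actual shared code):

#include <CL/cl.h>
#include <stddef.h>

/* Sketch: attempt the allocation even if 'bytes' exceeds the reported
   CL_DEVICE_MAX_MEM_ALLOC_SIZE, and only treat it as a failure if the
   driver actually refuses.  Some drivers defer the failure to first use,
   so the caller still has to handle errors from later enqueues. */
static cl_mem try_alloc(cl_context ctx, size_t bytes)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

    return (err == CL_SUCCESS) ? buf : NULL;
}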

BTW, what did you mean by "Our limit is 900k" in a previous comment?

claudioandre-br commented 5 years ago

what did you mean by "Our limit is 900k" in a previous comment?

The maximum performance is 900M.

solardiz commented 5 years ago

At what GWS/LWS is 900M reached? Can/should we improve auto-tuning to actually reach it?

claudioandre-br commented 5 years ago

I was talking about possibilities.

[edited] These are the best possible values (avoiding all non-hashing stuff):

Raw speed figures including buffer transfers:
prep: 8.192us, xfer pass: 2.048us, idx: 11.424us, crypt: 53.863ms, result: 1.280us, mask xfer: 4.800us + 6.016us
gws:      2560  428668Kc/s   428668035 rounds/s  53.897ms per crypt_all()!
prep: 9.216us, xfer pass: 14.560us, idx: 8.064us, crypt: 53.911ms, result: 1.248us, mask xfer: 8.064us + 6.144us
gws:      5120  856356Kc/s   856356310 rounds/s  53.958ms per crypt_all()+
prep: 9.216us, xfer pass: 27.552us, idx: 14.464us, crypt: 93.178ms, result: 1.216us, mask xfer: 14.496us + 6.016us
gws:     10240  991036Kc/s   991036745 rounds/s  93.251ms per crypt_all()+
prep: 10.240us, xfer pass: 53.408us, idx: 27.744us, crypt: 187.500ms, result: 1.248us, mask xfer: 27.424us + 6.016us
gws:     20480  985105Kc/s   985105397 rounds/s 187.626ms per crypt_all()
prep: 9.216us, xfer pass: 112.256us, idx: 53.152us, crypt: 376.536ms, result: 1.248us, mask xfer: 52.960us + 6.144us
gws:     40960  981137Kc/s   981137062 rounds/s 376.771ms per crypt_all()
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:722 - failed in clEnqueueNDRangeKernel I
 (error occurred)
Local worksize (LWS) 256, global worksize (GWS) 10240 (40 blocks)
DONE
Raw:    969910K c/s real, 969910K c/s virtual, GPU util: 100%
claudioandre-br commented 5 years ago

Well, we can retry changing Maj, ror, Sigma0, ... as in ba8fe8166482a51569a114829f31c48f0691dc79.

Magnum, what can you say about your tests?


LUT3 is useless according to my tests.

magnumripper commented 5 years ago

LUT3 is anything but useless; we just don't need to use it explicitly. My experimental LUT3-64 (consisting of 2x LUT3-32) that you referenced hasn't proven good anywhere. And with newer drivers, explicit LUT3 in general doesn't give much, because LUT3 is used anyway, and likely better, by the optimizer.

Disregarding LUT3 though, there are now several alternative sigma functions in opencl_sha2.h - it could be wise to try them all out (changing to #if 0 ... #elif 1 and so on). They are optimized for various cases - destructive instructions or not, hardware rotate or not and so on.
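
For reference, the kind of alternatives being selected between look roughly like this (an illustrative sketch of SHA-512 Sigma0 variants, not the literal contents of opencl_sha2.h):

/* Illustrative SHA-512 Sigma0 variants behind #if/#elif, in the spirit of
   opencl_sha2.h (not its literal code). */
#if 1
/* Variant A: use OpenCL's rotate() builtin and let a hardware rotate do the
   work (rotate() rotates left, so rotate right by n is rotate left by 64-n). */
#define ror64(x, n)   rotate((ulong)(x), (ulong)(64 - (n)))
#define Sigma0(x)     (ror64((x), 28) ^ ror64((x), 34) ^ ror64((x), 39))
#elif 0
/* Variant B: plain shift/or form, sometimes preferable where a 64-bit
   rotate is emulated anyway. */
#define ror64(x, n)   (((x) >> (n)) | ((x) << (64 - (n))))
#define Sigma0(x)     (ror64((x), 28) ^ ror64((x), 34) ^ ror64((x), 39))
#endif

Flipping the #if/#elif values is enough to benchmark one variant against the other.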

magnumripper commented 5 years ago

Blame https://github.com/magnumripper/JohnTheRipper/commit/d7a8a228b8bc80eee5f9680b779679da6b3e595d#diff-136a558de24bf7376d97f8a2b7dd6ea3

I wonder why we need several hundred milliseconds to reach lift-off GWS? That's not needed in NT, nor in hashcat, is it?

That's just 400 MB, shouldn't be much of a problem, or do I miss something?

It is 400 * 3 (I missed the 3 x in the example).

So it's 1.2 GB, still no big deal. A weak card with too little memory is not likely to need that high a GWS anyway, so it is self-regulating.

claudioandre-br commented 5 years ago

That's just 400 MB, shouldn't be much of a problem, or do I miss something?

It is 400 * 3 (I missed the 3 x in the example).

So it's 1.2 GB, still no big deal. A weak card with too little memory is not likely to need that high a GWS anyway, so it is self-regulating.

The 1.2 GB is for GWS = 10240. Since the GTX 1080 max memory alloc. size is 2029 MB, it won't go to GWS = 20480. This is one of the motivations for the question https://github.com/magnumripper/JohnTheRipper/issues/3613#issuecomment-456083284

claudioandre-br commented 5 years ago

Blame d7a8a22#diff-136a558de24bf7376d97f8a2b7dd6ea3

I wonder why we need several hundred milliseconds to reach lift-off GWS? That's not needed in NT, nor in hashcat, is it?

NT is faster. SHA-512 needs 376.771 ms per crypt_all() with 40K keys (GWS) to reach 980M.

Maybe I'm missing out on something, but I do not see any shortcuts.

claudioandre-br commented 5 years ago

Well, personally, I'm not sure how to use GPU mask mode at full speed. Maybe I'm using it incorrectly.

So, I use trial and error to see which mask seems best. E.g.:

$ "$JtR" -test=5 --format=raw-sha512-rev-opencl -dev=0 --mask=?a?a?a?a
Device 0: gfx900 [Radeon RX Vega]
Benchmarking: raw-SHA512-rev-opencl [SHA512 OpenCL/mask accel]... DONE
Raw:    490002K c/s real, 6793M c/s virtual

$ "$JtR" -test=5 --format=raw-sha512-rev-opencl -dev=0 --mask=?l?l?l?l
Device 0: gfx900 [Radeon RX Vega]
Benchmarking: raw-SHA512-rev-opencl [SHA512 OpenCL/mask accel]... DONE
Raw:    139518K c/s real, 11998M c/s virtual
magnumripper commented 5 years ago

2022 comes to mind... the selection of internal mask needs more granularity.

magnumripper commented 5 years ago

So it's 1.2 GB, still no big deal. A weak card with too little memory is not likely to need that high a GWS anyway, so it is self-regulating.

The 1.2 GB is for GWS = 10240. Since the GTX 1080 max memory alloc. size is 2029 MB, it won't go to GWS = 20480.

SHA512 needs 376.771ms to test 40K keys that reaches 980M.

I believe your internal mask target is too large. It's unreasonable to need 20M keys and that long a duration to hide latencies for a relatively slow format. NT only needs opencl_speed_index(gpu_id) / 300, which means a target mask multiplier of 14788 on a 1080. It seems to end up at a GWS of 20480 with a multiplier of 10890, and a kernel duration of only 12.7 ms.

SHA-512 is much slower, so it should basically need less. There's got to be a better way of getting more out of it.
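
To put rough numbers on the NT comparison (a back-of-the-envelope sketch using only the figures quoted above):

#include <stdio.h>

int main(void)
{
    /* NT on the GTX 1080, figures quoted above */
    double gws        = 20480;   /* global work size */
    double multiplier = 10890;   /* internal mask multiplier it ended up at */
    double kernel_ms  = 12.7;    /* crypt_all() duration */

    double keys_per_crypt = gws * multiplier;            /* ~223M candidates */
    double rate = keys_per_crypt / (kernel_ms / 1000.0); /* candidates per second */

    printf("keys per crypt_all(): %.0f\n", keys_per_crypt);
    printf("effective rate:       %.1f G/s\n", rate / 1e9);
    return 0;
}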

magnumripper commented 5 years ago
Hashmode: 1700 - SHA2-512

Speed.#4.........:  1055.3 MH/s (79.13ms) @ Accel:128 Loops:128 Thr:256 Vec:1

hashcat gets over 1 G with 80 ms duration...

solardiz commented 5 years ago

Added 1.9.0-jumbo-1 milestone for us to consider doing something about this issue in time for the release - e.g., printing a message suggesting use of _GPU_AUTOTUNE_LIMIT=500. Any other ideas?
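
A sketch of how such a hint could be wired up (illustrative only, not the actual autotune code):

#include <stdio.h>
#include <stdlib.h>

/* Honor _GPU_AUTOTUNE_LIMIT when set; otherwise use the format's default
   and point the user at the override.  Names are illustrative. */
static int autotune_limit_ms(int format_default_ms)
{
    const char *env = getenv("_GPU_AUTOTUNE_LIMIT");

    if (env && atoi(env) > 0)
        return atoi(env);

    fprintf(stderr,
            "Note: autotune duration limit is %d ms; if mask-mode speed looks "
            "low, try e.g. _GPU_AUTOTUNE_LIMIT=500\n", format_default_ms);
    return format_default_ms;
}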

magnumripper commented 5 years ago

I’d rather have it autotune fine without any extra options. Let’s revert the changes in the few formats where we’ve seen a regression: is that just raw-sha512, or any others?

magnumripper commented 5 years ago

At some point in the future, though, we should try to find out why we need so many threads / such long durations to hide latencies while hashcat doesn’t!

solardiz commented 5 years ago

Let’s revert the changes in the few formats where we’ve seen regression

What changes do you refer to? I was under the impression that a global change "in the autotune limit" (where is that?) affected a few formats. So do we somehow override that reduction of the limit just for those formats? Where would we do that? Perhaps this actually means adding code to selectively override the limit, rather than literally reverting any code change?

solardiz commented 5 years ago

Elsewhere on the same topic, Claudio just wrote:

"We can change the default value (tuning time) to 200 or to 500, for example.

Split the kernel to make it run faster seems inappropriate to me. So, I won't do it before mask fixes.

magnumripper commented 5 years ago

I assumed the problem was the specific change (limit decreased from 500 to 100) to raw-sha512 in the mentioned commit.

solardiz commented 5 years ago

Per my testing, raw-sha256-opencl is affected as well. @claudioandre-br Please increase its limit to 500ms as well. Thanks!

[solar@well run]$ _GPU_AUTOTUNE_LIMIT=500 ./john -test -form=raw-sha256-opencl -v=4 -mask
Note: Self-tests currently not performed when using -mask with -test
Device 1: Tahiti [AMD Radeon HD 7900 Series]
Benchmarking: raw-SHA256-opencl [SHA256 OpenCL/mask accel]... Internal mask, multiplier: 6760 (target: 6826)

Local worksize (LWS) 256, global worksize (GWS) 16384 (64 blocks)
DONE
Raw:    794898K c/s real, 15103M c/s virtual, Dev#1 util: 64%

[solar@well run]$ ./john -test -form=raw-sha256-opencl -v=4 -mask
Note: Self-tests currently not performed when using -mask with -test
Device 1: Tahiti [AMD Radeon HD 7900 Series]
Benchmarking: raw-SHA256-opencl [SHA256 OpenCL/mask accel]... Internal mask, multiplier: 6760 (target: 6826)

Local worksize (LWS) 128, global worksize (GWS) 8192 (64 blocks)
DONE
Raw:    413268K c/s real, 13844M c/s virtual, Dev#1 util: 64%
solardiz commented 5 years ago

Fixed with #3855.