FWIW, on our GTX 1080 I'm getting 670M with default auto-tune (uses LWS=32 GWS=5120), 823M with _GPU_AUTOTUNE_LIMIT=500 (uses LWS=128 GWS=10240), and 854M with manual GWS=15360 (tunes to LWS=256). This is with ./john -test -form=raw-sha512-opencl -v=5 -mask='?a?a?a?a?a?a?a' -dev=3. Apparently, hashcat does 1078M.
Our maximum performance is 900M. Anyway, is
Internal mask, multiplier: 9025
a good value? But how does a real cracking session behave? I can't do better than:
0g 0:00:05:38 0,00% (ETA: 2019-05-10 04:41) 0g/s 703509Kp/s 703509Kc/s 13366MC/s GPU:92°C ##vIXvaa..##0Y$vaa
There is another problem. When running GPU mask mode, I use too much memory. Consider the (unlikely) leak and my cracking session:
for i in `seq 1000000000 9999999999`; do echo -n $i | sha512sum | sed 's/-/ /g'; done > ~/test512.in
_GPU_AUTOTUNE_LIMIT=500 "$JtR" ~/test512.in --format=raw-sha512-opencl -dev=3 --mask=?d?d?d?d?d?d?d?d?d?d
If mask_cand = 10000 and GWS = 10000, it means 100 million keys per crypt_all() call. Note:
- In the worst case, I will have 100 million keys that might be a crack;
- So, in the worst case, I need to handle hash_ids = malloc(100M * 3 * sizeof(int)).
I can't ignore this situation. And that is too much memory.
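For concreteness, the worst case works out like this (4-byte cl_uint entries assumed):

```
10000 GWS × 10000 internal mask candidates = 100,000,000 keys per crypt_all()
(3 × 100,000,000 + 1) × 4 bytes ≈ 1.2 GB for hash_ids
```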
That's just 1.2 GB, shouldn't be much of a problem, or do I miss something?
That said, there are a lot of formats misbehaving, IMO. Example:
- I loaded only one hash, so num_loaded_hashes = 1;
- mask_cand = 10000 and GWS = 10000;
- the GPU can't discard all keys (unless you create a cmp_exact on GPU), so one will see 3, 5, 10, 50, ... keys that seem to be a crack;
- all these 3, 5, 10, 50, ... keys have to be copied to the CPU.
But formats do something like this:
hash_ids = (cl_uint*) mem_alloc((3 * num_loaded_hashes + 1) * sizeof(cl_uint));
You malloc'ed 1 item, but in fact you need 3, 5, 10, 50, ... items.
Interesting, and a bit confusing at first. Assuming you are right, we need to change it to something like mem_alloc((3 * count * mask_int_cand.num_int_cand + 1) * sizeof(cl_uint)), is that right?
Please also note that we're talking virtual memory here. Even if mem_alloc((3 * count * mask_int_cand.num_int_cand + 1) * sizeof(cl_uint)) is a huge figure, as long as we only use a few pages, that's all that is actually consumed. We just need to refrain from using memset et al. on it.
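A minimal standalone sketch of that point, assuming a Linux-style overcommit allocator (illustrative, not JtR code):

```c
#include <stdlib.h>

int main(void)
{
	/* Worst-case hash_ids size from the example above: ~1.2 GB virtual */
	size_t size = (3UL * 100000000 + 1) * sizeof(unsigned int);
	unsigned int *hash_ids = malloc(size);

	if (!hash_ids)
		return 1;
	/* Only pages we touch get committed: writing the counter and a few
	 * entries costs a few KB of RAM. A memset() over the whole buffer
	 * would fault in every page and consume the full 1.2 GB. */
	hash_ids[0] = 0;
	free(hash_ids);
	return 0;
}
```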
Why 3 * and why + 1?
That's just 400 MB, shouldn't be much of a problem, or do I miss something?
It is 400 * 3 (I missed the 3 x in the example).
Why 3 * and why + 1?
(Keep in mind the Max memory alloc. size property, too.)
3 * -> GPU mask mode uses 3 "indexes";
+ 1 -> hash_id[0] is the number of cracks.

key | mask | hash |
---|---|---|
set_key("min#e#") | a - z | 8846f713 |
set_key("you#e#") | 0 - 9 | 31d6cfe0 |
- | - | etc |

So, hash_id[] looks like:
hash_id[1 + 3 * index] = 0 (min#e#);
hash_id[2 + 3 * index] = 3 (d - 0);
hash_id[3 + 3 * index] = 31d6cfe0;
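If I read that layout right, the host side walks the buffer roughly as follows (an illustrative sketch with made-up names, not the literal format code):

```c
#include <stdio.h>

/* hash_ids[0] holds the number of cracks; each crack occupies 3 slots. */
static void report_cracks(const unsigned int *hash_ids)
{
	unsigned int count = hash_ids[0];

	for (unsigned int i = 0; i < count; i++) {
		unsigned int key_index  = hash_ids[1 + 3 * i]; /* base key, e.g. min#e# */
		unsigned int cand_index = hash_ids[2 + 3 * i]; /* internal mask candidate */
		unsigned int hash_ref   = hash_ids[3 + 3 * i]; /* hash that was hit */

		/* Rebuild the plaintext from key_index + cand_index, then verify */
		printf("crack %u: key %u, candidate %u, hash %08x\n",
		       i, key_index, cand_index, hash_ref);
	}
}
```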
Thanks, Claudio!
I think we previously determined (when loading the 320M HIBP hashes at once) that "Max memory alloc. size" is not actually enforced at least on NVIDIA GPUs lately, even though is still reported as a certain fraction of total GPU memory. So we shouldn't blindly give up if this limit would be exceeded - we should try anyway, and only give up if the allocation attempt fails.
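A sketch of that try-anyway approach on the host side (illustrative; note the failure may also surface only at first use, as in the clEnqueueNDRangeKernel error shown further down):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Attempt the full-size buffer even when size exceeds the reported
 * CL_DEVICE_MAX_MEM_ALLOC_SIZE; give up only if the driver refuses. */
static cl_mem try_alloc(cl_context ctx, size_t size)
{
	cl_int err;
	/* Some drivers accept this even past the reported limit; others
	 * return an error here, or fail later at enqueue time. */
	cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

	if (err != CL_SUCCESS)
		return NULL; /* caller lowers GWS and retries */
	return buf;
}
```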
BTW, what did you mean by "Our limit is 900k" in a previous comment?
what did you mean by "Our limit is 900k" in a previous comment?
The maximum performance is 900M.
At what GWS/LWS is 900M reached? Can/should we improve auto-tuning to actually reach it?
I was talking about possibilities [edited]: the best possible values (avoiding all non-hashing stuff).
Raw speed figures including buffer transfers:
prep: 8.192us, xfer pass: 2.048us, idx: 11.424us, crypt: 53.863ms, result: 1.280us, mask xfer: 4.800us + 6.016us
gws: 2560 428668Kc/s 428668035 rounds/s 53.897ms per crypt_all()!
prep: 9.216us, xfer pass: 14.560us, idx: 8.064us, crypt: 53.911ms, result: 1.248us, mask xfer: 8.064us + 6.144us
gws: 5120 856356Kc/s 856356310 rounds/s 53.958ms per crypt_all()+
prep: 9.216us, xfer pass: 27.552us, idx: 14.464us, crypt: 93.178ms, result: 1.216us, mask xfer: 14.496us + 6.016us
gws: 10240 991036Kc/s 991036745 rounds/s 93.251ms per crypt_all()+
prep: 10.240us, xfer pass: 53.408us, idx: 27.744us, crypt: 187.500ms, result: 1.248us, mask xfer: 27.424us + 6.016us
gws: 20480 985105Kc/s 985105397 rounds/s 187.626ms per crypt_all()
prep: 9.216us, xfer pass: 112.256us, idx: 53.152us, crypt: 376.536ms, result: 1.248us, mask xfer: 52.960us + 6.144us
gws: 40960 981137Kc/s 981137062 rounds/s 376.771ms per crypt_all()
0: OpenCL CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error in opencl_rawsha512_gpl_fmt_plug.c:722 - failed in clEnqueueNDRangeKernel I
(error occurred)
Local worksize (LWS) 256, global worksize (GWS) 10240 (40 blocks)
DONE
Raw: 969910K c/s real, 969910K c/s virtual, GPU util: 100%
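These figures are internally consistent: c/s works out to GWS times the internal mask multiplier (9025 above) divided by the crypt_all() duration. For example:

```
GWS 10240: 10240 × 9025 = 92,416,000 keys / 93.251 ms ≈ 991M c/s
GWS 40960: 40960 × 9025 = 369,664,000 keys / 376.771 ms ≈ 981M c/s
```

So beyond GWS 10240 the kernel duration scales linearly and the throughput plateaus.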
Well, we can retry changing Maj, ror, Sigma0, ... as in ba8fe8166482a51569a114829f31c48f0691dc79.
Magnum, what can you say about your tests?
Lut3 is useless according to my tests.
LUT3 is everything but useless, we just don't need to use it explicitly. My experimental LUT3-64 (consisting of 2xLUT3-32) that you referenced hasn't proven good anywhere. And with newer drivers, explicit LUT3 in general doesn't give much because LUT3 is used anyway, and likely better, by the optimizer.
Disregarding LUT3 though, there are now several alternative sigma functions in opencl_sha2.h - it could be wise to try them all out (changing to #if 0 ... #elif 1 and so on). They are optimized for various cases - destructive instructions or not, hardware rotate or not, and so on.
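For reference, every variant has to compute the same FIPS 180-4 functions; the selection mechanism described looks roughly like this (a sketch in the spirit of opencl_sha2.h, not a verbatim copy; SHA-512's Sigma0 rotation counts are 28/34/39):

```c
/* OpenCL rotate() is a left rotate, so a right rotate by n is: */
#define ror64(x, n) rotate(x, (ulong)(64 - (n)))

#if 1
/* Straightforward form: three rotates, two XORs; good with hardware rotate */
#define Sigma0(x) (ror64(x, 28) ^ ror64(x, 34) ^ ror64(x, 39))
#elif 0
/* Factored form: same value, since ror(x,34) = ror(ror(x,6),28) and
 * ror(x,39) = ror(ror(x,11),28); may help when rotates are emulated */
#define Sigma0(x) ror64(x ^ ror64(x ^ ror64(x, 5), 6), 28)
#endif
```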
I wonder why we need several hundreds of milliseconds to reach lift-off GWS? That's not needed in NT, nor in hashcat, is it?
That's just 400 MB, shouldn't be much of a problem, or do I miss something?
It is 400 * 3 (I missed the 3 x in the example).
So it's 1.2 GB, still no big deal. A weak card with too little memory is not likely to need that high a GWS anyway, so it's self-regulating.
The 1.2 GB is for GWS = 10240. Since the GTX 1080 max memory alloc. size is 2029 MB, it won't go to GWS = 20480. This is one of the motivations for the question https://github.com/magnumripper/JohnTheRipper/issues/3613#issuecomment-456083284
Blame d7a8a22#diff-136a558de24bf7376d97f8a2b7dd6ea3
I wonder why we need several hundreds of milliseconds to reach lift-off GWS? That's not needed in NT, nor in hashcat, is it?
NT is faster. SHA512 needs 376.771 ms to test 40K keys to reach 980M.
Maybe I'm missing out on something, but I do not see any shortcuts.
Well, personally, I'm not sure how to use GPU mask mode at full speed. Maybe I'm using it incorrectly.
So, I use trial and error to see what mask seems best. E.g.:
$ "$JtR" -test=5 --format=raw-sha512-rev-opencl -dev=0 --mask=?a?a?a?a
Device 0: gfx900 [Radeon RX Vega]
Benchmarking: raw-SHA512-rev-opencl [SHA512 OpenCL/mask accel]... DONE
Raw: 490002K c/s real, 6793M c/s virtual
$ "$JtR" -test=5 --format=raw-sha512-rev-opencl -dev=0 --mask=?l?l?l?l
Device 0: gfx900 [Radeon RX Vega]
Benchmarking: raw-SHA512-rev-opencl [SHA512 OpenCL/mask accel]... DONE
Raw: 139518K c/s real, 11998M c/s virtual
SHA512 needs 376.771ms to test 40K keys that reaches 980M.
I believe you have a too-large internal mask target. It's unreasonable to need 20M keys and that large a duration to hide latencies for a relatively slow format. NT only needs opencl_speed_index(gpu_id) / 300, which means a target mask multiplier of 14788 on a 1080. It seems to end up at a GWS of 20480 at a multiplier of 10890, with a kernel duration of only 12.7 ms.
SHA-512 is much slower so should basically need less. There's got to be a better way of getting more out of it.
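As a sanity check on those numbers, assuming opencl_speed_index() comes down to roughly shader count times clock in MHz (an assumption on my part, but it matches here):

```
GTX 1080: 2560 shaders × 1733 MHz ≈ 4,436,480
4,436,480 / 300 ≈ 14788, the NT target multiplier quoted above
```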
Hashmode: 1700 - SHA2-512
Speed.#4.........: 1055.3 MH/s (79.13ms) @ Accel:128 Loops:128 Thr:256 Vec:1
hashcat gets over 1 G with 80 ms duration...
Added 1.9.0-jumbo-1 milestone for us to consider doing something about this issue in time for the release - e.g., printing a message suggesting use of _GPU_AUTOTUNE_LIMIT=500. Any other ideas?
I’d rather have it autotune fine without any extra options. Let’s revert the changes in the few formats where we’ve seen regression: is that just raw-sha512, or any others?
At some point in the future, though, we should try to find out why we need so many threads/such long durations to hide latencies while hashcat doesn’t!
Let’s revert the changes in the few formats where we’ve seen regression
What changes do you refer to? I was under impression that a global change "in the autotune limit" (where is that?) affected a few formats. So do we somehow override that reduction of the limit just for those formats? Where would we do that? Perhaps this actually means adding code to selectively override the limit, rather than literally reverting any code change?
Elsewhere on the same topic, Claudio just wrote:
"We can change the default value (tuning time) to 200 or to 500, for example.
Split the kernel to make it run faster seems inappropriate to me. So, I won't do it before mask fixes.
I assumed the problem was the specific change (limit decreased from 500 to 100) to raw-sha512 in the mentioned commit.
Per my testing, raw-sha256-opencl is affected as well. @claudioandre-br Please increase its limit to 500ms as well. Thanks!
[solar@well run]$ _GPU_AUTOTUNE_LIMIT=500 ./john -test -form=raw-sha256-opencl -v=4 -mask
Note: Self-tests currently not performed when using -mask with -test
Device 1: Tahiti [AMD Radeon HD 7900 Series]
Benchmarking: raw-SHA256-opencl [SHA256 OpenCL/mask accel]... Internal mask, multiplier: 6760 (target: 6826)
Local worksize (LWS) 256, global worksize (GWS) 16384 (64 blocks)
DONE
Raw: 794898K c/s real, 15103M c/s virtual, Dev#1 util: 64%
[solar@well run]$ ./john -test -form=raw-sha256-opencl -v=4 -mask
Note: Self-tests currently not performed when using -mask with -test
Device 1: Tahiti [AMD Radeon HD 7900 Series]
Benchmarking: raw-SHA256-opencl [SHA256 OpenCL/mask accel]... Internal mask, multiplier: 6760 (target: 6826)
Local worksize (LWS) 128, global worksize (GWS) 8192 (64 blocks)
DONE
Raw: 413268K c/s real, 13844M c/s virtual, Dev#1 util: 64%
Fixed with #3855.
Blame https://github.com/magnumripper/JohnTheRipper/commit/d7a8a228b8bc80eee5f9680b779679da6b3e595d#diff-136a558de24bf7376d97f8a2b7dd6ea3
I really have no idea how, but I would like to keep the source code as is and advertise the auto-tune "option".