openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

GPU mask mode, general discussion #1037

Closed magnumripper closed 8 years ago

magnumripper commented 9 years ago

For future experimental/incomplete commits (eg. when taking a stab at some UTF-16 format), please use the topic branch again so we keep the amount of likely-broken code in bleeding-jumbo as low as possible. But let's merge things as soon as they seem stable so it doesn't diverge too much. Also, we often don't find out about problems until we actually merge (#1036 is a good example).

Let's use this issue for general discussion. For specific problems we should create separate issues.

magnumripper commented 9 years ago

I saw 6-7G c/s on a GTX 980 with the current code. My laptop GT 650M does 300-400M c/s.

magnumripper commented 9 years ago

FWIW I saw this once while running TS

form=raw-md4-opencl               guesses:    0 -show=   0 unk unk : Expected count(s) (1500)  [!!!FAILED!!!  return code 256]
Self test failed (cmp_one(1))

...but I can't reproduce it. This was with Apple's CPU device and TS sets LWS=8 and GWS=64 (although this device will force LWS down to 1).

sayan1an commented 9 years ago

I'm getting some issues with CPU devices too. Apparently GWS=512 and LWS=64 seem to be the minimum.

magnumripper commented 9 years ago

But there is no particular reason there would be such a limit, right? Maybe it's just another barrier needed somewhere or something like that.

magnumripper commented 9 years ago

@Sayantan2048 I think this patch fixes the performance counters for good, please review

diff --git a/src/cracker.c b/src/cracker.c
index 45fb099..4aadc71 100644
--- a/src/cracker.c
+++ b/src/cracker.c
@@ -47,6 +47,7 @@
 #include "recovery.h"
 #include "external.h"
 #include "options.h"
+#include "mask_ext.h"
 #include "mask.h"
 #include "unicode.h"
 #include "john.h"
@@ -747,7 +748,8 @@ static int crk_salt_loop(void)
    } while ((salt = salt->next));

    if (done >= 0)
-       add32to64(&status.cands, crk_key_index);
+       add32to64(&status.cands, crk_key_index *
+                 mask_int_cand.num_int_cand);

    if (salt)
        return 1;

This code assumes mask_int_cand.num_int_cand is always 1 unless GPU generation is active. In particular, it has to be 1 even if mask mode was not used or initialized at all (or always initialized to a degree).

magnumripper commented 9 years ago

BTW we should consider the possibility for overrun here. The result of crk_key_index * mask_int_cand.num_int_cand must fit in 32-bit or we'll need to use add64to64() instead.
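
For illustration, a minimal sketch of the concern, using plain uint64_t rather than John's own int64 helpers. The function names and the sample figures are hypothetical and not taken from the tree. Widening before multiplying keeps the full product, whereas a plain 32-bit multiply wraps silently.

```c
#include <stdint.h>

/* Hypothetical helper, not John's API: widen the operands first so the
 * product of two 32-bit counts is computed in 64 bits. */
static uint64_t total_candidates(uint32_t key_index, uint32_t num_int_cand)
{
    return (uint64_t)key_index * num_int_cand;
}

/* The same product truncated to 32 bits, i.e. what a plain 32-bit
 * multiplication would hand to a 32-bit add. */
static uint32_t total_candidates_32(uint32_t key_index, uint32_t num_int_cand)
{
    return key_index * num_int_cand; /* wraps modulo 2^32 */
}

/* Nonzero if the product no longer fits in 32 bits. */
static int product_overflows_32(uint32_t key_index, uint32_t num_int_cand)
{
    return total_candidates(key_index, num_int_cand) > UINT32_MAX;
}
```

As an assumed example, a batch of 4194304 keys with 26^3 = 17576 internal candidates per key already exceeds 32 bits, while a batch of 1024 keys with the same multiplier does not.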

magnumripper commented 9 years ago

Here's code that is safe for that. However, I think we'll never need that high number for a single crypt call.

    if (done >= 0) {
        int64 totcand;
        mul32by32(&totcand, crk_key_index, mask_int_cand.num_int_cand);
        add64to64(&status.cands, &totcand);
    }

magnumripper commented 9 years ago

This code assumes mask_int_cand.num_int_cand is always 1 unless GPU generation is active.

It always is - but I totally fail to see how/where that happens! So I don't dare committing this.

sayan1an commented 9 years ago

add32to64(&status.cands, crk_key_index * mask_int_cand.num_int_cand);

Are you sure it won't interfere with *pcount as it already updates inside crypt_all(). I mean please check we don't have a situation where mask_int_cand.num_int_cand is multiplied twice, once inside crypt_all() and again inside crk_salt_loop().

sayan1an commented 9 years ago

It always is - but I totally fail to see how/where that happens! So I don't dare committing this.

See line 17 mask_ext.c. It is set to 1 even when there is no mask mode.

magnumripper commented 9 years ago

Are you sure it won't interfere with *pcount as it already updates inside crypt_all()

That I'm sure of. The thing you mention happens once per salt and updates everything but p/s. This one takes care of p/s and is not multiplied by salts.

Ah, yes it's statically initialized to 1. I will merge this now then!

magnumripper commented 9 years ago

154f00d

claudioandre-br commented 9 years ago

Mask has some problems with -dev=cpu. Trying to debug, it always blames /usr/lib/libamdocl64.so

[..] hashes.txt -form=raw-md4-opencl --mask=passwor?l -dev=1
Device 1: AMD Phenom(tm) II X6 1075T Processor
Local worksize (LWS) 32, global worksize (GWS) 262144
Using Mask Mode with internal candidate generation,global worksize(GWS) set to 16384
Loaded 3 password hashes with no different salts (Raw-MD4-opencl [MD4 OpenCL (inefficient, development use only)])
Press 'q' or Ctrl-C to abort, almost any other key for status
Segmentation fault (core dumped)

[Fixed now, works on both CPU and GPU] BTW: the same thing happens in raw-sha256-opencl (I was not able to nail it down); on GPU it works fine.

0g 0:00:00:42 N/A 0g/s 147410Kp/s 147410Kc/s 1031MC/s GPU:57°C util:99% fan:46% aaaicbua..aaavdhua

sayan1an commented 9 years ago

Where was the problem, with mask mode or the format or common opencl code?

claudioandre-br commented 9 years ago

Is it possible that this mask (sometimes) misbehaves? --mask=[Pp][Aa@][Ss5][Ss5][Ww][Oo0][Rr][Dd]

  1. [#ASSWORD] is expected? And it missed the key "password"
0g 0:00:00:00  0g/s 2592p/s 2592c/s 15552C/s GPU:45°C fan:40% #ASSWORD..O###W#RD####w#RD####W#rD####w#rD####W#Rd####w#Rd####W#r
  2. OK: keys [password and P@55w0rD] cracked.
2g 0:00:00:00  8.333g/s 5400p/s 5400c/s 32400C/s GPU:45°C fan:40% password..O###W#RDp###W#RDP###w#RDp###w#RDP###W#rDp###W#rDP###w#r
W
claudioandre-br commented 9 years ago

Where was the problem, with mask mode or the format or common opencl code?

I checked every format allocation (all details). One of them was causing it.

magnumripper commented 9 years ago

[#ASSWORD] is expected?

Off the top of my head, I think a mask like -mask=?l?l?lword will only ever show as ###word in the status lines. The GPU-side of the mask is shown as #'s.
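
To make the split concrete, here is a rough sketch of how one host-supplied key such as ###word could be expanded into its internal candidates. This is not the actual mask_ext.c or kernel code: the function names, the single shared charset and the choice of which position varies fastest are all assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: number of internal candidates per host-supplied key,
 * i.e. charset size raised to the number of placeholder positions. */
static unsigned count_int_cand(size_t charset_len, int n_pos)
{
    unsigned n = 1;

    while (n_pos-- > 0)
        n *= (unsigned)charset_len;
    return n;
}

/* Hypothetical: write internal candidate number idx into out.
 * base is the key as the host sees it (placeholders shown as '#'),
 * pos[] lists the placeholder offsets, and the last listed position
 * varies fastest (an assumption, not necessarily John's ordering). */
static void internal_candidate(char *out, const char *base,
                               const int *pos, int n_pos,
                               const char *charset, unsigned idx)
{
    size_t n = strlen(charset);
    int i;

    strcpy(out, base);
    for (i = n_pos - 1; i >= 0; i--) {
        out[pos[i]] = charset[idx % n];
        idx /= (unsigned)n;
    }
}
```

Under these assumptions the host only calls set_key() once for ###word, and the 26^3 = 17576 variants from aaaword to zzzword come from the index alone, which is why the status line can only show the # placeholders.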

claudioandre-br commented 9 years ago

Ok, I will try to nail the problem with this particular mask.


A side note: adding this mask stuff hurts the benchmark numbers for 'a regular run' a lot. But a real 'regular run' has basically the same performance as it had with the old mask-less source code. I compared sha256 (new and old), md4 and md5 (I guess it makes sense).

I was planning to create two kernels, but that seems useless to a real user (no gain or loss). Does anyone disagree?

magnumripper commented 9 years ago

Once GPU-side mask is universally working, the self-test will benchmark it, and (I guess) show a separate figure for that speed.

claudioandre-br commented 9 years ago

There is something wrong with this mask expansion. Result of analysis (--skip-self-tests and no autotune):

Somehow, what is calling set_key behaves in 2 different ways. And, for some reason, it fails sometimes.

rm ../run/*.pot; LWS=128 GWS=1048576 ../run/john ~/testhashes -form=raw-md4-opencl --mask=[Pp][Aa@][Ss5][Ss5][Ww][Oo0][Rr][Dd] 
Device 0: Juniper [AMD Radeon HD 6700 Series]
Local worksize (LWS) 128, global worksize (GWS) 1048576
Loaded 1 password hash (Raw-MD4-opencl [MD4 OpenCL (inefficient, development use only)])
Press 'q' or Ctrl-C to abort, almost any other key for status
Get key error! 647 647
0g 0:00:00:00  0g/s 3600p/s 3600c/s 3600C/s GPU:43°C fan:40% #ASSWORD..#ASSWORD
Session completed

Above is an example, I used sha256 to get the numbers.

sayan1an commented 9 years ago

When GPU side mask is activated (currently only available on raw-md4-opencl), it is expected that set_keys() is called fewer times as some portion of the key is generated by the format.

The above case seems like a bug to me: Get key error! 647 647. Can you give me the hash that you were supposed to crack in the above raw-md4-opencl example?

Update: I have found an ASAN bug in the core mask mode code. Now fixed in commit a67153c9a3711d773015e3af2ebee216c8c34b40

sayan1an commented 9 years ago

@magnumripper commit a6490321f8ee7718ad is causing performance degradation due to register spilling on 7970 with Catalyst 14.12. Speed is now reduced from 3.9Gc/s to 1.8Gc/s for raw-md4-opencl

magnumripper commented 9 years ago

That's in raw-MD4 or what? That's odd, this is an optimization we should have, and the parens are merely making it more obvious to the compiler. An alternative is actually coding the optimization with a tmp variable.

magnumripper commented 9 years ago

BTW I will try it with 14.9 - the 14.12 is known as the worst driver version ever.

claudioandre-br commented 9 years ago

Can you give me the hash

password -> 8a9d093f14f8701df17732b2bb182c74

claudioandre-br commented 9 years ago

When GPU side mask is activated [..] it is expected that set_keys() is called fewer times

I know that. My point is: when running in GPU mask mode.

sayan1an commented 9 years ago

@magnumripper I tried to build john on well, but it failed.

/tmp/ccpEnmM3.s: Assembler messages:
/tmp/ccpEnmM3.s:434: Error: no such instruction: `vfmadd312sd .LC5(%rip),%xmm0,%xmm2'

Regarding commit a649032, should we wait for the next driver release ?

claudioandre-br commented 9 years ago

Disable native tests:

It will end up with errors like "no such instruction: `vfmadd312sd ...". 
The workaround is to add the option "--disable-native-march" to configure, 
which will stop it from ever adding that compiler option.

sayan1an commented 9 years ago

@claudioandre I can't reproduce this issue with raw-md4-opencl on well using 6770 and the command line options exactly matching yours. It cracks ok. Can you give me more details on how to reproduce this error i.e. Get key error! 647 647

claudioandre-br commented 9 years ago

I just ran it a few times (it happens randomly). [Some recent commit might have solved it, I need to re-test]


Or better, where should I put some printf to show the state of the important variables when the error happens? For example, the loop that calls set_key()

claudioandre-br commented 9 years ago

Seems a recent commit improved the situation. Usually 3 or 4 runs were enough to hit the problem. It is much better now: no error after many more attempts.

I will do a full test soon.

sayan1an commented 9 years ago

You may add a printf at line 1354 in mask.c.

claudioandre-br commented 9 years ago

On well (it is not a real problem, but you need to check and think about it).

$ echo 8a9d093f14f8701df17732b2bb182c74 > ~/testhashes
$ ../run/john ~/testhashes -form=raw-md4-opencl --mask=clau?a?l?l?d?d?d -dev=0 --skip
NVIDIA: failed to execute `/usr/bin/nvidia-modprobe`: Permission denied.
NVIDIA: failed to load the NVIDIA kernel module.
Device 0: Bonaire [AMD Radeon HD 7700 Series]
Build log: 
Warning: md4 kernel has register spilling. Lower performance is expected.

Local worksize (LWS) 64, global worksize (GWS) 4194304
Using Mask Mode with internal candidate generation,global worksize(GWS) set to 262144
Loaded 1 password hash (Raw-MD4-opencl [MD4 OpenCL (inefficient, development use only)])
Press 'q' or Ctrl-C to abort, almost any other key for status
Get key error! 64219999 64219999
0g 0:00:00:00  0g/s 428133Kp/s 428133Kc/s 428133KC/s GPU:33°C fan:20% clau aa000..clau aa000
Session completed

PS: Should you remove (inefficient, development use only)?

magnumripper commented 9 years ago

@magnumripper I tried to build john on well, but it failed.

That is an unrelated issue. Use --disable-native-tests on that machine.

magnumripper commented 9 years ago

@magnumripper commit a649032 is causing performance degradation

I do not see that with 14.9. See 667c471a - I actually do not see any difference no matter what code we use. However, that commit does revert to the original behaviour. Does it fix the problem on 14.12?

magnumripper commented 9 years ago

However, that commit does revert to the original behaviour

Sorry, it does not. To do that, the first #if should be 1. Feel free to commit that once you verify there is a difference to the better.

sayan1an commented 9 years ago

Sorry, it does not. To do that, the first #if should be 1. Feel free to commit that once you verify there is a difference to the better.

Sorry, I gave you wrong info. Actually, reset --hard a649032 fixed the performance issues; I had misinterpreted that commit as the cause. Main culprit is the subsequent commit 573a8c21afb2b.

magnumripper commented 9 years ago

Main culprit is the subsequent commit 573a8c2

That is very interesting... the only net change is the PUTCHAR macro.

We were using

#define PUTCHAR(buf, index, val) (buf)[(index)>>2] = ((buf)[(index)>>2] & ~(0xffU << (((index) & 3) << 3))) + ((val) << (((index) & 3) << 3))

And now it will be

#define PUTCHAR(buf, index, val) ((uchar*)(buf))[index] = (val)

...but we can change the #if stuff in opencl_misc.h so we end up using same as before. Please verify that this is the actual cause, and feel free to change opencl_misc.h to something like

  #define GETCHAR_MC(buf, index) (((MAYBE_CONSTANT uchar*)(buf))[(index)])
  #define LASTCHAR_BE(buf, index, val) (buf)[(index)>>2] = ((buf)[(index)>>2] & (0xffffff00U << ((((index) & 3) ^ 3) << 3))) + ((val) << ((((index) & 3) ^ 3) << 3))

- #if no_byte_addressable(DEVICE_INFO) || !defined(SCALAR) /* 32-bit stores */
+ #if no_byte_addressable(DEVICE_INFO) || !defined(SCALAR) || gpu_amd(DEVICE_INFO)
+ /* 32-bit stores */
  #define PUTCHAR(buf, index, val) (buf)[(index)>>2] = ((buf)[(index)>>2] & ~(0xffU << (((index) & 3) << 3))) + ((val) << (((index) & 3) << 3))
  #define PUTCHAR_G       PUTCHAR
  #define PUTCHAR_L       PUTCHAR

This may help other formats too.
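
As a sanity check that the two PUTCHAR variants are interchangeable as far as the data goes, here is a small host-side C sketch (not OpenCL, and not part of opencl_misc.h). The macro names PUTCHAR_RMW and PUTCHAR_CAST and both helper functions are made up for the example, and the read-modify-write macro gets an extra cast on val so it behaves with plain char arguments. The equivalence only holds on a little-endian host.

```c
#include <stdint.h>
#include <string.h>

typedef unsigned char uchar;

/* 32-bit read-modify-write store, modelled on the macro quoted above. */
#define PUTCHAR_RMW(buf, index, val) \
    (buf)[(index) >> 2] = ((buf)[(index) >> 2] & \
        ~(0xffU << (((index) & 3) << 3))) + \
        ((uint32_t)(uchar)(val) << (((index) & 3) << 3))

/* Byte store through a cast, modelled on the shorter macro. */
#define PUTCHAR_CAST(buf, index, val) ((uchar *)(buf))[index] = (uchar)(val)

/* Nonzero on a little-endian host. */
static int host_is_little_endian(void)
{
    const uint32_t probe = 1;

    return *(const uchar *)&probe == 1;
}

/* Store the same bytes with both macros, starting at an unaligned
 * offset, and report whether the two buffers end up identical. */
static int putchar_variants_agree(void)
{
    uint32_t a[4] = { 0x11111111u, 0x22222222u, 0x33333333u, 0x44444444u };
    uint32_t b[4] = { 0x11111111u, 0x22222222u, 0x33333333u, 0x44444444u };
    const char *s = "password";
    unsigned i;

    for (i = 0; s[i]; i++) {
        PUTCHAR_RMW(a, i + 3, s[i]);
        PUTCHAR_CAST(b, i + 3, s[i]);
    }
    return memcmp(a, b, sizeof(a)) == 0;
}
```

So any speed difference between the two comes from how the compiler schedules registers and stores, not from what ends up in the buffer.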

sayan1an commented 9 years ago

- #if no_byte_addressable(DEVICE_INFO) || !defined(SCALAR) /* 32-bit stores */
+ #if no_byte_addressable(DEVICE_INFO) || !defined(SCALAR) || gpu_amd(DEVICE_INFO)
+ /* 32-bit stores */

Now committed. 71ff89dfca68ce

claudioandre-br commented 9 years ago

It seems mask mode is failing to compute the size of the task / percentage done.

[...]
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:34:22 86.76% (ETA: 18:23:30) [...]
0g 0:00:41:34 N/A [...]

magnumripper commented 9 years ago

It seems mask mode is failing to compute the size of the task / percentage done.

Yes I have seen that but haven't had time to dig into it. It only happens in some situations and I'm not sure which. Some examples that have to calculate differently are:

magnumripper commented 9 years ago

Right when exhausting eg. ?d?d?d?d?d?d?d?d?d?d (5-10s):

MD4 mask  Get key error! 824959999 824959999

(for some reason the figures vary, even when using same mask)

Apparently that "Get key error" almost always happens when exhausting a mask. Shouldn't this be pretty easy to mute?

claudioandre-br commented 9 years ago

That is very interesting... the only net change is the PUTCHAR macro.

Unfortunately, the AMD compiler (only for GCN; for old hardware it behaves much better) acts like a fortune teller. I mean, the way you write the code can affect the resulting performance a lot (note: not the algorithm, the "style").

#define PUTCHAR(buf, index, val) (buf)[(index)>>2] = ((buf)[(index)>>2] & ~(0xffU << (((index) & 3) << 3))) + ((val) << (((index) & 3) << 3))

The AMD compiler is eager to save temp values in registers, so it uses a lot more registers.

#define PUTCHAR(buf, index, val) ((uchar*)(buf))[index] = (val)

This results in a good number of registers for a raw hash, but it spills.


claudioandre-br commented 9 years ago

That is very interesting

BTW: I would not say interesting, I would say creepy, daunting.

magnumripper commented 9 years ago

Creepy is the word! I would have bet all my money on the longer macro resulting in register spilling rather than the shorter "cast" one.

To complete the weirdness, with some kernels the cast macro is better (still on GCN): Especially PBKDF2 and iterated one-block sha1's (eg. office formats) if memory serves me. And likely RAR3 too.

claudioandre-br commented 9 years ago

I saw this sometimes (only on super's Tahiti). Might be nothing, but, since you both are using 0x80 as a special marker, I'm documenting it.

Self test failed (max. length in index 1: wrote 55, got 55 back: 
0X80_IS_NOT_EOW�SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS 
0X80_IS_NOT_EOW�BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB)
magnumripper commented 9 years ago

That test was introduced to stop people from writing functions that scan a Merkle-Damgård buffer from the start of the word and treat the first 0x80 as End of Word, which is totally buggy. You may do it that way only if you ensure that the next byte is 0x00.

But this was probably some other problem. What format was this?
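
To spell out the 0x80 pitfall, here is a rough host-side sketch, assuming a single 64-byte MD4/MD5-style block with the 64-bit little-endian bit length kept at offset 56. The function names are hypothetical and no format in the tree is claimed to look like this. The naive scan stops at a 0x80 that is part of the key, while reading the stored length does not.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pack a key (len <= 55) into one 64-byte block: key bytes, the 0x80
 * padding byte, zero fill, then the bit length as 64-bit little-endian. */
static void md_pack(uint8_t block[64], const uint8_t *key, size_t len)
{
    uint64_t bits = (uint64_t)len * 8;
    int i;

    memset(block, 0, 64);
    memcpy(block, key, len);
    block[len] = 0x80;
    for (i = 0; i < 8; i++)
        block[56 + i] = (uint8_t)(bits >> (8 * i));
}

/* Buggy approach: treat the first 0x80 as end of word. */
static size_t len_naive(const uint8_t block[64])
{
    size_t i = 0;

    while (i < 56 && block[i] != 0x80)
        i++;
    return i;
}

/* Alternative: recover the length from the stored bit count. */
static size_t len_from_field(const uint8_t block[64])
{
    uint64_t bits = 0;
    int i;

    for (i = 7; i >= 0; i--)
        bits = (bits << 8) | block[56 + i];
    return (size_t)(bits / 8);
}
```

With the self-test key from the failure message (0X80_IS_NOT_EOW, a 0x80 byte, then 39 filler characters, 55 bytes in total) the naive scan returns 15 and the length field returns 55.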

claudioandre-br commented 9 years ago

Might be related to the get_key() / set_key() misbehavior seen in some circumstances.

GWS=262144 ../run/john ~/testhashes -form=raw-sha256-opencl -mask:?l?l?l?l?l?l?a -dev=0  -verb=5

magnumripper commented 9 years ago

So how do we handle UTF-16 formats with GPU-side mask? CPU-side mask can use an intermediate encoding, so mask mode is 8-bit using a specific code-page. Then that's fed to the format which does the conversion accordingly if needed.

We do have code for GPU-side conversions of codepage/UTF-8 to UTF-16 but this will slow down the very fastest of formats (ie. NT). The alternative is to still use an intermediate encoding's charset, but have it as already-prepared UTF-16 on GPU side. So an "a" in ?l would be 0x0061 and a Euro sign in ?s would be 0x20ac.
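
A minimal sketch of the second alternative, assuming CP1252 as the intermediate encoding and handling only ASCII, the Euro sign at 0x80 and the 0xA0..0xFF range. All function names are made up for illustration, and the rest of the 0x80..0x9F range is deliberately left unmapped. The charset is converted once on the host, so the device would only copy ready-made 16-bit units.

```c
#include <stddef.h>
#include <stdint.h>

/* Partial CP1252 to UTF-16 mapping (assumption: only the cases noted
 * above are handled; anything else yields U+FFFD). */
static uint16_t cp1252_to_utf16(uint8_t c)
{
    if (c < 0x80 || c >= 0xA0)
        return c;           /* same code point as ASCII / Latin-1 */
    if (c == 0x80)
        return 0x20AC;      /* Euro sign */
    return 0xFFFD;          /* not covered by this sketch */
}

/* Convert a whole 8-bit charset (e.g. the expansion of ?l or ?s)
 * into prepared UTF-16 code units, once, on the host. */
static void prepare_utf16_charset(uint16_t *out, const uint8_t *cs, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        out[i] = cp1252_to_utf16(cs[i]);
}

/* Serialize code units as UTF-16LE bytes, as an NT-style format
 * would hash them. out must hold 2 * n bytes. */
static void utf16le_bytes(uint8_t *out, const uint16_t *in, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        out[2 * i] = (uint8_t)(in[i] & 0xff);
        out[2 * i + 1] = (uint8_t)(in[i] >> 8);
    }
}
```

For example, the string "bit" becomes the code units 0x0062 0x0069 0x0074, which serialize to the bytes 62 00 69 00 74 00 in little-endian order.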

sayan1an commented 9 years ago

My understanding of UTF-16 is somewhat fuzzy at the moment and I require some guidance regarding handling of UTF-16 characters.

  1. What is the range of values covered by the set '?s' in UTF-16 mode?
  2. How do we specify a custom set in UTF-16, like ?1 or ?2 etc.?
  3. How do we handle UTF-16 on the CPU side? One UTF-16 char as two UTF-8/ASCII chars? On GPU, I suppose it should be implementation dependent and could vary from format to format.
  4. Can you give me an example of a UTF-16 mask? I would like to tinker with it to better understand how it is handled on the CPU side so that we can make a decision regarding the GPU-side mask.
  5. What is codepage conversion? Specifically, what is the bit pattern of a string, say "bit", in UTF-16 and after codepage conversion to UTF-8 in little endian?