openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

GPU mask mode, general discussion #1037

Closed magnumripper closed 8 years ago

magnumripper commented 9 years ago

For future experimental/incomplete commits (eg. when taking a stab at some UTF-16 format), please use the topic branch again so we keep the amount of likely-broken code in bleeding-jumbo as low as possible. But let's merge things as soon as they seem stable so it doesn't diverge too much. Also, we often don't find out about problems until we actually merge (#1036 is a good example).

Let's use this issue for general discussion. For specific problems we should create separate issues.

magnumripper commented 9 years ago
  1. What is the range of values covered by the set '?s' in UTF-16 mode?

Good question, that could be 100,000 characters. I think it's best to keep using an "internal encoding" from the user perspective even though we are not really limited to it. So it'll be like this:

A. User picks (or has as a default) internal encoding CP1234.
B. User picks a mask of ?s.
C. Mask mode decides ?s for CP1234 is [range] (a string encoded in CP1234).
D. GPU part of mask mode gets that same range "string" - but encoded as UTF-16 (possibly it doesn't resemble a string at all, it could be in any format we chose).
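
To make C and D concrete, take the custom set [€$£] from the example further down, with CP1252 as the internal encoding (the byte and code point values are factual; the two-line layout is just for illustration):

C. range as CP1252 bytes:  80 24 a3
D. same range as UTF-16:   20ac 0024 00a3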

Our current CPU-side mask mode works just like that except (D).

  2. How do we specify custom sets in UTF-16, like ?1 or ?2 etc.?

Assuming we go with my answer above, same applies here. We do exactly as we do today, using the internal encoding, and then convert it to a set of UTF-16 code points.

  3. How do we handle UTF-16 on the CPU side? One UTF-16 char as two UTF-8/ASCII chars? On GPU, I suppose it's implementation dependent and could vary from format to format.

Command line is either decoded as UTF-8 or a code page, depending on john.conf or command-line encoding settings. Even when using UTF-8, we have the notion of an internal encoding (which defaults to ISO-8859-1). So everything is quite normal C strings up to and including the format's set_key().

A UTF-16 format (eg. NT) should be "encoding aware". It knows the string sent to set_key() is to be decoded and converts it to a "string" of unsigned shorts. For UTF-8 this is done using a (fast and simple) function, for legacy code pages it's a LUT. For the special case of ISO-8859-1 it's actually just a cast - the character 0xA3 (a pound sign) in UTF-16 is 0x00A3. BTW nt-opencl currently can't handle anything but ISO-8859-1. All other UTF-16 formats in Jumbo can, AFAIK.
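
A minimal sketch of what that decode step amounts to (function names invented; jumbo's real shared functions live in unicode.c):

#include <stdint.h>

/* ISO-8859-1 -> UTF-16 is a plain widening: 0xA3 ('£') becomes 0x00A3. */
static void latin1_to_utf16(const unsigned char *in, uint16_t *out)
{
    while (*in)
        *out++ = *in++;
    *out = 0;
}

/* For a legacy codepage it's one LUT access per character instead
 * (the 256-entry table shape is an assumption of this sketch). */
static void cp_to_utf16(const unsigned char *in, uint16_t *out,
                        const uint16_t cp[256])
{
    while (*in)
        *out++ = cp[*in++];
    *out = 0;
}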

  4. Can you give me an example of a UTF-16 mask?
$ ../run/john -stdout -inp:utf8 -int:cp1252 -1:'[€$£]' -mask:?1 -max-len=1
€
$
£
3p 0:00:00:00 100.00% (2015-04-23 08:15) 13.04p/s £

Instead of -stdout, you can run the above with netntlmv2 or ntlmv2-opencl and follow what happens and where. The latter case fully works but is not very efficient: it converts on GPU, but the transfer is a huge bottleneck. It should use GPU mask.

  5. What is codepage conversion? Specifically, what is the bit pattern of, say, the string "bit" in UTF-16, and after codepage conversion to UTF-8, in little endian?

I don't quite understand your question. See unicode.c for shared generic functions. Also see iconv(1) for verifying stuff:

$ echo müller | hd
00000000  6d c3 bc 6c 6c 65 72 0a                           |m..ller.|
00000008

$ echo müller | iconv -t cp1252 | hd
00000000  6d fc 6c 6c 65 72 0a                              |m.ller.|
00000007

$ echo müller | iconv -t utf-16le | hd
00000000  6d 00 fc 00 6c 00 6c 00  65 00 72 00 0a 00        |m...l.l.e.r...|
0000000e
magnumripper commented 9 years ago

@Sayantan2048 can we not get rid of this warning that is printed whenever the keyspace is exhausted?

Get key error! 90249 90249

It's confusing. If it's needed as an assertion we should try to tweak it so it's muted for the normal no-problem situation.

magnumripper commented 9 years ago

@Sayantan2048 I committed a first version of full Unicode support for NT-opencl in 9cecf81. This version doesn't change the underlying functions - it just decodes on GPU as needed.

Cases like this one work fine (with any supported encoding):

$ ../run/john -form:nt-opencl -dev=2 test.in -enc:utf8 -int:latin1 -mask:?l?L?l?ler
Device 2: Tahiti [AMD Radeon HD 7900 Series]
Rules/masks using ISO-8859-1
Loaded 1 password hash (nt-opencl, NT [MD4 OpenCL])
Press 'q' or Ctrl-C to abort, almost any other key for status
möller           (u0)
1g 0:00:00:01  0.7092g/s 436283p/s 436283c/s 436283C/s #ö##er..#ÿ##er

This fails (UTF-8 character preceding mask place-holders, and no internal encoding):

$ ../run/john -form:nt-opencl -dev=6 test.in -enc:utf8 -mask:mö?l?ler
Device 6: GeForce GTX TITAN X
Loaded 1 password hash (nt-opencl, NT [MD4 OpenCL])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:01  0g/s 352.0p/s 352.0c/s 352.0C/s GPU:39°C util:46% fan:22% mö##er..????????
Session completed

The reason it fails is that mask mode thinks the first ?l should be inserted at pos. 3 (starting from 0) because it thinks 'ö' is two characters (it is indeed two bytes). It should be inserted at pos. 2.
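
A sketch of the missing piece (hypothetical helper, not existing jumbo code): the placeholder position has to be computed in characters rather than bytes, which means not counting UTF-8 continuation bytes (10xxxxxx):

#include <stdio.h>

/* Map a byte offset within a UTF-8 string to a character position. */
static int utf8_char_pos(const unsigned char *s, int byte_off)
{
    int i, pos = 0;

    for (i = 0; i < byte_off; i++)
        if ((s[i] & 0xC0) != 0x80) /* skip continuation bytes */
            pos++;
    return pos;
}

int main(void)
{
    /* "mö?l?ler" in UTF-8: the first ?l starts at byte offset 3... */
    const unsigned char mask[] = "m\xc3\xb6?l?ler";

    /* ...but that is character position 2, where it must be inserted. */
    printf("%d\n", utf8_char_pos(mask, 3)); /* prints 2 */
    return 0;
}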

As soon as you add --internal-encoding the problem goes away:

$ ../run/john -form:nt-opencl -dev=6 test.in -enc:utf8 -mask:mö?l?ler -int:cp1252
Device 6: GeForce GTX TITAN X
Rules/masks using ISO-8859-1
Loaded 1 password hash (nt-opencl, NT [MD4 OpenCL])
Press 'q' or Ctrl-C to abort, almost any other key for status
möller           (u0)
1g 0:00:00:02  0.5000g/s 338.0p/s 338.0c/s 338.0C/s GPU:39°C util:17% fan:22% mö##er..õõõõõõõõ
Use the "--show" option to display all of the cracked passwords reliably
Session completed

(EDIT: b2227bb makes sure you can't run the above without internal encoding)

Performance should be totally unaffected when not actually using UTF-8 or a codepage (it actually builds different kernels). And even when you do, performance is still pretty good.

However, if we changed mask mode's int_keys to be an array of uint16_t instead of uint8_t, we would get rid of the codepage table lookups in the inner loop.
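
Roughly what that change amounts to on the host side (a sketch with assumed names; cf. the cp[] array in opencl_unicode.h mentioned below):

#include <stddef.h>
#include <stdint.h>

/* Pre-expand mask mode's 8-bit candidate characters to UTF-16 once on
 * the host, using the same per-codepage LUT the kernel would otherwise
 * consult, so the GPU inner loop does a plain 16-bit store instead of
 * one table lookup per candidate. */
static void int_keys_to_utf16(const uint8_t *keys8, uint16_t *keys16,
                              size_t n, const uint16_t cp[256])
{
    size_t i;

    for (i = 0; i < n; i++)
        keys16[i] = cp[keys8[i]];
}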

magnumripper commented 9 years ago

performance is still pretty good.

Wow, on the Titan X performance is more or less unaffected even with codepage table lookups in the inner loop. I get over 15 Gp/s using LWS: 256, GWS: 36864 and mask mode either way.

On the 7970, performance drops from 7.5 Gp/s to 5 Gp/s when using internal encoding.

Maybe I should have implemented this prior to CMIYC-2015... :cry:

sayan1an commented 9 years ago

However, if we changed mask mode's int_keys to be an array of uint16_t instead of uint8_t, we would get rid of the codepage table lookups in inner loop.

Help me understand this and correct me where I'm wrong.

Mask mode internally only supports uint8 chars, and I believe encodings requiring uint16 or uint32 are first converted to uint8 chars and then fed into mask mode. So if ö takes a uint16, I suppose it is split into two uint8 chars; mask mode treats these chars as separate and assigns separate locations for them to iterate over. However, NT must treat them as one char and put them at the same location, assuming we're using uint16. I see we're using PUTSHORT macros which stuff uint8 keys, typecast to shorts, into nt_buffer. So the two uint8 chars of ö end up at different locations within nt_buffer. When they should have shared one uint16 char, they are using two! This is where I'm getting confused.

Or better, please explain the whole chain of conversion going on, starting with the initial mask.

magnumripper commented 9 years ago

Here's our current chain (using an internal encoding) for a Euro sign:

  1. Mask and placeholders are converted from UTF-8 to some internal 8-bit codepage (eg. ISO-8859-15 which has "€" as 0xa4) in mask_init.
  2. Throughout core and mask mode, that "€" is obviously just a char like any other.
  3. When we arrive at the PUTSHORT macros, we do a table lookup of cp[0xa4] for ISO-8859-15 which yields 0x20ac. The cp[] array is defined in opencl_unicode.h.

This works perfectly fine, except you need to find an internal encoding that holds all characters you will need (and it's not always possible - for example you might have problems finding a codepage that can hold a string containing a Russian character and a Euro sign).
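
The lookup in step 3 is just an array access. Here's roughly what such a table contains for ISO-8859-15 (the code point values are the real ISO-8859-15 mappings; the array name and the omitted identity entries are artifacts of this sketch):

#include <stdint.h>

/* Entries shown are where ISO-8859-15 differs from Latin-1; all other
 * positions map to themselves (omitted here for brevity). */
static const uint16_t cp_iso8859_15[256] = {
    [0xA4] = 0x20AC, /* € */
    [0xA6] = 0x0160, /* Š */
    [0xA8] = 0x0161, /* š */
    [0xB4] = 0x017D, /* Ž */
    [0xB8] = 0x017E, /* ž */
    [0xBC] = 0x0152, /* Œ */
    [0xBD] = 0x0153, /* œ */
    [0xBE] = 0x0178, /* Ÿ */
};

/* cp_iso8859_15[0xA4] yields 0x20AC, exactly as in step 3. */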

Unicode-aware mask mode would work like this:

  1. Mask and placeholders are converted to UTF-32 in mask_init.
  2. Everything works like today except all "strings", arrays and "chars" are made of uint instead of char/uchar - throughout mask mode. There are no variable lengths - all characters are 32 bits.
  3. When we arrive at the PUTSHORT macros, for NT we'd just do (c & 0xffff) to get UCS-2. No table lookup. But for eg. raw-MD5, where we do want UTF-8 as target encoding, we'd have to convert to UTF-8 here (see the sketch below). That is very cheap though. And actually I see a way to get rid of that too, but that's a later discussion.
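
That UTF-32 to UTF-8 conversion is the standard bit-shuffling, nothing jumbo-specific. A minimal sketch:

/* Encode one UTF-32 code point as UTF-8; returns the number of bytes
 * written (1-4). No surrogate/overlong validation in this sketch. */
static int utf32_to_utf8(unsigned int c, unsigned char *out)
{
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = 0xC0 | (c >> 6);
        out[1] = 0x80 | (c & 0x3F);
        return 2;
    }
    if (c < 0x10000) {
        out[0] = 0xE0 | (c >> 12);
        out[1] = 0x80 | ((c >> 6) & 0x3F);
        out[2] = 0x80 | (c & 0x3F);
        return 3;
    }
    out[0] = 0xF0 | (c >> 18);
    out[1] = 0x80 | ((c >> 12) & 0x3F);
    out[2] = 0x80 | ((c >> 6) & 0x3F);
    out[3] = 0x80 | (c & 0x3F);
    return 4;
}

The variable return value is exactly what causes the problem discussed next.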

But now we'd get the opposite problem for non-Microsoft formats actually using UTF-8 as target encoding: A "€" will expand to ~~two~~ three bytes at that time while an "a" would be just one.

magnumripper commented 9 years ago

But now we'd get the opposite problem for non-Microsoft formats actually using UTF-8 as target encoding: A "€" will expand to two bytes at that time while an "a" would be just one.

To get around this I guess you'd need to introduce another piece of data to the mask struct: "For position n our place-holder will eat m bytes" or something like that. So that one will depend on target encoding: In case of NT it will always be 1 (as in one uint16) while for raw-MD5 it will be 1 for ASCII characters and 2, 3 or 4 bytes for non-ASCII...
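
Something like this, perhaps (names invented, just to pin down the idea):

/* Hypothetical per-position info added to the mask data: how many
 * target-encoding units the placeholder at this position consumes. */
struct mask_pos_width {
    unsigned int pos;   /* insertion position in the key buffer */
    unsigned int width; /* NT: always 1 (one uint16);
                           UTF-8 targets: 1-4 bytes */
};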

magnumripper commented 9 years ago

Hm no, that will not do. Consider the mask a[bö]c for UTF-8 target encoding. For the first candidate, we need to have the initial word prepared as a#c and then we can insert the b. But for the second candidate we'd need to have the initial word prepared as a##c to get room for the two-byte ö. For a three-byte character like € it would even need to be a###c.

Maybe we should just stick to the internal encoding. It really solves most problems, but has its limitations.

frank-dittrich commented 9 years ago

In UTF-8, € is represented by three bytes.
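
The same kind of check as the hexdumps above confirms it (assuming a UTF-8 terminal):

$ printf '€' | hd
00000000  e2 82 ac                                          |...|
00000003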

magnumripper commented 9 years ago

@Sayantan2048 you have implemented your own auto-tune in seven GPU-mask formats, with no shared code at all. That's the opposite direction from what I and @claudioandre have struggled with for a long time. I tried changing NT-opencl to use our shared auto-tune in 2772b8f26 and I see no downsides (actually it seems to work better). I'm planning to do the same with the six others. I can understand if mscash2 needs its own auto-tune (for multi-device support) but for other formats it's really the wrong way to go.

Once we use shared code it'll be a much simpler task implementing GPU-mask-autotune.

sayan1an commented 9 years ago

Thank you, it should and will work for raw hashes. However, for salted hashes, do they set a valid salt before benchmarking? I think you'll also run into problems with descrypt and lm-opencl, where kernels are rebuilt as needed during auto-tune.

sayan1an commented 9 years ago

To get around this I guess you'd need to introduce another piece of data to the mask struct: "For position n our place-holder will eat m bytes" or something like that. So that one will depend on target encoding: In case of NT it will always be 1 (as in one uint16) while for raw-MD5 it will be 1 for ASCII characters and 2, 3 or 4 bytes for non-ASCII...

As my initial thought, I think this complicates the GPU-side mask! For raw-md5, we'll need to do 1, 2, 3 or 4 putchars instead of one if we treat every placeholder as UTF-32. Worst part is, it won't be SIMD friendly.

sayan1an commented 9 years ago

As my initial thought, I think this complicates the GPU-side mask! For raw-md5, we'll need to do 1, 2, 3 or 4 putchars instead of one if we treat every placeholder as UTF-32. Worst part is, it won't be SIMD friendly.

Not exactly, but we'll need too many scalar branches. Ideally those shouldn't cause performance degradation, but when put inside loops they tend to perform very poorly.

magnumripper commented 9 years ago

we'll need too many scalar branches

Yeah I think we should stick to "internal encoding" for now. Simple is beautiful.

for salted hashes, do they set a valid salt before benchmarking? I think you'll also run into problems with descrypt and lm-opencl, where kernels are rebuilt as needed during auto-tune.

The shared auto-tune uses salt and ciphertexts from the test vectors so it should be fine. I won't touch DES or LM; I'm looking at fixing the following:

$ git grep -l "auto_tune("
opencl_mscash_fmt_plug.c
opencl_nsldap_fmt_plug.c
opencl_rawmd4_fmt_plug.c
opencl_rawmd5_fmt_plug.c
opencl_rawsha1_fmt_plug.c
opencl_salted_sha_fmt_plug.c

(I think nsldap and raw-sha1 will be merged in the process, like we did for CPU formats)

magnumripper commented 9 years ago

I did mscash and it seemed to work at first but something's not right with salts (as you said). Just what the heck are you doing in there? Are there good reasons for deviating from the specified interfaces? The shared auto-tune works like a champ for almost 50 formats, salted or not.

Reverted it while investigating.

magnumripper commented 9 years ago

OK, I think I get the picture. I probably wasn't very far from a working version. I'll continue later.

magnumripper commented 9 years ago

Works now. I will polish it a bit more.

@Sayantan2048 perhaps the shared code should include the ability to build a "fake db" out of test vectors? Or would that be too format-specific?

BTW you really should move some of your duplicated code for hash tables and such to shared code, or at least a shared source file (a C header).

magnumripper commented 8 years ago

I'm closing this generic issue now. We'll open specific issues when needed. I've played a lot with GPU mask formats lately, including with various encodings, and they work damn good and damn fast :+1: