openwall / john-tests

Test Suite for John the Ripper

Replace tests for SAP F/G #10

Closed magnumripper closed 10 years ago

magnumripper commented 10 years ago

Frank has provided good tests for 1-4 byte UTF-8 up to maximum length. Wrap them up and use them.

magnumripper commented 10 years ago

Fixed in a2c25dc, and passes with no problems after magnumripper/JohnTheRipper@da1074c0

jfoug commented 10 years ago

I am working on using this dictionary for other UTF-8 encodings.

One thing I am finding is that we have pretty much hard-coded the max size of a UTF-8 char at 3 bytes. There is code in many places (I am starting with dyna). In dyna, I have *=3. In the format self-test, we have ml/=4 (max length). I am not sure how wide-ranging the 3-byte limitation is, but I bet it is very invasive; probably trivial, but invasive.

I started with dyna_29. I chopped the long pure-ASCII strings off sapf_utf8.dic, then had to unique the file (there are dupes in it). When I built a 1500-line input file with a 27-char max, I only processed 1493 of them. The 7 left over are 4-byte values that are long. I am still investigating why these fail. Once I get this working here (dyna_29 SSE build), I will have a better idea of just what steps will be required to make JtR's UTF-8 handling work properly with 4-byte UTF-8 chars.

magnumripper commented 10 years ago

The x3 normally supports 4-byte characters too. For example, the real limit for an NT hash is 27 shorts. Three-byte UTF-8 becomes one short, and that is the dimensioning figure. Four-byte UTF-8 becomes two shorts, because UTF-16 cannot cover such characters without itself resorting to "multi-short" encoding, aka surrogates. So even with 4-byte UTF-8 our bumped plaintext length will be enough for max length (which is not 27 then, but 13 and a half :-)
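
Magnum's point about surrogates can be checked directly. A small illustration (Python, not JtR code) showing that a three-byte UTF-8 character costs one UTF-16 code unit, while a four-byte character costs two:

```python
# Illustration only: compare UTF-8 byte cost vs. UTF-16 code-unit cost.
bmp = "\u20ac"        # EURO SIGN: in the BMP, 3 bytes in UTF-8
supp = "\U00020c78"   # a CJK Extension B char: 4 bytes in UTF-8

for ch in (bmp, supp):
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2  # 2 bytes per code unit
    print(utf8_bytes, utf16_units)
# The BMP char prints "3 1"; the supplementary char prints "4 2",
# i.e. it needs a surrogate pair in UTF-16.
```

This is exactly why 27 shorts hold only "13 and a half" such characters: each one consumes two of the 27 available code units.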

jfoug commented 10 years ago

These 4-byte UTF-8 characters are much harder. They require 2 UTF-16 code units each. I might have to make changes in my perl script to handle these properly. The 7 that are not handled are <= 27 Unicode characters, BUT take more than 27 UTF-16 code units to store.

Uggg.

jfoug commented 10 years ago

You are right here. I think the change will have to be done at the pass_gen.pl level. If I specify -maxlen=27 for UTF-8, that really means max length = 54 bytes NO MATTER what. So within pass_gen.pl, we will have to be able to figure out how many bytes each character will require in the end: some take 2 bytes, some take 4.
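
The byte-budget idea can be sketched like this (Python illustration; the 27-code-unit / 54-byte limit is the one discussed above, and the helper name is made up, not actual pass_gen.pl code):

```python
# Sketch: treat -maxlen=27 as a budget of 27 UTF-16 code units = 54 bytes.
MAX_UTF16_BYTES = 27 * 2

def fits(word: str) -> bool:
    # Each BMP character costs 2 bytes in UTF-16-LE; each
    # supplementary-plane character costs 4 (a surrogate pair).
    return len(word.encode("utf-16-le")) <= MAX_UTF16_BYTES

print(fits("a" * 27))           # 27 ASCII chars -> 54 bytes -> True
print(fits("\U00020c78" * 14))  # 14 four-byte chars -> 56 bytes -> False
```

Counting bytes this way handles any mix of 2-byte and 4-byte items automatically.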

But I think JtR requires no changes.

Actually, if I do not 'fix' pass_gen, then I will simply pull the 'failures' out by hand (I have done this before). Or I make the input file large enough for 1500 non-SSE hashes, get it working properly at 1500, and then make sure the 'right' count is determined for the narrow SSE buffers.

jfoug commented 10 years ago

In pass_gen.pl, I think we only have to 'fix' this line:

my $linelen = length($_);

I will have to write a my_length($1) function that handles these 4-byte characters properly when in UTF-8 mode. Actually, I think instead of character count (27 UTF-16), we should look at byte count (i.e. does it fit in 54 bytes). That way a mix of 2- and 4-byte items would be dealt with properly.

I put this line into the pass_gen.pl file:

print "len=".length($_)."  from _${_}_\n";

then when run, I get this:

$ ./pass_gen.pl -utf8 -count 1500 -maxlen 27 -dictfile sapf_utf8.dic dynamic=29 < ../jtr-ts/missed.dic
#!comment: Built with pass_gen.pl using -codepage-UTF-8 mode, 0 to 27 characters. dict file=sapf_utf8.dic

  ** Here are the hashes for format  **
len=16  from _𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u0-dynamic_29:$dynamic_29$232fe2dbf22e57e63442c8b7c1c97f0a:0:0:𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=17  from _𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u1-dynamic_29:$dynamic_29$e08f57223c1cb0f69939a0e22fba16b9:1:0:𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=20  from _𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒_
u2-dynamic_29:$dynamic_29$a5df3aa73256944f47a0ab36b9a5e812:2:0:𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒::
len=20  from _𢳂𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍_
u3-dynamic_29:$dynamic_29$6ab0a31c20266b40cc387649bcffe5c8:3:0:𢳂𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍::
len=14  from _𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u4-dynamic_29:$dynamic_29$c167f5f522d77f231814ffe1b4eddb0e:4:0:𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=20  from _𠜎𠜱𠝹𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u5-dynamic_29:$dynamic_29$c45ce126efdbeafb4a0a5390627a7f69:5:0:𠜎𠜱𠝹𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=20  from _𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒𩶘_
u6-dynamic_29:$dynamic_29$669ee0b6ba7da23f3dbc946913e6ef84:6:0:𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒𩶘::

So we can see that length($_) is way under 27 characters for each of these problem strings of 4-byte Unicode chars. I think I can get this 'fixed'.

jfoug commented 10 years ago

This is not a super easy thing to handle. Here is what we deal with:

  1. in JtR --utf8 mode, we look at everything in a UTF-16 manner.
  2. ASCII letters are encoded as 2 bytes (UTF-16).
  3. 4-byte (and larger) UTF-8 characters get encoded as 2 UTF-16 code units. This is the problem.

I have built this function (for pass_gen.pl). Can you think of a better way to do this? It returns the number of UTF-16 (2-byte) code units required. It is a replacement for length($s):

sub jtr_unicode_corrected_length { 
    my $base_len = length($_[0]);
    if ($arg_codepage ne "UTF-8") { return $base_len; }
    # ok, we need to check each letter, and see if it takes 4 bytes to store.  If so,
    # then we add an extra character charge against that char (from 1 to 2 utf-16
    # characters).  All 1 or 2 byte characters were already handled by length() call.
    my $final_len = $base_len;
    for (my $i = 0; $i < $base_len; $i += 1) {
        my $s = substr($_[0], $i, 1);
        my $ch_bytes = Encode::encode_utf8($s);
        if (length($ch_bytes) == 4) { $final_len += 1; }
    }
    return $final_len;
}
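
For comparison, the same charging rule can be sketched in Python (an illustration, not part of pass_gen.pl): every character costs 1, and any character needing more than 3 bytes in UTF-8 is charged an extra 1, which matches a UTF-16 code-unit count.

```python
# Python mirror of the Perl function above (illustration only).
def jtr_unicode_corrected_length(s: str) -> int:
    length = len(s)
    for ch in s:
        # Characters outside the BMP take 4 bytes in UTF-8 and
        # two UTF-16 code units, so charge one extra.
        if len(ch.encode("utf-8")) > 3:
            length += 1
    return length

print(jtr_unicode_corrected_length("burçin"))           # 6
print(jtr_unicode_corrected_length("\U00020c78" * 16))  # 32
```

The two test values match the "len=6 mylen=6" and "len=16 mylen=32" lines in the output above.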

This 'seems' to work properly:

len=16 mylen=32  from _𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u0-dynamic_29:$dynamic_29$232fe2dbf22e57e63442c8b7c1c97f0a:0:0:𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=17 mylen=34  from _𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u1-dynamic_29:$dynamic_29$e08f57223c1cb0f69939a0e22fba16b9:1:0:𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=20 mylen=40  from _𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒_
u2-dynamic_29:$dynamic_29$a5df3aa73256944f47a0ab36b9a5e812:2:0:𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒::
len=20 mylen=40  from _𢳂𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍_
u3-dynamic_29:$dynamic_29$6ab0a31c20266b40cc387649bcffe5c8:3:0:𢳂𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍::
len=14 mylen=28  from _𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u4-dynamic_29:$dynamic_29$c167f5f522d77f231814ffe1b4eddb0e:4:0:𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=20 mylen=40  from _𠜎𠜱𠝹𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u5-dynamic_29:$dynamic_29$c45ce126efdbeafb4a0a5390627a7f69:5:0:𠜎𠜱𠝹𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=20 mylen=40  from _𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒𩶘_
u6-dynamic_29:$dynamic_29$669ee0b6ba7da23f3dbc946913e6ef84:6:0:𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒𩶘::
len=6 mylen=6  from _burçin_
u7-dynamic_29:$dynamic_29$2264716d785ebbdcf6e5c7d416154aca:7:0:burçin::

The last one is a string with 5 ASCII chars and 1 two-byte UTF-8 char that fits in 1 UTF-16 code unit. The first set take 4 bytes for each character.

Magnum, what is your opinion? I think this is the right way to go. JtR requires no changes for these longer characters, IF we use the encoding functions.

jfoug commented 10 years ago

I changed if(length($ch_bytes)==4) into if(length($ch_bytes)>3)

jfoug commented 10 years ago

I have committed 3b7dbf9, which makes pass_gen.pl compute string lengths in a 'JtR-friendly' manner. The lengths for UTF-8 mode still 'want' a number of characters. However, the larger 4-byte characters get charged as 2 characters, since they use twice the memory. Now a string with 10 ASCII chars, 10 three-byte UTF-8 chars and 6 four-byte UTF-8 chars will be listed as NOT able to fit into a 27-char buffer: that string takes 32 characters to hold, not the 26 that the length() function would have claimed.
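
A quick check of that arithmetic (the characters are illustrative stand-ins, not from the dictionary):

```python
# 10 ASCII + 10 three-byte UTF-8 + 6 four-byte UTF-8 characters.
word = "a" * 10 + "\u20ac" * 10 + "\U00020c78" * 6

naive = len(word)  # what a plain length() call reports: 26 characters
charged = sum(2 if len(ch.encode("utf-8")) > 3 else 1 for ch in word)
print(naive, charged)  # 26 32 -- over the 27-char buffer after charging
```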

I will carry forward using the sapf_utf8.dic file, adding new test cases with it.

https://github.com/magnumripper/JohnTheRipper/commit/3b7dbf96ffe83442da22c50b55109bf4c3a0078a#diff-07a323cb1820e95e5c40b30048aa3ada

magnumripper commented 10 years ago

From a quick read I see nothing wrong with what you do. Should work like a champ.

Sooner or later though we'll see formats that use UTF-32 internally, and no surrogates. We have functions for that in unused/ so JtR itself will get that support once it's actually needed. For such formats we should just count "characters" again. But let's ignore this until it actually happens.

jfoug commented 10 years ago

I know this is not a 'forever' fix. It simply makes things work with the current internal behavior of JtR. The --maxlen=x option is directly coded to internal JtR specifics anyway.

I have made some dyna_29 and dyna_33 test files. I built the files to use 40 UTF-8 characters of 2-byte UTF-16 values (that is the max length dyna has set for the non-SIMD code). These input files find 1500 for non-SIMD builds, but only find 1416 for SIMD. The 84 'missed' do not fit into a single 55-byte usable buffer space. Also, we could easily bump the non-SIMD lengths of these to 60 UTF-8 'chars' (120 bytes). I have not done that, but we 'could'. We would start to bump into JtR's 125-character input limitation, however, since 3*60 surpasses that. I think that is where the 40 UTF-8 'char' limitation came from. I may be behind the times and this limit may have been lifted, but at one time I think this was pretty much 'the' limit for John.
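
A back-of-envelope check of those limits (the 125-byte figure and the 3-byte worst case are the values from this discussion, not read out of JtR source):

```python
# Worst-case UTF-8 byte cost against the assumed 125-byte input limit.
JTR_MAX_INPUT = 125  # historical JtR plaintext limit, per the discussion

for chars in (40, 60):
    worst_case_bytes = chars * 3  # 3 bytes per char in the worst case
    print(chars, worst_case_bytes, worst_case_bytes <= JTR_MAX_INPUT)
# 40 chars -> 120 bytes, fits; 60 chars -> 180 bytes, does not.
```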