Closed by magnumripper 10 years ago
Fixed in a2c25dc, and passes with no problems after magnumripper/JohnTheRipper@da1074c0
I am working on using this dictionary for other UTF-8 encodings.
One thing I am finding is that we have pretty much hard-coded the max UTF-8 character size at 3 bytes. There is code in many places (I am starting with dyna). In dyna, I have *=3. In the format self test, we have ml/=4 (max length). I am not sure how wide-ranging the 3-byte limitation is, but I bet it is very invasive; probably trivial to fix, but invasive.
I started with dyna_29. I chopped off the long pure-ASCII strings from sapf_utf8.dic, then had to unique the file (there are dupes in it). When I built a 1500-line input file with a 27-char max, I only process 1493 of them. The 7 left over are long strings made of 4-byte values. I am still investigating why these fail. Once I get this working here (dyna_29 SSE build), I will have a better idea of just what steps will be required to make JtR's UTF-8 work properly with 4-byte UTF-8 chars.
The x3 normally supports 4-byte characters too. For example, the real limit for an NT hash is 27 shorts. Three-byte UTF-8 becomes one short, and that is the dimensioning figure. Four-byte UTF-8 becomes two shorts, because UTF-16 cannot cover such characters without itself resorting to "multi-short", aka surrogates. So even with 4-byte UTF-8 our bumped plaintext length will be enough for max length (which is not 27 then, but 13 and a half :-)
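The surrogate math above is easy to verify; here is a minimal Python 3 sketch (the sample character is my choice, not taken from this thread):

```python
# A character above U+FFFF takes 4 bytes in UTF-8 and needs a
# surrogate pair (two 16-bit code units, i.e. two "shorts") in UTF-16.
ch = "\U00020C78"  # a CJK Extension B character

print(len(ch.encode("utf-8")))           # 4 bytes of UTF-8
print(len(ch.encode("utf-16-le")) // 2)  # 2 UTF-16 code units
```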
These 4-byte UTF-8 characters are much harder. They require 2 UTF-16 code units each. I might have to make changes in my perl script to handle them properly. The 7 that are not handled are <= 27 Unicode characters, BUT take more than 27 UTF-16 code units to store.
Uggg.
You are right here. I think the change will have to be done at the pass_gen.pl level. If I specify -maxlen=27 for UTF-8, that really means max length = 54 bytes NO MATTER what. So within pass_gen.pl, we will have to be able to figure out how many bytes each character will require in the end: some take 2 bytes of UTF-16, some take 4.
But I think JtR requires no changes.
Actually, if I do not 'fix' pass_gen, then I will simply pull out the 'failures' by hand (I have done this before). Or I can make the input file large enough for 1500 non-SSE hashes, get it working properly at 1500, and then make sure that the 'right' count is determined for the narrow SSE buffers.
In pass_gen.pl, I think we only have to 'fix' this line:
my $linelen = length($_);
I will have to write a my_length() function that handles these 4-byte characters properly when in UTF-8 mode. Actually, instead of character count (27 UTF-16), I think we should look at byte count (i.e. does it fit in 54 bytes, etc). That way a mix of 2- and 4-byte items would be dealt with properly.
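The byte-count idea can be sketched as follows; this is an illustration in Python, not pass_gen.pl code, and the helper name is mine:

```python
def fits_utf16_budget(s, max_chars=27):
    """True if s fits in max_chars UTF-16 code units (2 bytes each)."""
    # utf-16-le emits no BOM: BMP characters are 2 bytes, characters
    # above U+FFFF become a 4-byte surrogate pair, so the byte count is
    # exactly twice the UTF-16 code-unit count.
    return len(s.encode("utf-16-le")) <= 2 * max_chars

print(fits_utf16_budget("a" * 27))           # True: 27 code units
print(fits_utf16_budget("\U00020C78" * 14))  # False: 28 code units
```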
I put this line into the pass_gen.pl file:
print "len=".length($_)." from _${_}_\n";
then when run, I get this:
$ ./pass_gen.pl -utf8 -count 1500 -maxlen 27 -dictfile sapf_utf8.dic dynamic=29 < ../jtr-ts/missed.dic
#!comment: Built with pass_gen.pl using -codepage-UTF-8 mode, 0 to 27 characters. dict file=sapf_utf8.dic
** Here are the hashes for format **
len=16 from _𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u0-dynamic_29:$dynamic_29$232fe2dbf22e57e63442c8b7c1c97f0a:0:0:𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=17 from _𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u1-dynamic_29:$dynamic_29$e08f57223c1cb0f69939a0e22fba16b9:1:0:𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=20 from _𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒_
u2-dynamic_29:$dynamic_29$a5df3aa73256944f47a0ab36b9a5e812:2:0:𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒::
len=20 from _𢳂𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍_
u3-dynamic_29:$dynamic_29$6ab0a31c20266b40cc387649bcffe5c8:3:0:𢳂𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍::
len=14 from _𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u4-dynamic_29:$dynamic_29$c167f5f522d77f231814ffe1b4eddb0e:4:0:𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=20 from _𠜎𠜱𠝹𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u5-dynamic_29:$dynamic_29$c45ce126efdbeafb4a0a5390627a7f69:5:0:𠜎𠜱𠝹𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=20 from _𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒𩶘_
u6-dynamic_29:$dynamic_29$669ee0b6ba7da23f3dbc946913e6ef84:6:0:𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒𩶘::
So we can see that length($_) reports way under 27 characters for each of these problem strings of 4-byte Unicode chars. I think I can get this 'fixed'.
This is not a super easy thing to handle. Here is what we deal with: I have built this function (for pass_gen.pl). Can you think of a better way to do this? It returns the number of UTF-16 (2-byte) code units required. It is a replacement for length($s):
sub jtr_unicode_corrected_length {
    my $base_len = length($_[0]);
    if ($arg_codepage ne "UTF-8") { return $base_len; }
    # ok, we need to check each letter, and see if it takes 4 bytes to store. If so,
    # then we add an extra character charge against that char (from 1 to 2 utf-16
    # code units). All 1-, 2- and 3-byte characters were already handled by the length() call.
    my $final_len = $base_len;
    for (my $i = 0; $i < $base_len; $i += 1) {
        my $s = substr($_[0], $i, 1);
        my $ch_bytes = Encode::encode_utf8($s);
        if (length($ch_bytes) == 4) { $final_len += 1; }
    }
    return $final_len;
}
This 'seems' to work properly:
len=16 mylen=32 from _𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u0-dynamic_29:$dynamic_29$232fe2dbf22e57e63442c8b7c1c97f0a:0:0:𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=17 mylen=34 from _𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u1-dynamic_29:$dynamic_29$e08f57223c1cb0f69939a0e22fba16b9:1:0:𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=20 mylen=40 from _𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒_
u2-dynamic_29:$dynamic_29$a5df3aa73256944f47a0ab36b9a5e812:2:0:𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒::
len=20 mylen=40 from _𢳂𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍_
u3-dynamic_29:$dynamic_29$6ab0a31c20266b40cc387649bcffe5c8:3:0:𢳂𢴈𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍::
len=14 mylen=28 from _𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u4-dynamic_29:$dynamic_29$c167f5f522d77f231814ffe1b4eddb0e:4:0:𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=20 mylen=40 from _𠜎𠜱𠝹𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭_
u5-dynamic_29:$dynamic_29$c45ce126efdbeafb4a0a5390627a7f69:5:0:𠜎𠜱𠝹𠱓𠱸𠲖𠳏𠳕𠴕𠵼𠵿𠸎𠸏𠹷𠺝𠺢𠻗𠻹𠻺𠼭::
len=20 mylen=40 from _𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒𩶘_
u6-dynamic_29:$dynamic_29$669ee0b6ba7da23f3dbc946913e6ef84:6:0:𢵌𢵧𢺳𣲷𤓓𤶸𤷪𥄫𦉘𦟌𦧲𦧺𧨾𨅝𨈇𨋢𨳊𨳍𨳒𩶘::
len=6 mylen=6 from _burçin_
u7-dynamic_29:$dynamic_29$2264716d785ebbdcf6e5c7d416154aca:7:0:burçin::
The last one is a string with 5 ASCII bytes and one UTF-8 character that fits in a single UTF-16 code unit. The strings in the first set take 4 bytes for each character.
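As a cross-check, the "charge 4-byte characters twice" rule is exactly the UTF-16 code-unit count; here is a Python sketch of the same computation (not part of pass_gen.pl):

```python
def utf16_units(s):
    # Same rule as the Perl function above: 1 unit for a BMP character,
    # 2 units for a character above U+FFFF (4 bytes in UTF-8).
    return sum(2 if ord(c) > 0xFFFF else 1 for c in s)

s = "\U00020C78" * 16   # 16 characters, each 4 bytes in UTF-8
print(len(s))           # 16: what Perl's length() reports
print(utf16_units(s))   # 32: the corrected length
assert utf16_units(s) == len(s.encode("utf-16-le")) // 2
```

The first two printed values match the "len=16 mylen=32" line above.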
Magnum, what is your opinion? I think this is the right way to go. JtR requires no changes for these longer characters, IF we use the encoding functions.
I changed if(length($ch_bytes)==4) into if(length($ch_bytes)>3)
I have committed 3b7dbf9: pass_gen.pl now computes string lengths in a 'JtR friendly' manner. In UTF-8 mode, the lengths still 'want' a number of characters; however, the larger 4-byte characters get charged as 2 characters, since they use twice the memory. Now a string with 10 ASCII chars, 10 three-byte UTF-8 and 6 four-byte UTF-8 will be listed as NOT able to fit into a 27-char buffer. That string takes 32 characters to hold, not the 26 that the length() function would have claimed.
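The arithmetic in that example works out like this (a quick Python check):

```python
ascii_chars = 10  # 1 UTF-16 code unit each
three_byte  = 10  # 3-byte UTF-8 still fits in 1 UTF-16 code unit
four_byte   = 6   # 4-byte UTF-8 needs 2 code units (a surrogate pair)

naive   = ascii_chars + three_byte + four_byte      # plain length()
charged = ascii_chars + three_byte + four_byte * 2  # corrected length

print(naive)    # 26: looks like it fits a 27-char buffer
print(charged)  # 32: actually does not fit
```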
I will carry forward using the sapf_utf8.dic file, adding new test cases with it.
From a quick read I see nothing wrong with what you do. Should work like a champ.
Sooner or later though we'll see formats that use UTF-32 internally, and no surrogates. We have functions for that in unused/ so JtR itself will get that support once it's actually needed. For such formats we should just count "characters" again. But let's ignore this until it actually happens.
I know this is not a 'forever' fix. It simply makes things work with the current internal behavior of JtR. The --maxlen=x is something that is directly coded for internal JtR specifics anyway.
I have made some dyna_29 and dyna_33 input files. I built them to use 40 UTF-8 characters stored as 2-byte shorts (that is the max length dyna has set for the non-SIMD code). These input files find 1500 hashes on non-SIMD builds, but only 1416 with SIMD. The 84 'missed' do not fit into a single SIMD buffer's 55 bytes of usable space. Also, we could easily bump the non-SIMD lengths to 60 UTF-8 'chars' (120 bytes). I have not done that, but we 'could'. We would start to bump into JtR's 125-character input line limit, however, since 3*60 surpasses that. I think that is where the 40 UTF-8 'char' limitation came from. I may be behind the times and this limit may have been lifted, but at one time I think this was pretty much 'the' limit for john.
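A quick sanity check of the limit arithmetic above (the 125-byte line limit and the 3-bytes-per-character worst case are taken from the text; the variable names are mine):

```python
LINE_LIMIT = 125          # historical JtR max input line length, in bytes
WORST_BYTES_PER_CHAR = 3  # old assumed UTF-8 worst case per character

print(LINE_LIMIT // WORST_BYTES_PER_CHAR)  # 41: why ~40 chars was the cap
print(60 * WORST_BYTES_PER_CHAR)           # 180: 60 chars would overflow 125
```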
Frank has provided good tests for 1-4 byte UTF-8 up to maximum length. Wrap them up and use them.