openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

Add FMT_UTF16 format flag #1631

Open magnumripper opened 9 years ago

magnumripper commented 9 years ago

This flag indicates that set_key() is to take a ~~UTF32*~~ UTF16* instead of a char* pointer to key. Formats like NT and Oracle will set it.

With e.g. incremental mode using UTF-32 internally, this means we don't have to go through a (slow) UTF-8 conversion just to satisfy the legacy set_key() prototype.

magnumripper commented 9 years ago

See ML discussion: http://www.openwall.com/lists/john-dev/2015/08/09/4

jfoug commented 9 years ago

I thought everything in jtr was UTF-16 (actually UCS2 in many places). Why the UTF-32 ?

magnumripper commented 9 years ago

UTF-32 is the only sane choice. We definitely don't want to handle UTF-16 surrogates within incremental or rules, that's almost as bad as actually processing UTF-8.

The UTF-32 -> UTF-16 conversion (e.g. in NT format) will be very easy and very fast. If we (optionally) just support UCS-2 for speed, it's merely an int->short downcast, comparable in speed to the original char->short upcast.
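
For illustration only, a minimal sketch of that UCS-2 fast path (the typedef names follow JtR's unicode.h, but treat the function itself as an assumption, not existing code):

```c
#include <stdint.h>

typedef uint32_t UTF32;   /* assumed to match the typedefs in unicode.h */
typedef uint16_t UTF16;

/*
 * UCS-2 "fast path": a plain int->short downcast, analogous to the old
 * char->short upcast.  Only valid when every code point fits in the BMP,
 * which is exactly what an optional UCS-2-only mode would guarantee.
 */
static void utf32_to_ucs2(UTF16 *dst, const UTF32 *src, int maxlen)
{
    int i;

    for (i = 0; i < maxlen && src[i]; i++)
        dst[i] = (UTF16)src[i];
    dst[i] = 0;
}
```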

frank-dittrich commented 9 years ago

> I thought everything in jtr was UTF-16 (actually UCS2 in many places). Why the UTF-32 ?

Word lists which mostly contain ascii would also get 4 times as large instead of just 2 times.

jfoug commented 9 years ago

> Word lists which mostly contain ascii would also get 4 times as large instead of just 2 times.

Only internally within the memory footprint of john. The external file would be the same.

I am still not 100% sold on everything running as UTF-32. I find that to be wasteful overkill if the file really is just 7-bit ASCII, or 99% 7-bit ASCII.

magnumripper commented 9 years ago

We could try to fit both 32-bit and 8-bit code in there simultaneously, but it will be significantly harder and a LOT messier. I think the current code stretches the "don't go 32-bit" approach as far as is sensible; the next step is to bite the bullet and go all in.

jfoug commented 9 years ago

We will still need some of the code page support, but hopefully it will be far less encompassing and will be localized in the code for wordlist.c or the loader. We will only need character mapping, NOT any of the classification or casing logic. But adding mapping may be VERY easy, at least for code pages supported by Perl. Here is some code I wrote to grab the UTF-8 bytes from the code pages Perl supports.

#!/usr/bin/perl
use Encode;
my $s;
my $i = 0;
for (; $i < 0x100; ++$i) {
    $s = chr($i);
    my $cp = decode($ARGV[0], $s);
    my $final = encode('UTF-8', $cp);
    # do not print the character if it is a control char (ord <= 31)
    if (defined($final) && length($final)>0 && ord($final) > 31) {
        # print $final and the newline in separate print statements.
        # This is due to EBCDIC, since \n is NOT in slot 0x0a. If we
        # concatenated \n to the final string and it is an EBCDIC string,
        # then we would NOT get a file split with one character per line.
        print $final;
        print "\n";
    }
}
print STDERR "code page $ARGV[0] handled\n";

and here are 'most' of the code pages handled by Perl (including EBCDIC ones, which are a bit ugly since \n is not where you think it should be, lol)

#!/bin/sh
rm -f cp
rm -f cp-chars-all
./cpgen.pl cp37 >> cp
./cpgen.pl cp424 >> cp
./cpgen.pl cp437 >> cp
./cpgen.pl cp500 >> cp
./cpgen.pl cp737 >> cp
./cpgen.pl cp775 >> cp
./cpgen.pl cp850 >> cp
./cpgen.pl cp852 >> cp
./cpgen.pl cp855 >> cp
./cpgen.pl cp856 >> cp
./cpgen.pl cp857 >> cp
./cpgen.pl cp858 >> cp
./cpgen.pl cp860 >> cp
./cpgen.pl cp861 >> cp
./cpgen.pl cp862 >> cp
./cpgen.pl cp863 >> cp
./cpgen.pl cp864 >> cp
./cpgen.pl cp865 >> cp
./cpgen.pl cp866 >> cp
./cpgen.pl cp869 >> cp
./cpgen.pl cp874 >> cp
./cpgen.pl cp875 >> cp
./cpgen.pl cp932 >> cp
./cpgen.pl cp936 >> cp
./cpgen.pl cp949 >> cp
./cpgen.pl cp950 >> cp
./cpgen.pl cp1006 >> cp
./cpgen.pl cp1026 >> cp
./cpgen.pl cp1047 >> cp
./cpgen.pl cp1250 >> cp
./cpgen.pl cp1251 >> cp
./cpgen.pl cp1252 >> cp
./cpgen.pl cp1253 >> cp
./cpgen.pl cp1254 >> cp
./cpgen.pl cp1255 >> cp
./cpgen.pl cp1256 >> cp
./cpgen.pl cp1257 >> cp
./cpgen.pl cp1258 >> cp
./cpgen.pl iso-8859-1 >> cp
./cpgen.pl iso-8859-2 >> cp
./cpgen.pl iso-8859-3 >> cp
./cpgen.pl iso-8859-4 >> cp
./cpgen.pl iso-8859-5 >> cp
./cpgen.pl iso-8859-6 >> cp
./cpgen.pl iso-8859-7 >> cp
./cpgen.pl iso-8859-8 >> cp
./cpgen.pl iso-8859-9 >> cp
./cpgen.pl iso-8859-10 >> cp
./cpgen.pl iso-8859-11 >> cp
./cpgen.pl iso-8859-13 >> cp
./cpgen.pl iso-8859-14 >> cp
./cpgen.pl iso-8859-15 >> cp
./cpgen.pl iso-8859-16 >> cp
./cpgen.pl ascii >> cp
./cpgen.pl US-ascii >> cp
./cpgen.pl ISO-646-US >> cp
./cpgen.pl ISO-646 >> cp
./cpgen.pl ascii-ctrl >> cp
./cpgen.pl latin1 >> cp
./cpgen.pl AdobeStandardEncoding >> cp
./cpgen.pl MacRoman >> cp
./cpgen.pl nextstep >> cp
./cpgen.pl hp-roman8 >> cp
./cpgen.pl MacCentralEurRoman >> cp
./cpgen.pl MacCroatian >> cp
./cpgen.pl MacRomanian >> cp
./cpgen.pl MacRumanian >> cp
./cpgen.pl Latin3 >> cp
./cpgen.pl Latin4 >> cp
./cpgen.pl MacCyrillic >> cp
./cpgen.pl MacUkrainian >> cp
./cpgen.pl Arabic >> cp
./cpgen.pl MacArabic >> cp
./cpgen.pl MacFarsi >> cp
./cpgen.pl Greek >> cp
./cpgen.pl MacGreek >> cp
./cpgen.pl Hebrew >> cp
./cpgen.pl MacHebrew >> cp
./cpgen.pl MacTurkish >> cp
./cpgen.pl MacIcelandic >> cp
./cpgen.pl MacSami >> cp
./cpgen.pl Thai >> cp
./cpgen.pl MacThai >> cp
./cpgen.pl Latin9 >> cp
./cpgen.pl Latin10 >> cp
./cpgen.pl viscii >> cp
./cpgen.pl koi8-f >> cp
./cpgen.pl koi8-r >> cp
./cpgen.pl koi8-u >> cp
./cpgen.pl gsm0338 >> cp
./cpgen.pl euc-cn >> cp
./cpgen.pl gbk >> cp
./cpgen.pl gb12345-raw >> cp
./cpgen.pl gb2312-raw >> cp
./cpgen.pl hz >> cp
./cpgen.pl iso-ir-165 >> cp
./cpgen.pl euc-jp >> cp
./cpgen.pl shiftjis >> cp
./cpgen.pl macJapanese >> cp
./cpgen.pl 7bit-jis >> cp
./cpgen.pl iso-2022-jp >> cp
./cpgen.pl iso-2022-jp-1 >> cp
./cpgen.pl jis0201-raw >> cp
./cpgen.pl jis0208-raw >> cp
./cpgen.pl jis0212-raw >> cp
./cpgen.pl euc-kr >> cp
./cpgen.pl iso-2022-kr >> cp
./cpgen.pl johab >> cp
./cpgen.pl ksc5601-raw >> cp
./cpgen.pl big5-eten >> cp
./cpgen.pl MacChineseTrad >> cp
./cpgen.pl big5 >> cp
./cpgen.pl big5-hkscs >> cp
./cpgen.pl posix-bc >> cp
./cpgen.pl symbol >> cp
./cpgen.pl dingbats >> cp
./cpgen.pl MacDingbats >> cp
./cpgen.pl AdobeZdingbat >> cp
./cpgen.pl AdobeSymbol >> cp
./cpgen.pl GB2312 >> cp
./cpgen.pl macarabic >> cp
./cpgen.pl macgreek >> cp
./cpgen.pl machebrew >> cp
./cpgen.pl macthai >> cp
./cpgen.pl macturkish >> cp
./cpgen.pl macjapanese >> cp
./cpgen.pl mackorean >> cp
./cpgen.pl Cyrillic >> cp
./cpgen.pl macCyrillic >> cp
./cpgen.pl ISO-8859-8 >> cp
./cpgen.pl macThai >> cp
./cpgen.pl US-ASCII >> cp
./cpgen.pl Shift_JIS >> cp
./cpgen.pl EUC-JP >> cp
./cpgen.pl ISO-2022-JP >> cp
./cpgen.pl ISO-2022-JP-1 >> cp
./cpgen.pl EUC-KR >> cp
./cpgen.pl Big5 >> cp
./cpgen.pl GB_2312-80 >> cp
./cpgen.pl EUC-CN >> cp
./cpgen.pl KOI8-U >> cp
./cpgen.pl KOI8-r >> cp
./cpgen.pl KS_C_5601-1987 >> cp
./cpgen.pl ISO-IR-165 >> cp
./cpgen.pl VISCII >> cp
./cpgen.pl UHC >> cp
./cpgen.pl x-windows-949 >> cp
./cpgen.pl GBK >> cp
./cpgen.pl SJIS >> cp
./cpgen.pl CP932 >> cp
./cpgen.pl Windows-31J >> cp
./cpgen.pl Symbol >> cp

run/unique -inp=cp cp-chars-all

magnumripper commented 9 years ago

I'm not quite following what you did with that perl script.

The code page support will be needed for reading files, and for any target encoding used. For example:

UTF-8 wordlist -> rules -> filters -> crk_set_key() -> LM format set_key() (using cp)

The UTF-8 will be converted to UTF-32 in wordlist.c, then stay UTF-32 all the way until cracker.c is about to call the format's set_key(). Just before that, it needs to convert to e.g. CP850. Note that in this very case, the current code is probably much more efficient.
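
A rough sketch of that last conversion step, with a stub cp850_from_unicode() standing in for whatever reverse-mapping table real code would use (all names here are hypothetical, not existing JtR code):

```c
#include <stdint.h>

typedef uint32_t UTF32;   /* assumed to match the UTF32 typedef in unicode.h */

/* Stub reverse mapping; a real table would cover the whole CP850 repertoire. */
static unsigned char cp850_from_unicode(UTF32 c)
{
    if (c < 0x80)
        return (unsigned char)c;   /* ASCII maps 1:1 */
    if (c == 0x00FC)
        return 0x81;               /* U+00FC 'ü' -> CP850 0x81 */
    return 0;                      /* not representable (in this stub) */
}

/* Convert one UTF-32 candidate to CP850 right before the format's set_key(). */
int utf32_to_cp850(char *dst, const UTF32 *src, int maxlen)
{
    int i;

    for (i = 0; i < maxlen && src[i]; i++) {
        unsigned char c = cp850_from_unicode(src[i]);

        if (!c)
            return -1;             /* unmappable character */
        dst[i] = (char)c;
    }
    dst[i] = 0;
    return i;
}
```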

frank-dittrich commented 9 years ago

We have to handle the case that the target code page doesn't include characters that might be read from a wordlist. OTOH, we do have that problem right now as well. How is that handled?

magnumripper commented 9 years ago

That is an issue now and will be one no matter how we rewrite this. It does, and always will, result in garbage candidates.

If we at all support reading non-UTF-8 wordlists, the same applies there: What if it's written in a non-supported codepage like ArmSCII-8? Like today, we'd need to run transparent. A sensible way of handling it would be "convert assuming ISO-8859-1", "process as ASCII" [i.e. ignore non-ASCII when case-toggling and so on] and "convert back assuming ISO-8859-1". This will work pretty much the same as current -enc=raw I think.

jfoug commented 9 years ago

> If we at all support reading non-UTF-8 wordlists

?? I would think forcing users to use iconv to put their wordlists into UTF-8 would be one option, but I bet users would NOT like that much. The main problem is 'shit' wordlists that are a hodgepodge of mixed character sets. Those dirty wordlists still abound around the net, and people like to use them. Yes, if every word in there was properly converted to UTF-8, then wow, it would be SO much better. But how do we help get from point dirty to point UTF-8 clean? Or do we simply not care, and tell users that the word lists need to be in UTF-8 so there is no ambiguity within JtR?

magnumripper commented 9 years ago

I think we should support it, but we could opt not to.

For mixed-encoding wordlists, that "transparent mode" concept is mandatory. However, it won't work well for Unicode hashes like NT. It never has and never will, in any cracker - it simply can't.

BTW a problem with "transparent mode" is that your pot file entries will not be UTF-8. Ideally we should have a field in the .pot file stating that this is the case, and for such entries -show would print e.g.

Administrator:M.ller [4d 81 6c 6c 65 72]

That's "Müller" in CP850.

magnumripper commented 9 years ago

I'm digressing now, but ideally the pot file format would always be

<hash> : <encoding> : <hexdump of plain in target encoding>

That would work with tabs, colons and whatever, and with any encoding including transparent [== raw == unknown] encoding. It would always be totally reproducible.

jfoug commented 9 years ago

We may want to have the .pot file enhanced if we allow anything OTHER than UTF-8 to be stored in it. The enhancement would be to somehow carry information about the encoding.

Lol, you beat me to the punch

magnumripper commented 9 years ago

I have considered various ideas over time:

  1. Simply change the pot file format, adding fields and functionality. Possibly also start defaulting to using TAB as separator. Or simply use XML or CSV (but quoting will be hell). Actually, the "hex dump" approach as above is the safest and simplest (and we could add more fields to it if we wanted).
  2. Use a different pot file for raw mode. This will be transparent for the user, she won't notice the difference. Known encodings will be (by default) in john.pot and raw stuff will be in john.raw.pot - or something like that. When you use -show, it will read both files.
  3. A variation of (2): Keep the current john.pot as today but add more information in a second file. This is tricky and error-prone.

jfoug commented 9 years ago

I actually REALLY like that .pot (.pv2 ?) file layout. It is absolutely concise. If we changed it to this:

hash : \t $V2$ \t encoding \t plain-in-hex : plain

Then we would not have to change the .pot file at all. We could still read the legacy stuff, but if the line contains $V2$ as the found password, and has \tsome-valid-encoding\t following that, then we know this is a V2 line and handle it appropriately. THEN the .pot file could actually have 'mixed' cp data, have the "REAL" data in the file, etc. Actually, would we really 'need' the plain-in-hex?
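
A quick sketch of how a loader could tell such a line from a legacy one, keyed purely on the \t$V2$\t marker described above (a hypothetical helper, not existing code):

```c
#include <string.h>

/* Return 1 if the text after the first ':' starts with the "\t$V2$\t" marker. */
static int pot_line_is_v2(const char *line)
{
    const char *p = strchr(line, ':');

    return p && strncmp(p + 1, "\t$V2$\t", 6) == 0;
}
```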

magnumripper commented 9 years ago

For transparent mode, the plain hex is the only thing that will always be proper.

jfoug commented 9 years ago

I would really poo-poo XML. Yes, you can do anything with it, BUT it is slow, and it is a nightmare for stuff like this, where quoting will have to be done all over the place.

CSV is no better (we already are CSV, but with 2 fields, and with the separator being the 'first' ':' seen on the line).

magnumripper commented 9 years ago

Also, consider the case where a user fed a CP123 wordlist but stated CP234, and by coincidence some LM hashes were cracked. That plaintext-as-hex will show what was ACTUALLY the password, even though the encoding, as recorded, was incorrect (and as a result, a Unicode print from -show will be incorrect too).

Yes, I hate XML. And the current format is just a variation of CSV.

jfoug commented 9 years ago

Should these last few comments be moved to their own topic, possibly as an RFC type?

magnumripper commented 9 years ago

We should probably have discussed all this on john-dev instead...

magnumripper commented 9 years ago

I wish john-dev was a web forum, with GitHub markup :smile:

jfoug commented 9 years ago

It is not bad to get pie-in-the-sky things talked about offline and then bring them to john-dev, but usually only crickets chirp there.

jfoug commented 9 years ago

I hate the email lists. So difficult to find anything. Yes, GitHub is also a cluster when trying to find old stuff, BUT for doing 'hot in the trenches' stuff, it makes following along very easy.

magnumripper commented 9 years ago

This new flag should be named FMT_UTF32, and not only set_key() but also get_key() should be affected.

magnumripper commented 9 years ago

I think this also obsoletes both FMT_UNICODE and FMT_UTF8 flags.

magnumripper commented 9 years ago

So, if set_key can be either of

static void set_key(char *key, int index);

and

static void set_key(UTF32 *key, int index);

What is the canonical way to declare it? I see at least two solutions. One is to declare it as a void pointer in formats.h:

static void set_key(void *key, int index);

and the other is to actually add a function to the struct,

static void set_key(char *key, int index);
static void set_key32(UTF32 *key, int index);

If we go with the latter, any format will probably have one of them as NULL and the other one defined. Actually, that would mean we don't strictly need the FMT_UTF32 flag; we could just test whether fmt->methods.set_key32 is NULL or not.
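
A sketch of that second alternative, where the presence of a set_key32 member doubles as the flag (the struct and helper names are illustrative; the real struct fmt_methods has many more members):

```c
#include <stdint.h>

typedef uint32_t UTF32;

/* Two-slot sketch; real formats would fill in exactly one of these. */
struct methods_sketch {
    void (*set_key)(char *key, int index);     /* legacy 8-bit entry point */
    void (*set_key32)(UTF32 *key, int index);  /* wide entry point, or NULL */
};

/* Caller-side dispatch: testing the pointer replaces the FMT_UTF32 flag. */
static void send_key(struct methods_sketch *m, UTF32 *key32, char *key8, int index)
{
    if (m->set_key32)
        m->set_key32(key32, index);
    else
        m->set_key(key8, index);
}
```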

magnumripper commented 9 years ago

> We have to handle the case that the target code page doesn't include characters that might be read from a wordlist. OTOH, we do have that problem right now as well. How is that handled?

> That is an issue now and will be one no matter how we rewrite this. It does, and always will, result in garbage candidates.

On second thought: with the proposed changes, we'd detect in cracker.c that the target encoding can't hold the needed characters - so we could reject the candidate and never send it to the format. This is a good thing.
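
In code, the reject-early idea might look roughly like this, building on a conversion routine like the CP850 sketch earlier (PLAINTEXT_BUFFER_SIZE here is a placeholder value; everything is illustrative):

```c
#include <stdint.h>

typedef uint32_t UTF32;

#define PLAINTEXT_BUFFER_SIZE 0x80      /* placeholder; the real value is in params.h */

/* Conversion as in the earlier CP850 sketch; returns -1 on unmappable input. */
extern int utf32_to_cp850(char *dst, const UTF32 *src, int maxlen);

/*
 * Convert first and only hand the candidate to the format if every
 * character survived; otherwise drop it without calling set_key().
 */
static int set_key_or_reject(void (*format_set_key)(char *, int),
                             const UTF32 *key, int index)
{
    char cp_key[PLAINTEXT_BUFFER_SIZE];

    if (utf32_to_cp850(cp_key, key, sizeof(cp_key) - 1) < 0)
        return 0;                       /* rejected: not representable */

    format_set_key(cp_key, index);
    return 1;
}
```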

jfoug commented 9 years ago

> So, if set_key can be either of

I think we need to step back a bit. We might want to totally relook at the interface. It may be time to provide an interface where the conversion code dumps data right into the key space (on set_key). Get_key IMHO is not hot code. Finding a crack is NOT expected, it is an exception. But setting keys IS hot code.

What is the canonical way to declare it?

Switch to a real language :smiling_imp: where issues like this are simply handled by the compiler.

jfoug commented 9 years ago

> We might want to totally relook at the interface.

For example:

Within the converter (which marshals data from the cracker into the format), the format during init() would simply make 'setter' type calls into the converter, providing it with all the information required to properly load the data (number of buffers, layout of buffers, size of buffers, byte ordering of buffers, whether to MD the buffers, provided alignment, whether buffer cleaning is required, etc). We would likely still have to provide the existing interface, where one word at a time is sent to the format to load into its memory layout, for formats which have complex (or extremely simple) data layouts. Formats that simply work on a single password at a time could likely keep the existing naive set_key() layout.
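
Purely as an illustration of that idea (every name below is hypothetical, this is a design sketch and not proposed code), the format's init() could hand the converter a descriptor along these lines:

```c
/* Describes where and how the converter should deposit candidate keys. */
struct key_layout {
    void *base;          /* start of the format's key buffer(s) */
    int   num_buffers;   /* e.g. one buffer per SIMD lane */
    int   buffer_size;   /* bytes available per key slot */
    int   stride;        /* distance in bytes between consecutive slots */
    int   big_endian;    /* byte order the format expects */
    int   alignment;     /* alignment required for base */
    int   clear_slack;   /* whether unused trailing bytes must be zeroed */
};

/* Hypothetical registration call, made once from the format's init(). */
void converter_register_layout(const struct key_layout *layout);
```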

magnumripper commented 9 years ago

> Get_key IMHO is not hot code.

No, but it's logical to return a key in the same format you got it in. And it simplifies format code: the conversion back to UTF-8 would be in cracker.c instead of duplicated in all Unicode formats.

magnumripper commented 9 years ago

BTW, for putting code where it's most effective, we should actually have it as FMT_UTF16. This means set_key() and get_key() will set/get UTF-16 with surrogates - BUT all of core should still have it as UTF-32. This will mean really simple code in formats.

On another note, while doing all this we'll ensure the key delivered from cracker.c is aligned properly for any use (i.e. aligned to 8) no matter whether it's a char* or a UTF16*.
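
For reference, the UTF-32 to UTF-16 conversion a format would perform under that contract is small. A sketch (JtR's own conversion routines live in unicode.c; this is just an illustration):

```c
#include <stdint.h>

typedef uint32_t UTF32;
typedef uint16_t UTF16;

/* UTF-32 -> UTF-16, emitting surrogate pairs for code points above the BMP. */
static int utf32_to_utf16(UTF16 *dst, int dstlen, const UTF32 *src)
{
    int n = 0;

    while (*src) {
        UTF32 c = *src++;

        if (c < 0x10000) {
            if (n + 1 >= dstlen)
                break;
            dst[n++] = (UTF16)c;
        } else {
            if (n + 2 >= dstlen)
                break;
            c -= 0x10000;
            dst[n++] = (UTF16)(0xD800 + (c >> 10));    /* high surrogate */
            dst[n++] = (UTF16)(0xDC00 + (c & 0x3FF));  /* low surrogate */
        }
    }
    dst[n] = 0;
    return n;
}
```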

magnumripper commented 9 years ago

> What is the canonical way to declare it?

> Switch to a real language :smiling_imp: where issues like this are simply handled by the compiler.

Real code don't need a compiler, only an assembler :smiling_imp:

frank-dittrich commented 9 years ago

Why an assembler? http://www.catb.org/jargon/html/story-of-mel.html