Full UTF-32 support through-out Jumbo

magnumripper commented 9 years ago

Sub issues

[ ] #1627 (Incremental)
[ ] #1628 (Rules)
[ ] #1629 (Mask)
[ ] #1631 (FMT_UTF16)
[x] #1641 (external)
[ ] #1642 (wordlist)
[x] Subsets mode
[ ] none yet: Markov mode
[ ] none yet: (code page <-> unicode converter) Note: MB code pages are complex!

I'm trying to see the whole picture here. Let's say we start with saying that crk_set_key() takes UTF-32. This means all modes need to feed UTF-32 to it.

In that one place in cracker.c, the key will be converted to the target encoding (except for #1631).

After this, we'll just have to get all modes working. All of them should produce UTF-32. For Incremental, Mask and Markov, it's easy. Prince mode is somewhat special (it actually handles UTF-8 without a flaw, by design) so let's deal with it later. Wordlist mode should convert to UTF-32 upon reading the file. From that point (including rules engine) we're all UTF-32 right until format's set_key().

magnumripper commented 9 years ago

The --internal-encoding option will be obsoleted. Lots of conversions back and forth will no longer be needed. Lots of confusing use-cases will no longer be confusing. Virtually ALL current limitations will be lifted.

In some cases, conversions are added but they are typically very fast. We should try to replace some existing copy in current code with a convert-while-copy in new code.

jfoug commented 9 years ago

To be 100% honest, we really should target getting a Jumbo-2 out before starting this. There has been SO much added and fixed since J-1. Once that is done, then go into building full Unicode support.

magnumripper commented 9 years ago

Yes, definitely. But I'm tempted to start experimenting in a topic branch.

jfoug commented 9 years ago

I do not think some creative ideas and even putting some into code is a problem at all. But the more BIG projects like this we start on, the longer we keep pushing J2 out. I might really bite the bullet and try to work on getting J2 done. However, a LOT of outstanding issues are with things like the GPU code. That is areas of john I have little understanding about.

magnumripper commented 9 years ago

I just closed a couple of issues @Sayantan2048 forgot to tick off. https://github.com/magnumripper/JohnTheRipper/issues?q=is%3Aopen+milestone%3A1.8.0-jumbo-2+label%3Abug doesn't show many GPU issues at all.

frank-dittrich commented 9 years ago

External can either be standalone or filter, but currently in no way supports multibyte characters.

jfoug commented 9 years ago

I added issue for extern and for wordlist. What other 'modes' should have issue tags added? Also, I made the dependent issues into a list, so we can check them off as we work on them.

jfoug commented 9 years ago

I have added another one to the list, that is not in an issue yet. It is a CP <-> Unicode conversion module. If we are going all the way (32 bit unicode), then we will certainly want to consider handling this properly. I have been doing quite a bit of research. I think it can be done efficiently and easily, mostly with table lookups, and even be able to auto-generate most of the code/data (like we did to a point with the current CP stuff).

magnumripper commented 9 years ago

Actually for all currently supported codepages we could use the data as-is (just cast the UTF-16 into UTF-32, we currently have no surrogates). But we should initialize all data, not just one codepage. The current code is initialized for a single codepage. After that, all functions can only convert to/from that codepage.

I'd like our new code to be more like convert(dst, dst_enc, src, src_enc) or possibly enc_to_utf32(dst, src, src_enc) and utf32_to_enc(dst, src, dst_enc). However, performance is absolutely the number one priority, not easy coding.

This would enable things like

$ ./john hashfile -w:wordlist.txt -input-enc=cp1252 -rules -format:lm -target-enc=cp850

Conversion from CP1252 to UTF-32 would happen upon reading the wordlist file. Then, all handling would be UTF-32, which makes eg. the rules engine simpler than today - it would be more like the core one, with no "internal encoding" and things like that. Very straightforward. Just before calling format's set_key(), a new conversion would happen that make it CP850.

magnumripper commented 9 years ago

Note: MB code pages are complex

For UTF-8 we already have very fast code that doesn't use table lookup. I'm not sure we need to support any other multi-byte codepage. But sure, some chinese one(s) would probably be cool.

jfoug commented 9 years ago

The MB I think we should look at are:

sjis family (easy, just 1 or 2 bytes.) iso2022 family (harder, 1, 2 or 3 bytes, and multiple esc sequences that put us into 'modes') euc family big5 family hz (NOPE! old 7 bit encoding for usenet for chinese. All based upon esc-start chinese esc-end and anything outside escapes are 'normal' 7 bit ascii. Dead and complex, bad ROI)

utf-8 is a whole different beast, since it really is a 'coding'. The bits themselves compute the 32 bit number. It is a coding, much like any 'compression' algorithm (huffman, lz, etc). It is perfectly easy to represent in a small amount of code. The code pages are NOT straight forward like that. They simply contain an unordered 'set' of values from the unicode universe.

S-JIS is somewhat easy, except that is that all in the family but 7bit-jis can have 7 bit values for 2nd byte, SO you can not easily do random access into a string and know 'what' the first byte you see represents if the high bit is not set. You can figure things out if you can look at 1 byte prior if the 'first' byte has no high bit set.

I am working on some fast code right now. I have not started coding, I am just researching right now. But this is something we 'could' add pretty easily, and I may even make a stand alone converter tool of my own ownership ;)

magnumripper commented 9 years ago

Luckily we wont need random access. We'll just need enc_to_utf32() and utf32_to_enc(). Any and all processing of strings will be in UTF-32.

jfoug commented 9 years ago

But each line 'is' random access within the file IF we look at it as only that line. For the escaped types, I am not sure we can properly read them at all. The first thing in the file may be an escape that says is a 'shift-sticky' that sets the mode of the next characters (up to the point of another escape sequence) to be in one of the selectable code pages. But if the converter simply gets 'a' next line from the file it will NOT know that the code is in this sticky set code page vs being in the 'starting point' code page. ios-2022-* work in this way. We may simply not be able to utilize them, 'unless' we convert the entire file from start to finish into utf8. And in that point, it should be done by iconv and not within john.

But other MBCS can be used (such as shift-jis). In those languages, there are no 'state setting' escapes. Only escapes 'similar' to utf8 which list to read more bytes for this character. Easy stuff to do.

Like I said, I am still researching. Up to this point, I had only ever worked with single page code pages (i.e. CP's that only have 256 values).

This is a pretty decent intro:

https://en.wikipedia.org/wiki/ISO/IEC_2022

I think we may also be able to do the EUC encodings. They are just 1 byte, 2 byte and 3 byte 'pages' (yes, I can do them, and have code to get the data from Perl).

http://www.sljfaq.org/afaq/encodings.html

I know we can do the JIS family. The only one I think we may not be able to handle, is the escape state settings within ISO-2022. But ISO-2022 also seems to behave 'like' JIS (at least in perl), so whether the escapes are used in real world or not is hard to say. It appears that this code page is only used by a few older email packages which were limited to 7 bit. that is what the escapes were used for. So that a byte 'pair' would not have to be output to get a 2 byte (or 3 byte) value. You could use a escape on or escape next character sequence, and then know that the next byte is the '2nd' byte of a code page (or something like that).

magnumripper commented 9 years ago

We may simply not be able to utilize them, 'unless' we convert the entire file from start to finish into utf8. And in that point, it should be done by iconv and not within john.

Definitely.

jfoug commented 9 years ago

Well it does look like support should be easy to add (cptoutf8 and utf8tocp only), for all CP's except the iso-2022 family. Even EPCDIC is a easy to do thing, except for the fact, that jtr would not be able to read the files, \n is not at offset 0x0A, so fgetl() will and the wordlist.c chopping code will not work.

magnumripper commented 9 years ago

IMHO we should concentrate on other things than supporting new odd encodings no-one asked for. Supporting EBCDIC would be ridiculous.

magnumripper commented 9 years ago

OTOH supporting most ISO-8859 encodings and Windows codepages would be fair enough.

jfoug commented 9 years ago

I have not started to code anything. I am just researching, and building perl scripts to generate tables. Yes our 'original' tables could have been used, but I am redoing them, and will simply get all data we want in a single swoop. The tables can be much less complex now, since we do not give a f\/(# about upcasing, distinguishing numbers, etc within a code page any more.

But we WILL have to care about what is upcase/lowcase/digit within rules.

That brings up a really interesting question. How will we make classes for things like u/l/s/d ?

magnumripper commented 9 years ago

I might start experimenting (with core UTF-32) in a week or so, earliest. It's a good thing we discuss and brainstorm a lot before actually doing anything.

magnumripper commented 9 years ago

I'm planning something like this, as a test (of performance and other things) next week:

Drop all plugin formats except NT (nt2)
Do only cracker.c and inc.c (et al) , plus all things that break because of the changes. Try to limit changes as much as possible to get a working incremental mode. Don't touch any other mode (disable them if needed).
Do a lot of testing!

magnumripper commented 8 years ago

I had a serious look at Markov mode today. I think that format can't just be "converted" to use int instead of char, because it has lots of arrays of 256 and 256*256 and even 256*256*256. When replacing 256 with Unicode's 0x10FFFF the numbers grow "a bit" out of hands...

I'm still hoping it's doable for Incremental.

magnumripper commented 8 years ago

OTOH maybe Markov mode could be redesigned to use "actual number of trained characters" so we often just end up with much lower numbers like 1412 * 1412 if we happen to have 1412 different code points in our training data. I think that's what Incremental does basically.

frank-dittrich commented 8 years ago

Yes, I think this needs to be done. Of course, the stats file format needs to be changed.

Ideally, the new implementation would still be able to use the old format stats file. Alternatively, the error message could mention a script which converts the old stats file into the new format.

magnumripper commented 8 years ago

It's not a lot of code. One viable alternative is to make a "copy" and a new mode, mkv32.c and so on, leaving the old mode unchanged (in terms of options it could be --markov32 or perhaps still just --markov but having --enc options pick which of them is used). This way we can compare them side-by-side for performance and other things.

Once that is working, we can consider merging them back to one (with compatibility) again if we want to (just a matter of parsing the old file format).

magnumripper commented 5 years ago

See #3510

openwall / john

Full UTF-32 support through-out Jumbo #1632