problems with non-ascii characters.

frank-dittrich commented 8 years ago

What am I missing?

$ echo -n -e 'a\nä\n€\n' > test.txt
$ wc test.txt
3 3 9 test.txt
$ file test.txt 
test.txt: UTF-8 Unicode text
$ rexgen -u8 -f test.txt '\0'
a
��菉
��מּ

frank-dittrich commented 8 years ago

And with the same input file:

$ rexgen -f test.txt '\0'
a
??
???

frank-dittrich commented 8 years ago

With the debug output to strerr re-enabled (see issue #25), I get:

$ rexgen -f test.txt '\0'
APPEND
0061
a
APPEND
ffc3
ffa4
??
APPEND
ffe2
ff82
ffac
???
$ rexgen -u8 -f test.txt '\0'
APPEND
0061
a
APPEND
ffc3
ffa4
��菉
APPEND
ffe2
ff82
ffac
��מּ

Looks like for non-ascii characters, instead of a conversion to utf-16, each individual byte of that character is preceded by 0xff.

frank-dittrich commented 8 years ago

Still with that same file, the following command results in a segfault:

$ rexgen -f test.txt '(?i:\0)' 
APPEND
0061
a
A
APPEND
0061
a
A
APPEND
ffc3
ffa4
??
APPEND
ffe2
ff82
ffac
Segmentation fault (core dumped)

magnumripper commented 8 years ago

Looks like for non-ascii characters, instead of a conversion to utf-16, each individual byte of that character is preceded by 0xff.

It's just a sign extension issue (or several of them). Here's a fix (partial or not)

diff --git a/src/librexgen/string/simplestring.cpp b/src/librexgen/string/simplestring.cpp
index 9c52188..d88fc42 100644
--- a/src/librexgen/string/simplestring.cpp
+++ b/src/librexgen/string/simplestring.cpp
@@ -58,7 +58,7 @@ void SimpleString::append(const char* ch, size_t ch_len) {

        /* convert and copy characters */
        for (size_t idx=0; idx<ch_len; ++idx) {
-               characters[length++] = ch[idx];
+               characters[length++] = (unsigned char)ch[idx];

     fprintf(stderr, "%04x\n", characters[length-1].codepoint);
        }

This looks a bit better with rexgen -u8 but more is needed.

$ echo ä | rexgen -f - -u8 '\0[a-c]' 
APPEND
00c3
00a4
?ä?a
APPEND
00c3
00a4
?ä?b
APPEND
00c3
00a4
?ä?c

For some reason it looks completely right without -u8... do I misunderstand that option?

$ echo ä | rexgen -f - '\0[a-c]' 
APPEND
00c3
00a4
äa
APPEND
00c3
00a4
äb
APPEND
00c3
00a4
äc

janstarke commented 8 years ago

The problem is greater than just using a type cast. Internally, I use uint16_t for storing unicode codepoints, but you cannot convert any char to a codepoint (especially those which are not ascii, such as ä). Unfortunately, this is excatly what rexgen is doing right now. I just reviewed the code and identified the following TODOs for me:

define a custom type for codepoints, so that the compiler warns if one tries to cast between char and codepoint
implement a function which can convert a char into a codepoint, depending on the locale used (Hopefully someone did this before and I can use a library function)
Remove the dangerous conversion constructor uchar_t(uint16_t) and replace it by a factory method, so that the creation of a uchar_t out of a codepoint must be requested explicitly

Hmm... Maybe it's a good decision to combine 2. and 3. ... I do not fiddle around with codepoints, but only with uchar_t. So my conversion function should create a u_char_t and I will not need the weird constructor anymore

I don't have my Stevens-book at home, but in the Office, so this will take some days. Regards, Jan

frank-dittrich commented 8 years ago

This might help: http://utfcpp.sourceforge.net/ I think just getting utf-8 handling right is much more important than the larger task of accepting all kinds of encodings.

Supporting various other encodings will cause many more problems.

E.g., in iso-latin1 (ISO_8859-1), 0xff represents character ÿ (lower case y with diaresis). You can convert that into a unicode code point and toggle the case. But iso-latin1 (ISO_8859-1) does not contain the corresponding upper case character.

magnumripper commented 8 years ago

implement a function which can convert a char into a codepoint, depending on the locale used (Hopefully someone did this before and I can use a library function)

That's exactly what I thought you used libicu for! What DO you use it for?

magnumripper commented 8 years ago

depending on the locale used

On a (sort of) side-note, please note that when JtR uses librexgen, we often want some other locale than what the system uses. Furthermore, we don't (and can't) use setlocale(3) for it. Typically, the system & terminal are UTF-8 but regardless of that we also have (long story short) the concept of a target encoding.

On the other hand, we can probably use the rexgen stuff in UTF-8 "mode" regardless, and do any conversions ourselves. That's what we typically do in all other modes.

janstarke commented 8 years ago

I use libicu for conversion between upper case and lower case; at the moment I'm working on a replacement for this lib (=my own implementation of case conversion), because I don't like the dependency (the same applies to other libraries). ICU is great, but way too much for rexgen. I only need case conversion.

Internally, rexgen works with codepoints, which speeds up character handling and makes conversion to any encoding possible and simple.

BTW, Unicode output works now. Case conversion does not yet work and the segfault is still there, so the issues stays open for now...

magnumripper commented 8 years ago

Then I guess you can simply use mbtowc or mbstowcs (perhaps suffixed with _l) and the likes for conversion.

magnumripper commented 8 years ago

This issue (as described) is fixed as of 1.4.0 (without -u8 which is no longer an option)

$ rexgen -f test.txt '\0'
a
ä
€

janstarke commented 7 years ago

Dear all, I'm was not happy with my current multibyte approach just fiddling around with generic character set support (not only utf-8) and have committed the first approach in feature/utf8:

» rexgen '(?i:äbcß見る)'
äbcß見る
Äbcß見る
ÄBcß見る
äBcß見る
äBCß見る
ÄBCß見る
ÄbCß見る
äbCß見る

I know there are still some minor problem, but at least character encoding seems to be working correctly. Could you give it a try? I'd like to close this issue and be sure it is really resolved.

Regards, Jan

magnumripper commented 7 years ago

Did the interfaces change again? It breaks JtR. I didn't really look into it yet but here's the output:

regex.c: In function 'do_regex_hybrid_crack':
regex.c:221:33: warning: passing argument 2 of 'c_regex_cb' from incompatible pointer type [-Wincompatible-pointer-types]
   regex_ptr = c_regex_cb(regex, callback);
                                 ^~~~~~~~
In file included from regex.h:35:0,
                 from regex.c:12:
/usr/local/include/librexgen/c/librexgen.h:39:13: note: expected 'callback_fp {aka long unsigned int (*)(char *, const long unsigned int)}' but argument is of type 'size_t (*)(wchar_t *, const size_t) {aka long unsigned int (*)(int *, const long unsigned int)}'
 c_regex_ptr c_regex_cb(
             ^~~~~~~~~~
regex.c:290:3: error: too many arguments to function 'c_simplestring_to_utf8_string'
   c_simplestring_to_utf8_string(buffer, &word[0], sizeof(word));
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/local/include/librexgen/c/iterator.h:27:0,
                 from /usr/local/include/librexgen/c/librexgen.h:24,
                 from regex.h:35,
                 from regex.c:12:
/usr/local/include/librexgen/c/simplestring.h:43:13: note: declared here
 const char* c_simplestring_to_utf8_string(c_simplestring_ptr s);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex.c: In function 'do_regex_crack':
regex.c:333:32: warning: passing argument 2 of 'c_regex_cb' from incompatible pointer type [-Wincompatible-pointer-types]
  regex_ptr = c_regex_cb(regex, callback);
                                ^~~~~~~~
In file included from regex.h:35:0,
                 from regex.c:12:
/usr/local/include/librexgen/c/librexgen.h:39:13: note: expected 'callback_fp {aka long unsigned int (*)(char *, const long unsigned int)}' but argument is of type 'size_t (*)(wchar_t *, const size_t) {aka long unsigned int (*)(int *, const long unsigned int)}'
 c_regex_ptr c_regex_cb(
             ^~~~~~~~~~
regex.c:350:3: error: too many arguments to function 'c_simplestring_to_utf8_string'
   c_simplestring_to_utf8_string(buffer, &word[0], sizeof(word));
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/local/include/librexgen/c/iterator.h:27:0,
                 from /usr/local/include/librexgen/c/librexgen.h:24,
                 from regex.h:35,
                 from regex.c:12:
/usr/local/include/librexgen/c/simplestring.h:43:13: note: declared here
 const char* c_simplestring_to_utf8_string(c_simplestring_ptr s);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
make[1]: *** [regex.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make: *** [default] Error 2

janstarke commented 7 years ago

Yes, it probably will. I'm currently working on a patch in https://github.com/teeshop/JohnTheRipper

I changed the internal string representation to use std::string, which allows access direct read access to the binary string using std::string::c_str(). Therefore there is no need to copy the whole buffer every time. Additionally, rexgen will be working with any multibyte character set internally. E.g. we can use the internal CP from john without character set conversion.

However, to keep things easy, we could provide c_simplestring_to_utf8_string() as (slow) wrapper, which copies the resulting string into a external (john-internal) buffer and converts it into UTF-8. I see no problem with it; But I would add a #warning that this function is deprecated.

As I changed a lot of the internal working of rexgen, the new interface is not finished. For example, I see the need of being able to limit the resulting string length (instead of creating the full string and truncating it afterwards, as it is done currently). When I use c_str(), the result is a const char*, so there is no write access to the value; and I must provide all required functionality as part of rexgen's interface.

janstarke commented 7 years ago

Hi Magnum,

I did it...

I added c_simplestring_to_utf8_string back to rexgen and john compiles cleanly with it. Also, john seems to be running fine.

Regards, Jan

magnumripper commented 7 years ago

Cool, I'll try it out

magnumripper commented 7 years ago

Here's latest bleeding-jumbo w/ latest rexgen master:

regex.c: In function 'do_regex_hybrid_crack':
regex.c:221:33: warning: passing argument 2 of 'c_regex_cb' from incompatible pointer type [-Wincompatible-pointer-types]
   regex_ptr = c_regex_cb(regex, callback);
                                 ^~~~~~~~
In file included from regex.h:35:0,
                 from regex.c:12:
/usr/local/include/librexgen/c/librexgen.h:39:13: note: expected 'callback_fp {aka long unsigned int (*)(char *, const long unsigned int)}' but argument is of type 'size_t (*)(wchar_t *, const size_t) {aka long unsigned int (*)(int *, const long unsigned int)}'
 c_regex_ptr c_regex_cb(
             ^~~~~~~~~~
regex.c: In function 'do_regex_crack':
regex.c:333:32: warning: passing argument 2 of 'c_regex_cb' from incompatible pointer type [-Wincompatible-pointer-types]
  regex_ptr = c_regex_cb(regex, callback);
                                ^~~~~~~~
In file included from regex.h:35:0,
                 from regex.c:12:
/usr/local/include/librexgen/c/librexgen.h:39:13: note: expected 'callback_fp {aka long unsigned int (*)(char *, const long unsigned int)}' but argument is of type 'size_t (*)(wchar_t *, const size_t) {aka long unsigned int (*)(int *, const long unsigned int)}'
 c_regex_ptr c_regex_cb(
             ^~~~~~~~~~
ar: creating archive aes.a

Make process completed.

janstarke commented 7 years ago

OMG, I forgot to wrap that function :-O I fixed this, now bleeding jumbo and rexgen master compile cleanly on my machine. But there is a lot of string conversion going on. We need to get rid of it. On my way...

magnumripper commented 7 years ago

2.0.1 builds fine here too. Yes, we should try to get rid of conversions back and forth!

janstarke commented 7 years ago

I created a PoC for this issue and committed it as feature/selectlocale. Can you please take a look at it to see if this is useful for you?

A hint about the usage can be found in rexgen.c, lines 204-218.

Unfortunately, JtR uses non-standard constants to declare encodings, which I cannot use (without creating circular dependencies between john and rexgen). Therefore, I'm simply relying on strings, such as "en_US.UTF-8" which you also can use.

magnumripper commented 7 years ago

I'll try to get some time for this real soon. I also need to look into your PR for JtR now that (I think) rexgen compiles fine on both OSX and Linux.

janstarke commented 7 years ago

This is not time-critical. I'm on the road at the moment and it's also hard for me to find some free slots. We'll get this flying someday, anyway ;-)

Thank you for the support, so far.

If I find some minutes I'll try to adapt john to the selectlocale-API to see if this makes things easier.

Regards, Jan

janstarke / rexgen

problems with non-ascii characters. #26