One problem is your misuse of iconv.
You can't convert a non-ASCII char to ASCII:
$ echo € | iconv -f utf8 -t ascii
iconv: (stdin):1:0: cannot convert
And you can't convert "from" ASCII when input actually isn't ASCII:
$ echo € | iconv -f ascii -t utf8
iconv: (stdin):1:0: cannot convert
What you can do, is convert to an encoding that can carry all characters in the input. Here, we send a UTF-8 encoded euro sign into iconv, converting it to CP1252. In that codepage the character is represented by 0x80:
$ echo € | hd
00000000 e2 82 ac 0a |....|
$ echo € | iconv -f utf8 -t cp1252 | hd
00000000 80 0a |..|
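The same conversion can be sketched in Rust; the choice of the encoding_rs crate here is my own illustration, not something john uses:

```rust
use encoding_rs::WINDOWS_1252;

fn main() {
    // The euro sign is three bytes in UTF-8 (e2 82 ac)...
    assert_eq!("€".as_bytes(), &[0xe2, 0x82, 0xac]);

    // ...but a single byte, 0x80, once encoded as CP1252.
    let (bytes, _, had_errors) = WINDOWS_1252.encode("€");
    assert!(!had_errors);
    assert_eq!(&bytes[..], &[0x80]);
}
```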
Disregarding your iconv problem, wordlist + rules can definitely produce words that aren't valid UTF-8. For example, if you send a UTF-8 euro sign through a rule that drops the last character (actually the last byte), the result is obviously an incomplete multibyte character.
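To make the byte-level effect concrete, here is a small Rust sketch (my illustration, not john code) of what such a rule does:

```rust
fn main() {
    let euro = "€".as_bytes(); // e2 82 ac

    // A rule that "drops the last character" actually drops the last
    // byte, leaving an incomplete multibyte sequence.
    let truncated = &euro[..euro.len() - 1];

    // The result no longer validates as UTF-8.
    assert!(std::str::from_utf8(truncated).is_err());
}
```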
One way to filter them out is to add -ext:filter_utf8 to your command line.
Mostly for testing, there is also the opposite of that filter: -ext:filter_non-utf8 will produce only invalid UTF-8:
$ ../run/john -stdout -w -ru -ext:filter_non-utf8 | head -4
Using default input encoding: UTF-8
Proceeding with wordlist:../run/password.lst, rules:Wordlist
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
Enabling duplicate candidate password suppressor
м?сяня
а?иночка
mu?eca
Your screen output may vary depending on how your terminal handles the problematic output. The terminal I used here apparently replaces broken characters with ?.
I see we have a minor bug in that filter: It treats an empty line as invalid UTF-8 (letting it through). Edit: no, it seems the filter isn't even called for empty lines. Is that a minor bug in external core code?
I just reviewed doc/ENCODINGS and I think it pretty much covers it.
Thanks magnum. My problem isn't iconv, it's that Rust's String validates UTF-8 and rejects invalid input.
If invalid UTF-8 is expected, I think there is nothing to do but close this.
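For the record, this is the behavior in question; a minimal sketch (my illustration) of Rust's String refusing bytes that aren't valid UTF-8:

```rust
fn main() {
    // The first two bytes of a UTF-8 euro sign, missing the final 0xac,
    // as a byte-dropping rule might produce.
    let candidate = vec![0xe2, 0x82];

    // String::from_utf8 validates its input, so invalid candidates
    // cannot be stored in a String without lossy conversion.
    match String::from_utf8(candidate) {
        Ok(s) => println!("valid: {s}"),
        Err(e) => println!("rejected: {e}"),
    }
}
```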
Try e.g. --internal-codepage=cp1252 (or DefaultInternalCodepage = CP1252 in john.conf). This should make the problem go away completely (but it has other drawbacks, such as suppressing words that don't fit in CP1252).
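As a rough illustration of that drawback, whether a word fits in CP1252 can be checked like this in Rust (the encoding_rs crate and the helper name are my assumptions, not part of john):

```rust
use encoding_rs::WINDOWS_1252;

// True if every character of `word` maps into CP1252; words failing
// this test are the ones an internal codepage of CP1252 would suppress.
fn fits_cp1252(word: &str) -> bool {
    let (_, _, had_errors) = WINDOWS_1252.encode(word);
    !had_errors
}

fn main() {
    assert!(fits_cp1252("€"));     // the euro sign exists in CP1252 (0x80)
    assert!(!fits_cp1252("мама")); // Cyrillic does not fit
}
```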
For the record, it works:
grep -P -n "[\x80-\xFF]" ~/bin/bleeding/run/password.lst > list.in
john --wordlist=list.in --internal-codepage=cp1252 --rules --stdout > utf.out
iconv -f utf8 -t utf8 utf.out -o final.txt
Well, I already solved my problem: I dropped String and use an array of bytes (u8, in fact) instead of chars. It is faster, and it is good to know that john rules operate on bytes, not on chars.
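For anyone hitting the same issue, a byte-oriented read loop along these lines avoids the validation entirely (a sketch under that approach, not the actual code from this thread):

```rust
use std::io::{self, BufRead};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut reader = stdin.lock();
    let mut line: Vec<u8> = Vec::new();

    // read_until works on raw bytes, so candidates that are not valid
    // UTF-8 pass through unchanged instead of being rejected.
    while reader.read_until(b'\n', &mut line)? > 0 {
        if line.ends_with(b"\n") {
            line.pop();
        }
        // ... process `line` as &[u8] here ...
        line.clear();
    }
    Ok(())
}
```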
Well, someone who understands UTF-8 is needed to figure out whether this is a real bug. The steps above can be used to reproduce it.