One problem is your misuse of iconv.
You can't convert a non-ASCII char to ASCII:
$ echo € | iconv -f utf8 -t ascii
iconv: (stdin):1:0: cannot convert
And you can't convert "from" ASCII when input actually isn't ASCII:
$ echo € | iconv -f ascii -t utf8
iconv: (stdin):1:0: cannot convert
What you can do, is convert to an encoding that can carry all characters in the input. Here, we send a UTF-8 encoded euro sign into iconv, converting it to CP1252. In that codepage the character is represented by 0x80:
$ echo € | hd
00000000 e2 82 ac 0a |....|
$ echo € | iconv -f utf8 -t cp1252 | hd
00000000 80 0a |..|
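The same conversion can be sketched in Rust; the choice of the encoding_rs crate here is my own illustration, not something john uses:

```rust
use encoding_rs::WINDOWS_1252;

fn main() {
    // The euro sign is three bytes in UTF-8 (e2 82 ac)...
    assert_eq!("€".as_bytes(), &[0xe2, 0x82, 0xac]);

    // ...but a single byte, 0x80, once encoded as CP1252.
    let (bytes, _, had_errors) = WINDOWS_1252.encode("€");
    assert!(!had_errors);
    assert_eq!(&bytes[..], &[0x80]);
}
```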
Disregarding your iconv problem, wordlist + rules can definitely produce words that aren't valid UTF-8. For example, if you send a UTF-8 euro sign through a rule that drops the last character (actually the last byte), the result is obviously an incomplete multibyte character.
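To make the byte-level effect concrete, here is a small Rust sketch (my illustration, not john code) of what such a rule does:

```rust
fn main() {
    let euro = "€".as_bytes(); // e2 82 ac

    // A rule that "drops the last character" actually drops the last
    // byte, leaving an incomplete multibyte sequence.
    let truncated = &euro[..euro.len() - 1];

    // The result no longer validates as UTF-8.
    assert!(std::str::from_utf8(truncated).is_err());
}
```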
One way to filter them out is to add -ext:filter_utf8 to your command line.
Mostly for testing, there is also the opposite of that filter: -ext:filter_non-utf8 will produce only invalid UTF-8:
$ ../run/john -stdout -w -ru -ext:filter_non-utf8 | head -4
Using default input encoding: UTF-8
Proceeding with wordlist:../run/password.lst, rules:Wordlist
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
Enabling duplicate candidate password suppressor
м?сяня
а?иночка
mu?eca
Your screen output may vary depending on how your terminal handles the problematic output. The terminal I used here apparently replaces broken characters with ?.
I see we have a minor bug in that filter: It treats an empty line as invalid UTF-8 (letting it through). Edit: no, it seems the filter isn't even called for empty lines. Is that a minor bug in external core code?
I just reviewed doc/ENCODINGS and I think it pretty much covers it.
Thanks magnum. My problem isn't iconv, it's that Rust's String validates UTF-8 and rejects invalid input.
If invalid UTF-8 is expected, I think there is nothing to do but close this.
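For the record, this is the behavior in question; a minimal sketch (my illustration) of Rust's String refusing bytes that aren't valid UTF-8:

```rust
fn main() {
    // The first two bytes of a UTF-8 euro sign, missing the final 0xac,
    // as a byte-dropping rule might produce.
    let candidate = vec![0xe2, 0x82];

    // String::from_utf8 validates its input, so invalid candidates
    // cannot be stored in a String without lossy conversion.
    match String::from_utf8(candidate) {
        Ok(s) => println!("valid: {s}"),
        Err(e) => println!("rejected: {e}"),
    }
}
```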
Try e.g. --internal-codepage=cp1252 (or DefaultInternalCodepage = CP1252 in john.conf). This should make the problem go away completely (but it has other drawbacks, such as suppressing words that don't fit in CP1252).
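As a rough illustration of that drawback, whether a word fits in CP1252 can be checked like this in Rust (the encoding_rs crate and the helper name are my assumptions, not part of john):

```rust
use encoding_rs::WINDOWS_1252;

// True if every character of `word` maps into CP1252; words failing
// this test are the ones an internal codepage of CP1252 would suppress.
fn fits_cp1252(word: &str) -> bool {
    let (_, _, had_errors) = WINDOWS_1252.encode(word);
    !had_errors
}

fn main() {
    assert!(fits_cp1252("€"));     // the euro sign exists in CP1252 (0x80)
    assert!(!fits_cp1252("мама")); // Cyrillic does not fit
}
```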
For the record, it works:
grep -P -n "[\x80-\xFF]" ~/bin/bleeding/run/password.lst > list.in
john --wordlist=list.in --internal-codepage=cp1252 --rules --stdout > utf.out
iconv -f utf8 -t utf8 utf.out -o final.txt
Well, I already solved my problem: I dropped String and use an array of bytes (u8, in fact) instead of chars. It is faster, and it is good to know that john rules operate on bytes, not on chars.
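For anyone hitting the same issue, a byte-oriented read loop along these lines avoids the validation entirely (a sketch under that approach, not the actual code from this thread):

```rust
use std::io::{self, BufRead};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut reader = stdin.lock();
    let mut line: Vec<u8> = Vec::new();

    // read_until works on raw bytes, so candidates that are not valid
    // UTF-8 pass through unchanged instead of being rejected.
    while reader.read_until(b'\n', &mut line)? > 0 {
        if line.ends_with(b"\n") {
            line.pop();
        }
        // ... process `line` as &[u8] here ...
        line.clear();
    }
    Ok(())
}
```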
Well, someone who understands UTF-8 is needed to figure out whether this is a real bug. The steps above can be used to reproduce it.