Unicode: fix reading buffer overflow in valid_utf8()

AlekseyCherepanov commented 3 weeks ago

valid_utf8() reads up to 2 bytes far behind the end of string because it checks bytes right-to-left. I switched order to check left-to-right. It means that inner switch should be split. Also I moved around a few ifs: < 0xC2 is checked before main switch, so the switch is for almost valid starting bytes, > 0xF4 was moved into case 4 now because other cases did not need it (in both versions).

diff looks terrible. I guess review of full code would be easier.

To validate my changes, I wrote a script in python to call the function. To call the function, I compiled unicode.c alone as a dynamic library and used python's ctypes. fake.c is needed to resolve unused symbols belonging to other files in john.

The script:

# code to structure allowed utf-8 chars, test john

from ctypes import *

'''
int options;
int real_error;
int john_MD4_Final;
int john_MD4_Update;
int john_MD4_Init;
''' # ^ contents of fake.c

# gcc -fPIC -shared -Wl,-no-undefined john/src/unicode.c fake.c -o u.so
john_valid_utf8 = CDLL("./u.so").valid_utf8
john_valid_utf8.restype = c_int
john_valid_utf8.argtypes = [ c_char_p ]

def check_john(r, b):
    if r == False:
        expected = 0
    elif len(b) == 1:
        expected = 1 # pure ascii
    else:
        expected = 2 # some non-ascii, we have only 1
    j = john_valid_utf8(b)
    assert j == expected, (j, expected, r, b)

# def check_john(r, b):
#     pass

def R(a, b):
    return list(range(a, b))

# Rough ranges
ascii    = R(0x0, 0x80)
trailing = R(0x80, 0xc0)
start2   = R(0xc0, 0xe0)
start3   = R(0xe0, 0xf0)
start4   = R(0xf0, 0x100)

not_trailing = ascii + start2 + start3 + start4
all_bytes = R(0x0, 0x100)

def is_valid(*args):
    b = bytes(args)
    r = None
    try:
        b.decode('utf-8')
        r = True
    except UnicodeDecodeError:
        r = False
    check_john(r, b)
    return r

def combine(args):
    r = [ 0, 0 ] # False and True are 0 and 1
    if len(args) == 1:
        for a in args[0]:
            r[is_valid(a)] += 1

    elif len(args) == 2:
        for a in args[0]:
            for b in args[1]:
                r[is_valid(a, b)] += 1

    elif len(args) == 3:
        for a in args[0]:
            for b in args[1]:
                for c in args[2]:
                    r[is_valid(a, b, c)] += 1

    elif len(args) == 4:
        for a in args[0]:
            for b in args[1]:
                for c in args[2]:
                    for d in args[3]:
                        r[is_valid(a, b, c, d)] += 1

    else:
        assert 0, 'wrong number of args'
    return r

def expect(count_false_and_true, *args):
    assert not (count_false_and_true[0] == -1 and count_false_and_true[1] == -1)
    for i in 0, 1:
        if count_false_and_true[i] == all:
            t = 1
            for a in args:
                t *= len(a)
            count_false_and_true[i] = t
    r = combine(args)
    assert r == count_false_and_true, (r, count_false_and_true)
    return r

# 1 byte
expect([ 0, len(ascii) ], ascii)
expect([ len(trailing), 0 ], trailing)
expect([ len(start2), 0 ], start2)
expect([ len(start3), 0 ], start3)
expect([ len(start4), 0 ], start4)

# 2 bytes starting with trailing
expect([ len(trailing) * len(all_bytes), 0 ], trailing, all_bytes)

# start2
# c0 and c1 are not in the range
expect([ 2 * len(trailing), 0 ], [ 0xc0, 0xc1 ], trailing)
expect([ 2 * len(trailing), (len(start2) - 2) * len(trailing) ], start2, trailing)

expect([ len(start2) * len(not_trailing), 0 ], start2, not_trailing)
expect([ len(start2) * len(trailing) ** 2, 0 ],
       start2, trailing, trailing)

# start3

# E0 80..9F X and ED A0..BF X are invalid
expect([ all, 0 ], [ 0xE0 ], R(0x80, 0xA0), all_bytes)
expect([ all, 0 ], [ 0xED ], R(0xA0, 0xC0), all_bytes)
r1 = expect([ all, 0 ], [ 0xE0 ], R(0x80, 0xA0), trailing)
r2 = expect([ all, 0 ], [ 0xED ], R(0xA0, 0xC0), trailing)
expect([ r1[0] + r2[0], len(start3) * len(trailing) ** 2 - r1[0] - r2[0] ],
       start3, trailing, trailing)

# start4

# F0 80..8F X Y and F4 90..BF X Y are invalid

expect([ 0, all ], [ 0xF0 ], R(0x90, 0xC0), trailing, trailing)
expect([ all, 0 ], [ 0xF0 ], R(0x80, 0x90), all_bytes, all_bytes)
r1 = expect([ all, 0 ], [ 0xF0 ], R(0x80, 0x90), trailing, trailing)

expect([ 0, all ], [ 0xF4 ], R(0x80, 0x90), trailing, trailing)
expect([ all, 0 ], [ 0xF4 ], R(0x90, 0xC0), all_bytes, all_bytes)
r2 = expect([ all, 0 ], [ 0xF4 ], R(0x90, 0xC0), trailing, trailing)

# F5..FF X Y Z are invalid
r3 = expect([ all, 0 ], R(0xF5, 0x100), trailing, trailing, trailing)

expect([ r1[0] + r2[0] + r3[0], len(start4) * len(trailing) ** 3 - r1[0] - r2[0] - r3[0] ],
       start4, trailing, trailing, trailing)

# expect([ len(start3) * len(all_bytes) * len(not_trailing), 0 ],
#        start3, all_bytes, not_trailing)
# expect([ len(start4) * len(all_bytes) ** 2 * len(not_trailing), 0 ],
#        start4, all_bytes, all_bytes, not_trailing)

AlekseyCherepanov commented 3 weeks ago

Minor note: gcc 12 expanded switch for E0,ED,F0,F4 into series of checks, so splitting it into 2 switch statements does not increase size. Actually size in disassembly became smaller because case 1 is not reachable anymore (previously invalid trailing bytes at the start could land into this branch, but I moved their check out from the switch; I count size in disassembly this way: objdump -d unicode.o | sed -ne '/<valid_utf8>:/,/^$/ p' | wc -l).

valid_utf8() is used in multiple places including loader.c. Let's overread line_buf:

$ perl -C0 -e 'print ":12345" x 170 . "a\xf1"' > t.pw && ../run/john t.pw
=================================================================
==1583178==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffc92e219f0 at pc 0x55c00d0dc209 bp 0x7ffc92e21480 sp 0x7ffc92e21478
READ of size 1 at 0x7ffc92e219f0 thread T0
    #0 0x55c00d0dc208 in valid_utf8 .../src/unicode.c:554
    #1 0x55c00d0647b2 in read_file .../src/loader.c:255
    #2 0x55c00d06c368 in ldr_load_pw_file .../src/loader.c:1189
    #3 0x55c00d05a667 in john_load .../src/john.c:1149
    #4 0x55c00d05a667 in john_init .../src/john.c:1598
    #5 0x55c00d05a667 in main .../src/john.c:2084
    #6 0x7fdf45646249 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #7 0x7fdf45646304 in __libc_start_main_impl ../csu/libc-start.c:360
    #8 0x55c00cc0f1f0 in _start (.../run/john+0x1ff1f0)

Address 0x7ffc92e219f0 is located in stack of thread T0 at offset 1280 in frame
    #0 0x55c00d063f1b in read_file .../src/loader.c:209

  This frame has 2 object(s):
    [48, 192) 'file_stat' (line 210)
    [256, 1280) 'line_buf' (line 212) <== Memory access at offset 1280 overflows this variable
[...]

solardiz commented 2 weeks ago

Wow, that's an extensive test script you wrote. Maybe it should be made part of our unit tests invoked on make check, but only when Python is available?

valid_utf8() is used in multiple places including loader.c. Let's overread line_buf: $ perl -C0 -e 'print ":12345" x 170 . "a\xf1"' > t.pw && ../run/john t.pw

I guess you mean with code from prior to this PR's changes, to reproduce the original bug?

AlekseyCherepanov commented 2 weeks ago

Yes, ASAN gets triggered prior to the patch. And it does not fire after the patch.

AlekseyCherepanov commented 2 weeks ago

The script is slow. It takes 70 seconds using PyPy3 and 57 seconds after compilation with cython3 (without changes to the script: cython3 --embed -3 test_uni2.py && gcc -I/usr/include/python3.11/ test_uni2.c -lpython3.11). I rewrote it in C to make it faster and more structured. It takes 3.7 seconds (I use sparse checks for some ranges; 13 seconds without them, but I still check more than the script). I'll add a commit tomorrow.

solardiz commented 2 weeks ago

It takes 3.7 seconds (I use sparse checks for some ranges; 13 seconds without them, but I still check more than the script). I'll add a commit tomorrow.

I think try even sparser checks for under 1 second. Thanks!

AlekseyCherepanov commented 2 weeks ago

Nice, CI found problems with symbols referenced in unicode.o I did not have locally. I can reproduce them: --without-openssl is required for ./configure for that. I'll fix it.

Meanwhile a few words about the commit:

I did not feel comfortable inserting a lot of code into unit-tests.c. But I needed functions to integrate with the report. So I included my file using #include. (Afterthought: _test_valid_utf8 could use a local counter for tests and return it, so it would be ok to compile it separately. OTOH I also use hex() from unit-tests.c.)

In regular make and Makefile.legacy, I compile with actual unicode.o instead of sources. Makefile.legacy required -lcrypto for MD4_Init and friends. I picked it from command in the regular build.

I think memory.c and common.c do not need separate compilation. _JOHN_MISC_NO_LOG is internal to misc.c. So there should not be actual deviations from regular compilation for these files and .o could be reused.

BTW I have idea that switch could be replaced with explicit checks with the same "fall-through" style. Then the code would not need the lookup having the same number of checks in assembly.

ASAN build:

  test 38 test_valid_utf8         -  Performed 49022799 tests 0.76 seconds used

In regular build, the test takes under 0.4 seconds.

So I have 2 more changes in mind:

reuse .o for memory.c and common.c
replace lookup+switch with ifs on values

AlekseyCherepanov commented 2 weeks ago

In regular make and Makefile.legacy, I compile with actual unicode.o instead of sources. Makefile.legacy required -lcrypto for MD4_Init and friends. I picked it from command in the regular build.

This part was wrong. Now I compile tests/unicode.o separately defining UNICODE_NO_OPTIONS and NOT_JOHN, so it does not use options and MD4_* and I don't need -lcrypto and/or fake symbols.

Guarding with #ifndef UNICODE_NO_OPTIONS was slightly incomplete. So one commit is dedicated to the fix.

AlekseyCherepanov commented 2 weeks ago

I use unicode.o as a dependency for tests/unicode.o as a shortcut to reuse all dependencies including unused options.h. I considered it to be a better solution than copy-pasting dependencies with or without modifications. unicode.o should be built anyway in usual places where unit-tests gets build.

Other way could be to add tests/unicode.o as target near unicode.o. Like this:

unicode.o tests/unicode.o: [...]

AlekseyCherepanov commented 2 weeks ago

And the last commit is to remove lookup from valid_utf8() rewriting switch with explicit ifs. The code has similar structure with "fall-through" semantics. (And it looks strange if I try to forget that it is a switch by nature.) Now default case falls into 4+ bytes and then gets filtered out in > 0xF4 check.

I don't insist on this change.

But I have an idea for even more obscure modification: switch for E0/ED could be replaced with the following bit tricks.

if (*source == (0xE0 | (0xED & -(a > 0x9F)))) return 0;

As of E0 | ED gives ED, it is possible to OR values to get ED if the check gives true, and E0 otherwise. Fortunately the check for E0 complements the check for ED. So we don't need to check twice. (To be honest, I cannot formulate a proof easily despite that I inferred this trick myself and checked against the tests.) Just replacing constants, similar check can be written for F0/F4.

Simpler bit trick is ((a = *++srcptr) & 0xC0) == 0x80 to check for trailing bytes. & 0xC0 picks 2 highest bits, == 0x80 checks for 10 there. It might reduce number of branching instructions.

Some benchmark is needed to consider the tricks. Also I am not sure how it will play with future compiler optimizations/changes. So I am not going to commit them if they are not interesting/needed. Are they worth a benchmark?

solardiz commented 2 weeks ago

@AlekseyCherepanov Is this ready to merge? @magnumripper Would you like to take another look before we merge?

AlekseyCherepanov commented 2 weeks ago

It is ready to be merged. I'd like to hear from @magnumripper about my changes to unicode.h and readability of the final code with ifs. I don't have any understanding if my change to unicode.h is valid in relation to other code backing encodings. The change is based on local context only.

AlekseyCherepanov commented 1 week ago

Ouch, I intended to remove the explanations from test_valid_utf8.c but I did not include it into the commit.

I guess unicode.h might be the best place to keep the info. It follows wider pattern that things are explained near interface for caller. Initially I did not think about it reading code directly in unicode.c. But I had to read the comments in unicode.h to understand return value for use in the test script.

AlekseyCherepanov commented 1 week ago

I moved the comments to unicode.h. Also I added a bit of elaboration for return value. And I added a link about noncharecters (because I spent unexpected amount of time reading Wikipedia semi-randomly (e.g. about PUA) and had to find missing pieces to match my incomplete understanding with Python's behaviour).

We could rewrite commits to squash initial misplacement of comments and incorrect changes to Makefile.in together with respective fixes into nicer commits. Should I do that?

If there are no objections then this PR can be merged.

solardiz commented 1 week ago

We could rewrite commits to squash initial misplacement of comments and incorrect changes to Makefile.in together with respective fixes into nicer commits. Should I do that?

Yes, if you like. You may also fix the typo in noncharecters. Thank you!

@magnumripper Please feel free to merge this when you feel it's ready, or let me know and I merge it.

magnumripper commented 1 week ago

I'm fine with you merging it, squashed or not.

AlekseyCherepanov commented 6 days ago

typo in noncharecters

Oh, nice catch! Thanks!

I fixed the typo. I merged fixes into respective commits.

After pushing I noticed that I wrote too long message for one commit, so I rewrote it and pushed again. Then I rebased the first commit without changes, so all six commits would be listed in a row. (I am puzzled about the rebase: git log shows old date, github shows that change is recent and I don't see the date. Local hash and github's hash are the same.) I guess the last step was not needed because list of commits in a pull request can be viewed in the "commits" tab. Rebasing I could use the latest john, but I did not.

I finished my changes here.

solardiz commented 6 days ago

I am puzzled about the rebase: git log shows old date, github shows that change is recent

Each commit has two dates on it, see git log --pretty=fuller. If you want to update both when you amend a commit, use --reset-author. You don't need to do that here - I merely include this explanation because you wondered.

openwall / john

Unicode: fix reading buffer overflow in valid_utf8() #5531