Closed AlekseyCherepanov closed 2 months ago
Minor note: gcc
12 expanded switch
for E0,ED,F0,F4 into series of checks, so splitting it into 2 switch
statements does not increase size. Actually size in disassembly became smaller because case 1
is not reachable anymore (previously invalid trailing bytes at the start could land into this branch, but I moved their check out from the switch
; I count size in disassembly this way: objdump -d unicode.o | sed -ne '/<valid_utf8>:/,/^$/ p' | wc -l
).
valid_utf8()
is used in multiple places including loader.c
. Let's overread line_buf
:
$ perl -C0 -e 'print ":12345" x 170 . "a\xf1"' > t.pw && ../run/john t.pw
=================================================================
==1583178==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffc92e219f0 at pc 0x55c00d0dc209 bp 0x7ffc92e21480 sp 0x7ffc92e21478
READ of size 1 at 0x7ffc92e219f0 thread T0
#0 0x55c00d0dc208 in valid_utf8 .../src/unicode.c:554
#1 0x55c00d0647b2 in read_file .../src/loader.c:255
#2 0x55c00d06c368 in ldr_load_pw_file .../src/loader.c:1189
#3 0x55c00d05a667 in john_load .../src/john.c:1149
#4 0x55c00d05a667 in john_init .../src/john.c:1598
#5 0x55c00d05a667 in main .../src/john.c:2084
#6 0x7fdf45646249 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
#7 0x7fdf45646304 in __libc_start_main_impl ../csu/libc-start.c:360
#8 0x55c00cc0f1f0 in _start (.../run/john+0x1ff1f0)
Address 0x7ffc92e219f0 is located in stack of thread T0 at offset 1280 in frame
#0 0x55c00d063f1b in read_file .../src/loader.c:209
This frame has 2 object(s):
[48, 192) 'file_stat' (line 210)
[256, 1280) 'line_buf' (line 212) <== Memory access at offset 1280 overflows this variable
[...]
Wow, that's an extensive test script you wrote. Maybe it should be made part of our unit tests invoked on make check
, but only when Python is available?
valid_utf8() is used in multiple places including loader.c. Let's overread line_buf: $ perl -C0 -e 'print ":12345" x 170 . "a\xf1"' > t.pw && ../run/john t.pw
I guess you mean with code from prior to this PR's changes, to reproduce the original bug?
Yes, ASAN gets triggered prior to the patch. And it does not fire after the patch.
The script is slow. It takes 70 seconds using PyPy3 and 57 seconds after compilation with cython3 (without changes to the script: cython3 --embed -3 test_uni2.py && gcc -I/usr/include/python3.11/ test_uni2.c -lpython3.11
). I rewrote it in C to make it faster and more structured. It takes 3.7 seconds (I use sparse checks for some ranges; 13 seconds without them, but I still check more than the script). I'll add a commit tomorrow.
It takes 3.7 seconds (I use sparse checks for some ranges; 13 seconds without them, but I still check more than the script). I'll add a commit tomorrow.
I think try even sparser checks for under 1 second. Thanks!
Nice, CI found problems with symbols referenced in unicode.o
I did not have locally. I can reproduce them: --without-openssl
is required for ./configure
for that. I'll fix it.
Meanwhile a few words about the commit:
I did not feel comfortable inserting a lot of code into unit-tests.c
. But I needed functions to integrate with the report. So I included my file using #include
. (Afterthought: _test_valid_utf8
could use a local counter for tests and return it, so it would be ok to compile it separately. OTOH I also use hex()
from unit-tests.c
.)
In regular make
and Makefile.legacy
, I compile with actual unicode.o
instead of sources. Makefile.legacy
required -lcrypto
for MD4_Init
and friends. I picked it from command in the regular build.
I think memory.c
and common.c
do not need separate compilation. _JOHN_MISC_NO_LOG
is internal to misc.c
. So there should not be actual deviations from regular compilation for these files and .o
could be reused.
BTW I have idea that switch
could be replaced with explicit checks with the same "fall-through" style. Then the code would not need the lookup having the same number of checks in assembly.
ASAN build:
test 38 test_valid_utf8 - Performed 49022799 tests 0.76 seconds used
In regular build, the test takes under 0.4 seconds.
So I have 2 more changes in mind:
.o
for memory.c
and common.c
if
s on valuesIn regular
make
andMakefile.legacy
, I compile with actualunicode.o
instead of sources.Makefile.legacy
required-lcrypto
forMD4_Init
and friends. I picked it from command in the regular build.
This part was wrong. Now I compile tests/unicode.o
separately defining UNICODE_NO_OPTIONS
and NOT_JOHN
, so it does not use options
and MD4_*
and I don't need -lcrypto
and/or fake symbols.
Guarding with #ifndef UNICODE_NO_OPTIONS
was slightly incomplete. So one commit is dedicated to the fix.
I use unicode.o
as a dependency for tests/unicode.o
as a shortcut to reuse all dependencies including unused options.h
. I considered it to be a better solution than copy-pasting dependencies with or without modifications. unicode.o
should be built anyway in usual places where unit-tests
gets build.
Other way could be to add tests/unicode.o
as target near unicode.o
. Like this:
unicode.o tests/unicode.o: [...]
And the last commit is to remove lookup from valid_utf8()
rewriting switch
with explicit if
s. The code has similar structure with "fall-through" semantics. (And it looks strange if I try to forget that it is a switch
by nature.) Now default
case falls into 4+ bytes
and then gets filtered out in > 0xF4
check.
I don't insist on this change.
But I have an idea for even more obscure modification: switch
for E0/ED could be replaced with the following bit tricks.
if (*source == (0xE0 | (0xED & -(a > 0x9F)))) return 0;
As of E0 | ED
gives ED
, it is possible to OR values to get ED if the check gives true, and E0 otherwise. Fortunately the check for E0 complements the check for ED. So we don't need to check twice. (To be honest, I cannot formulate a proof easily despite that I inferred this trick myself and checked against the tests.) Just replacing constants, similar check can be written for F0/F4.
Simpler bit trick is ((a = *++srcptr) & 0xC0) == 0x80
to check for trailing bytes. & 0xC0
picks 2 highest bits, == 0x80
checks for 10
there. It might reduce number of branching instructions.
Some benchmark is needed to consider the tricks. Also I am not sure how it will play with future compiler optimizations/changes. So I am not going to commit them if they are not interesting/needed. Are they worth a benchmark?
@AlekseyCherepanov Is this ready to merge? @magnumripper Would you like to take another look before we merge?
It is ready to be merged. I'd like to hear from @magnumripper about my changes to unicode.h
and readability of the final code with if
s. I don't have any understanding if my change to unicode.h
is valid in relation to other code backing encodings. The change is based on local context only.
Ouch, I intended to remove the explanations from test_valid_utf8.c
but I did not include it into the commit.
I guess unicode.h
might be the best place to keep the info. It follows wider pattern that things are explained near interface for caller. Initially I did not think about it reading code directly in unicode.c
. But I had to read the comments in unicode.h
to understand return value for use in the test script.
I moved the comments to unicode.h
. Also I added a bit of elaboration for return value. And I added a link about noncharecters
(because I spent unexpected amount of time reading Wikipedia semi-randomly (e.g. about PUA) and had to find missing pieces to match my incomplete understanding with Python's behaviour).
We could rewrite commits to squash initial misplacement of comments and incorrect changes to Makefile.in
together with respective fixes into nicer commits. Should I do that?
If there are no objections then this PR can be merged.
We could rewrite commits to squash initial misplacement of comments and incorrect changes to
Makefile.in
together with respective fixes into nicer commits. Should I do that?
Yes, if you like. You may also fix the typo in noncharecters
. Thank you!
@magnumripper Please feel free to merge this when you feel it's ready, or let me know and I merge it.
I'm fine with you merging it, squashed or not.
typo in
noncharecters
Oh, nice catch! Thanks!
I fixed the typo. I merged fixes into respective commits.
After pushing I noticed that I wrote too long message for one commit, so I rewrote it and pushed again. Then I rebased the first commit without changes, so all six commits would be listed in a row. (I am puzzled about the rebase: git log
shows old date, github
shows that change is recent and I don't see the date. Local hash and github's hash are the same.) I guess the last step was not needed because list of commits in a pull request can be viewed in the "commits" tab. Rebasing I could use the latest john, but I did not.
I finished my changes here.
I am puzzled about the rebase:
git log
shows old date,github
shows that change is recent
Each commit has two dates on it, see git log --pretty=fuller
. If you want to update both when you amend a commit, use --reset-author
. You don't need to do that here - I merely include this explanation because you wondered.
valid_utf8()
reads up to 2 bytes far behind the end of string because it checks bytes right-to-left. I switched order to check left-to-right. It means that innerswitch
should be split. Also I moved around a fewif
s:< 0xC2
is checked before mainswitch
, so theswitch
is for almost valid starting bytes,> 0xF4
was moved intocase 4
now because other cases did not need it (in both versions).diff looks terrible. I guess review of full code would be easier.
To validate my changes, I wrote a script in python to call the function. To call the function, I compiled
unicode.c
alone as a dynamic library and used python'sctypes
.fake.c
is needed to resolve unused symbols belonging to other files in john.The script: