landley / toybox

toybox
http://landley.net/toybox
BSD Zero Clause License
2.33k stars 328 forks source link

Crash in config2help during build #458

Open thesamesam opened 9 months ago

thesamesam commented 9 months ago

I originally reported this downstream in Gentoo at https://bugs.gentoo.org/914589.

x86_64-pc-linux-gnu-gcc -O2 -pipe -march=native -fdiagnostics-color=always -frecord-gcc-switches -Wreturn-type -Wall -Wundef -Werror=implicit-function-declaration -Wno-char-subscripts -Wno-pointer-sign -funsigned-char -Wl,-O1 -Wl,--as-needed -Wl,--defsym=__gentoo_check_ldflags__=0 -Wl,-z,pack-relative-relocs -xc - -lssl -o generated/unstripped/libprobe
echo -n  -lssl

generated/{Config.in,newtoys.hx86_64-pc-linux-gnu-gcc scripts/mkflags.c -o generated/unstripped/mkflags
,flags.h,globals.hx86_64-pc-linux-gnu-gcc scripts/mktags.c -o generated/unstripped/mktags
,tags.hx86_64-pc-linux-gnu-gcc scripts/config2help.c -o generated/unstripped/config2help
,help.hscripts/make.sh: line 266:   785 Segmentation fault      (core dumped) "$UNSTRIPPED"/config2help Config.in $KCONFIG_CONFIG > "$GENDIR"/help.h
make: *** [Makefile:17: toybox] Error 1
 * ERROR: sys-apps/toybox-0.8.10::gentoo failed (compile phase):
 *   emake failed

If I jam HOSTCC="gcc -ggdb3", I get this when running generated/unstripped/config2help Config.in .config:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7f134c5 in ?? () from /usr/lib64/libc.so.6
(gdb) bt
#0  0x00007ffff7f134c5 in ?? () from /usr/lib64/libc.so.6
#1  0x00007ffff7e62840 in strndup () from /usr/lib64/libc.so.6
#2  0x00005555555563ac in main (argc=3, argv=0x7fffffffe2d8) at scripts/config2help.c:365
(gdb)

Possibly slightly more useful from valgrind:

==263576== Invalid read of size 1
==263576==    at 0x4848948: strnlen (vg_replace_strmem.c:464)
==263576==    by 0x492C83F: strndup (strndup.c:42)
==263576==    by 0x10A3AB: main (config2help.c:365)
==263576==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==263576==
==263576==
==263576== Process terminating with default action of signal 11 (SIGSEGV)
==263576==  Access not within mapped region at address 0x0
==263576==    at 0x4848948: strnlen (vg_replace_strmem.c:464)
==263576==    by 0x492C83F: strndup (strndup.c:42)
==263576==    by 0x10A3AB: main (config2help.c:365)
==263576==  If you believe this happened as a result of a stack
==263576==  overflow in your program's main thread (unlikely but
==263576==  possible), you can try to increase the size of the
==263576==  main thread stack using the --main-stacksize= flag.
==263576==  The main thread stack size used in this run was 8388608.

From asan/ubsan (probably the most helpful, as you get the naughtiness immediately rather than when it tries to read later on):

# HOSTCC="gcc -ggdb3 -fsanitize=address,undefined" make
scripts/genconfig.sh
scripts/make.sh
Library probe
generated/{Config.in,newtoys.h,flags.h,globals.h,tags.h+ hostcomp config2help
+ '[' '!' -f generated/unstripped/config2help ']'
+ do_loudly gcc -ggdb3 -fsanitize=address,undefined scripts/config2help.c -o generated/unstripped/config2help
+ '[' -n '' ']'
+ echo -n ''
+ gcc -ggdb3 -fsanitize=address,undefined scripts/config2help.c -o generated/unstripped/config2help
+ isnewer help.h generated/Config.in
+ '[' -e generated/help.h ']'
+ echo -n ,help.h
,help.h+ DIDNEWER=,
+ generated/unstripped/config2help Config.in .config
scripts/config2help.c:365:16: runtime error: null pointer passed as argument 1, which is declared to never be null
    #0 0x55655c3a6fc5 in main scripts/config2help.c:365
    #1 0x7efe996f2346  (/usr/lib64/libc.so.6+0x24346)
    #2 0x7efe996f2408 in __libc_start_main (/usr/lib64/libc.so.6+0x24408)
    #3 0x55655c3a2604 in _start (/var/tmp/portage/sys-apps/toybox-0.8.10/work/toybox-0.8.10/generated/unstripped/config2help+0xc604)

+ exit 1
make: *** [Makefile:17: toybox] Error 1
landley commented 9 months ago

The most recent commit to that file was 4 and 1/2 years ago, and we've had 10 releases since then.

The failure is that your libc's strndup() returned NULL. I have two questions about this, both of which are "how?"

The first "how?" is that on an allyesconfig build "toybox help | wc -c" says 154k total help text. For comparison, when I run my devuan x86-64's "sleep" command /proc/$PID/smaps_rollup says 1592 kB referenced. If running this binary exhausts memory, you couldn't have compiled it. (Did you dynamically set some crazy-low ulimit that's an order of magnitude lower than what the compiler is allowed to consume? Some sort of container environment bug, maybe?)

The second "how?" is that running out of physical memory has only ever returned NULL from libc allocation functions on NOMMU Linux. With an MMU, Linux libc suballocates from the heap and extends it by allocating more virtual address space, then the physical memory is dynamically allocated later by the page fault handler during a write access to the COW mapping of the zero page (or the modern eqvivalent that's more SMP friendly and avoids the reference count wrapping issue big iron hit, but conceptually the same). The OOM killer picks a process to SIGKILL at this time if it can't satisfy the allocation, usually the one that asked for more memory but there's a bucket of terrible heuristics that gets stirred from time to time to make it bigger.

Recognition of "we actually don't have enough memory" happpens long after the allocation of virtual space has returned, unless you twiddle some "make the system extremely unstable" knobs ("disable overcommit") under /proc which basically ask programs to speculatively fail when they otherwise would have succeeded. (I.E. it never makes MORE stuff work, it only ever makes LESS stuff work. It changes the Linux programming API.)

Toybox proper always checks those return values (xmalloc() and friends) because it supports running on NOMMU systems that can arbitrarily return NULL due to memory fragmentation and similar, and because commands exiting with failure happens all the time, ala "echo > /dev/full" and so on. But we don't support building on nommu systems, which is why I didn't bother to check for a NULL return from allocations in build tools processing known inputs: if it does fail, all I can do is exit() because the system is completely horked. I could change the line to exit with a more graceful error message, but your build would still break.

It's theoretically possible that the failing call to strndup() has some other reason for returning NULL? Maybe it got dodgy arguments (a negative length maybe?) and I'm curious what its inputs were when it returned NULL. It's also possible that the heap was corrupted at some point before then and the malloc failed chasing pointers, except A) glibc dies with an assert() when that happens (rather than allocation functions randomly returning NULL), B) you ran it under valgrind which should catch any out of bounds write even to elsewhere in the heap if it's not cleanly within an allocation boundary.

When scripts/make.sh calls mkflags it snapshots its input with tee, producing generated/flags.raw for later analysis, because flag processing is tricksy and occasionally non-obvious and has gone wrong a few times over the years (albeit when adding new commands, not when building a release version on someone else's system). But I never bothered doing that with config2help.c because in the 9 years since I replaced the python build script with a C implementation it's literally never come up.

Still, if you wanted to add a call to tee into the pipeline on scripts/make.sh line 266 and send me the input going to config2help, I can try to reproduce your issue here?

landley commented 9 months ago

Ah, after going and reading the gentoo bug and then coming back and rereading this one, it's not strndup() returning NULL, it's strndup() being CALLED on a null. My bad, I misread, and that's different. Hmmm...

The strndup(usage) is gated by if (catch->help && (that = keyword("usage:", catch->help->data))) and although keyword("usage:") can return NULL we don't enter the block when it does. The pointer "throw" is initialized to 0 and re-initialized that way each outer loop, and inside the block if (!throw) usage = that; precedes the strndup(). The only way we get a nonsense usage value is by looping back around with throw already set, which should never happen but apparently is?

(This is vestigial code, it's for collating the help text from sub-options like busybox has a zillion of, which toybox started eliminating years ago and has almost none of left, ala commits 16c0ba51db1c and 1aaef2d2b728 and so on. This part of the code glued together multiple usage: lines to produce one usage: line listing all the command line options of both. 2/3 of what this does is obsolete now and I plan to remove it at some point, that's part of the reason it hasn't gotten a lot of attention in forever...)

Hmmm... I think for what you're seeing to trigger, you'd have to have a command with no usage: line with a sub-option that DID have a usage: line, and then it would try to glue the additional usage line to nothing. Which should never happen because it's broken input (the help text in the config entries is wrong), but the code should produce a way more graceful error message if so?

Your build is still going to break but it should do it with an "impossible input" error message saying which command it was processing at the time so you can track down what help text change made it go weird.

(It's also theoretically possible that your .config could trigger a previously untested combo, providing impossible inputs that way, but that's still a bug in the input if so. The config dependencies shouldn't allow that...)

thesamesam commented 9 months ago

Interestingly, 0.8.9 now breaks for me too when I try it again, which is consistent with what you're saying ofc. I guess newer glibc triggers it.

landley commented 9 months ago

Could be new libc, new compiler, a change in the runtime environment... I still want to fix it. It's just awkward without having locally reproduced it.

Rather than try to fix the old help text collating tool nobody's looked at in years, I've shuffled the TODO list around a bit and am looking at removing the remaining command subfeatures that need collating, or at least eliminating the separate help text so this tool doesn't need to stitch them together anymore. That said, there are several of them: MKNOD_Z, ID_Z, MKDIR_Z, and MKFIFO_Z add the -Z options for selinux, and SORT_FLOAT adds the -g option. Possibly there's some OTHER kind of signaling I can do here? (There's also some other stuff like PASSWD_SAD and WGET_LIBTLS that don't really NEED to add help text, it adds an automatic capability that doesn't really need to be separately documented. The first is the terrible password checking heuristic and the second is https as well as http support.)

Gimme a couple days to try to clear a couple other pending todo items and think about this one a bit. Poke me monday if I haven't fixed this by then?

thesamesam commented 8 months ago

ping as requested (although I'm in no hurry, honest)