rui314 / mold

Mold: A Modern Linker 🦠

Linking musl with mold causes issues with global variables from libc #1071

Closed aabacchus closed 1 year ago

aabacchus commented 1 year ago

mold version: 2.0.0
musl version: 1.2.4

I recently rebuilt musl and used mold to link it, and subsequently experienced segfaults and bugs in a lot of unrelated programs. After some digging, I found that all of the problems involved global variables exported by musl (program_invocation_short_name and optind in particular). Using a different linker to link musl fixed the problems.

Interestingly, programs built with clang didn't have these problems. Consider this C program:

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>

int
main(void) {
    puts(program_invocation_short_name);
    return 0;
}

This program, built with GCC against musl linked with mold, segfaults when puts tries to dereference a NULL pointer.

The difference between clang and GCC is how the global is accessed. GCC does this:

        movq    program_invocation_short_name(%rip), %rdi
        call    puts@PLT

but clang does this:

        movq    program_invocation_short_name@GOTPCREL(%rip), %rax
        movq    (%rax), %rdi
        call    puts@PLT

I have confirmed that the use of @GOTPCREL fixes the GCC program.
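One way to picture the two access patterns is a toy model in plain C. It assumes that GCC's direct access reads a separate copy of the variable that lives in the executable, while clang's access loads an address from a pointer slot that refers to libc's own definition. All names here (lib_storage, exe_copy, got_slot, run_model) are hypothetical, and nothing below involves a real linker or libc; it is only a sketch of how the two reads could disagree.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, no real linker or libc involved. */

/* Storage that the library initialises at start-up. */
static char *lib_storage;

/* Separate storage in the executable (assumed: what a copy of the variable would be). */
static char *exe_copy;

/* A GOT-like slot holding the address of the library's definition. */
static char **got_slot = &lib_storage;

/* Stand-in for libc start-up code writing through its own reference. */
static void lib_init(char *name) { lib_storage = name; }

/* GCC-style access in this model: read the executable's own storage directly. */
static char *read_direct(void) { return exe_copy; }

/* clang-style access in this model: load the address from the slot, then dereference. */
static char *read_via_got(void) { return *got_slot; }

/* Initialise through the library and compare what each access style sees. */
static void run_model(void) {
    static char name[] = "prog";
    lib_init(name);
    assert(read_via_got() == name); /* sees the initialised value */
    assert(read_direct() == NULL);  /* still sees the untouched copy */
}
```

Calling run_model() from a main function exercises both paths. In a correctly linked system both reads are expected to agree; the model only shows what happens when the library and the executable end up using different storage.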

The first version always gets NULL; the second gets the correct value initialised by musl. The same happens with optind: in GCC-built programs, optind is always 1 even after calling getopt, whereas clang-built programs read the updated value. This is quickly approaching the limits of my understanding. Please let me know if I can help with more testing.
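The optind symptom can be checked with a small program along these lines. This is a minimal sketch, not the reporter's actual test: the helper name optind_after_parse and the fixed argument vector are made up for illustration. It parses a single -a option followed by one operand and returns optind afterwards.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: parse a fixed argument vector and return optind afterwards. */
static int optind_after_parse(void) {
    char arg0[] = "prog", arg1[] = "-a", arg2[] = "file";
    char *argv[] = { arg0, arg1, arg2, NULL };
    int argc = 3;

    optind = 1; /* start scanning at the first argument */
    while (getopt(argc, argv, "a") != -1)
        ; /* consume the single -a option */

    return optind; /* index of the first non-option argument */
}
```

With a working libc the function should return 2, the index of "file". On the setup described above, a GCC-built caller would presumably still see 1, because it reads a different location from the one getopt updates.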

This happened to me once before a few months ago, but since then I had forgotten how I fixed it.

rui314 commented 1 year ago

I couldn't reproduce the issue on my machine, so I need your input files. Can you run the last link command with --repro (or -Wl,--repro)? With that option, mold collects all input object files and puts them into a tar file. Please upload the generated tar file here so that I can download it. Thanks.

aabacchus commented 1 year ago

Attached is the tarball for the program which segfaults for me. Would you rather have the tarball from linking musl itself?

gcc_bad.repro.tar.gz

Statically linked executables don't have this problem.

rui314 commented 1 year ago

I built your program with the given tarball, and the resulting executable worked without crashing in my Alpine/musl Docker container. It is likely that the executable itself isn't actually broken.

You wrote that you built musl yourself. Are you sure your musl is fine?

aabacchus commented 1 year ago

in my Alpine/musl

If you provided a different libc.so, then yes it would have worked. Here is the tarball of the link step for musl: libc.so.repro.tar.gz

Yes, my musl is fine when linked with other linkers.

rui314 commented 1 year ago

It seems your reproducer really fails only when it is loaded by your musl libc.so. I built musl 1.2.4 myself and tried to run your program under my musl (i.e. run the program as /path/to/musl/builddir/libc.so gcc_bad) and it didn't crash.

The fact that your program didn't crash with other linkers doesn't immediately mean that your musl is fine; it might just happen to work for some programs (think C's undefined behavior).

How did you build your musl? What is your distro? How can I reproduce your binaries from scratch?

I also want to make sure you didn't apply any local patches to your musl.

aabacchus commented 1 year ago

To clarify, were you able to use my libc.so.repro.tar.gz to link a libc.so, which did not crash? That's bizarre. Maybe the compiler used for musl is also important.

It's not just this one-off; it's a large number of programs which crash or have bugs.

I have not patched musl; it is built normally (./configure; make in a fresh tarball reproduces the bug). My distribution is KISS, and we do patch mold to build only for amd64, but I can still reproduce this with that patch removed. If you'd like some brief instructions to set up a KISS chroot, let me know.

rui314 commented 1 year ago

I could reproduce the issue with the musl built from your object files, but that's not really debuggable because it's just .o files. It's not that different from libc.so in libc.so.repro.tar.gz from the debugging point of view.

If KISS Linux provides an official docker image, I can fire it up and try it myself.

aabacchus commented 1 year ago

We don't have an official docker image but I've created one. I think it should work if you run

docker run -it aabacchus/kiss sh

(the image is here). I'm not particularly familiar with Docker but I have tested it and can still reproduce the issue.

When you are in the image, you will have to do the following:

rui314 commented 1 year ago

Thanks for the info. How can I build musl with debug info?

aabacchus commented 1 year ago

Sure. You need to go into the repository for musl and edit its build script:

cd ~/repos/repo/core/musl/
vi build

Uncomment the :>nostrip line (which tells kiss not to strip the libraries) and uncomment the --enable-debug flag to configure. You should also delete the comment line above --enable-debug so that the flag is correctly passed to configure.

If you want to be able to step through the source while debugging, you'll need to add something like this to the top of the build file:

export CFLAGS="$CFLAGS -fdebug-prefix-map=$PWD=/usr/src/musl-1.2.4"

and then put the musl source in /usr/src/musl-1.2.4:

mkdir -p /usr/src
cd /usr/src
kiss d musl
tar xzf ~/.cache/kiss/sources/musl/musl-1.2.4.tar.gz

Finally you can kiss b musl.

LinuxUserGD commented 1 year ago

I recently rebuilt musl and used mold to link it, and subsequently experienced segfaults and bugs in a lot of random programs.

Mimalloc pointers (see https://github.com/microsoft/mimalloc/issues/360#issuecomment-1797331206 and https://bugs.gentoo.org/917089) are somehow pointing to the wrong heap space after linking musl with mold, causing segfaults when compiling with Clang.

I can reproduce it with a Gentoo stage3 tarball: https://distfiles.gentoo.org/releases/amd64/autobuilds/current-stage3-amd64-musl-llvm/

CMake Error at /usr/share/cmake/Modules/CMakeTestCCompiler.cmake:67 (message):
  The C compiler

    "/usr/lib/llvm/16/bin/x86_64-gentoo-linux-musl-clang"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /var/tmp/portage/sys-libs/libcxx-16.0.6/work/runtimes_build-abi_x86_64.amd64/CMakeFiles/CMakeScratch/TryCompile-ankcJC

    Run Build Command(s):/usr/bin/ninja -v cmTC_391b2 && [1/2] /usr/lib/llvm/16/bin/x86_64-gentoo-linux-musl-clang    -O2 -pipe -march=native -mtune=native -D_FORTIFY_SOURCE=3 -g0 -flto -MD -MT CMakeFiles/cmTC_391b2.dir/testCCompiler.c.o -MF CMakeFiles/cmTC_391b2.dir/testCCompiler.c.o.d -o CMakeFiles/cmTC_391b2.dir/testCCompiler.c.o -c /var/tmp/portage/sys-libs/libcxx-16.0.6/work/runtimes_build-abi_x86_64.amd64/CMakeFiles/CMakeScratch/TryCompile-ankcJC/testCCompiler.c
    [2/2] : && /usr/lib/llvm/16/bin/x86_64-gentoo-linux-musl-clang -O2 -pipe -march=native -mtune=native -D_FORTIFY_SOURCE=3 -g0 -flto -O2 -pipe -march=native -mtune=native -D_FORTIFY_SOURCE=3 -g0 -Wl,-O3 -Wl,--as-needed -Wl,--strip-debug -Wl,--undefined-version -Wl,--icf=safe -Wl,--threads=4 -Wl,--compress-debug-sections=none -fuse-ld=mold -rtlib=compiler-rt -unwindlib=libunwind CMakeFiles/cmTC_391b2.dir/testCCompiler.c.o -o cmTC_391b2   && :
    FAILED: cmTC_391b2 
    : && /usr/lib/llvm/16/bin/x86_64-gentoo-linux-musl-clang -O2 -pipe -march=native -mtune=native -D_FORTIFY_SOURCE=3 -g0 -flto -O2 -pipe -march=native -mtune=native -D_FORTIFY_SOURCE=3 -g0 -Wl,-O3 -Wl,--as-needed -Wl,--strip-debug -Wl,--undefined-version -Wl,--icf=safe -Wl,--threads=4 -Wl,--compress-debug-sections=none -fuse-ld=mold -rtlib=compiler-rt -unwindlib=libunwind CMakeFiles/cmTC_391b2.dir/testCCompiler.c.o -o cmTC_391b2   && :
    mimalloc: error: mi_free: pointer does not point to a valid heap space: 0x7f1fbad089b0
    clang-16: error: unable to execute command: Segmentation fault (core dumped)
    clang-16: error: linker command failed due to signal (use -v to see invocation)
    ninja: build stopped: subcommand failed.

rui314 commented 1 year ago

I built mold in the gentoo:stage3-musl docker container, replaced /usr/bin/ld with mold, built musl with emerge musl and built clang with emerge clang. All of it worked fine. I didn't observe any failures. How exactly can I reproduce the issue?

LinuxUserGD commented 1 year ago

@rui314 Should be reproducible in a stage3-musl-llvm chroot after recompiling llvm with binutils-plugin and recompiling musl with clang and ld.mold

COMMON_FLAGS="-O2 -pipe -march=native -mtune=native -D_FORTIFY_SOURCE=3 -g0 -flto"
CC="clang"
CXX="clang++"
CFLAGS="${COMMON_FLAGS}"
CXXFLAGS="${COMMON_FLAGS} -stdlib=libc++"
FCFLAGS="${COMMON_FLAGS}"
FFLAGS="${COMMON_FLAGS}"
LDFLAGS="${COMMON_FLAGS} ${LDLIBS} -Wl,-O3 -Wl,--as-needed -Wl,--strip-debug -Wl,--undefined-version -Wl,--icf=safe -Wl,--threads=4 -Wl,--compress-debug-sections=none -fuse-ld=mold -rtlib=compiler-rt -unwindlib=libunwind"
CHOST="x86_64-gentoo-linux-musl"
ACCEPT_KEYWORDS="amd64 ~amd64"
LD="ld.mold"
LC_MESSAGES=C
EMERGE_DEFAULT_OPTS="${EMERGE_DEFAULT_OPTS}"
MAKEOPTS="-j4"

emerge -1 =sys-libs/musl-1.2.3* sys-libs/libcxx --exclude=sys-devel/llvm

aabacchus commented 1 year ago

@LinuxUserGD isn't it mold segfaulting in your case, not a program linked to musl built with mold?

LinuxUserGD commented 1 year ago

@LinuxUserGD isn't it mold segfaulting in your case, not a program linked to musl built with mold?

Yes, mold segfaults with -flto when musl is compiled with mold. After rebuilding musl with lld, linking with mold completes without the mimalloc error.

Starting program: /usr/bin/ld.mold -pie --hash-style=gnu --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib/ld-musl-x86_64.so.1 -o a.out /lib/Scrt1.o /lib/crti.o /usr/lib/llvm/16/bin/../../../../lib/clang/16/lib/linux/clang_rt.crtbegin-x86_64.o -L/lib -L/usr/lib -plugin /usr/lib/llvm/16/bin/../lib/LLVMgold.so -plugin-opt=mcpu=skylake -plugin-opt=O2 -z relro -z now -O3 --as-needed --strip-debug --undefined-version --icf=safe --threads=4 --compress-debug-sections=none /tmp/check_cxx11-b34c02.o -lc++ -lm /usr/lib/llvm/16/bin/../../../../lib/clang/16/lib/linux/libclang_rt.builtins-x86_64.a --as-needed -lunwind --no-as-needed -lc /usr/lib/llvm/16/bin/../../../../lib/clang/16/lib/linux/libclang_rt.builtins-x86_64.a --as-needed -lunwind --no-as-needed /usr/lib/llvm/16/bin/../../../../lib/clang/16/lib/linux/clang_rt.crtend-x86_64.o /lib/crtn.o
[Detaching after fork from child process 232385]
mimalloc: error: mi_free: pointer does not point to a valid heap space: 0x7ffff7e36c50

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7fd7de7 in setjmp () from /lib/ld-musl-x86_64.so.1

aabacchus commented 1 year ago

@rui314 I made a docker image in which the above commands have already been run, so it contains the buggy musl. Just run

docker run -it aabacchus/test sh
cc test.c
./a.out

to reproduce.

rui314 commented 1 year ago

Thank you, everyone. I successfully reproduced the issue following your instructions. It's a challenging issue to debug, but it appears to be related to a subtle bug in weak symbol handling. I will prepare a fix.
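For readers unfamiliar with weak symbols, the sketch below shows one common pattern: a weak alias that gives a second, overridable name to the same storage. The thread does not say which weak symbols were involved in the bug, so this only illustrates the concept. The names demo_progname and demo_short_name are hypothetical, and the attribute syntax is a GCC/clang extension that assumes an ELF target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; this imitates a pattern and is not taken from musl or mold. */
char *demo_progname = NULL;

/* Weak alias: a second symbol name for the same storage (GCC/clang, ELF targets). */
extern char *demo_short_name __attribute__((weak, alias("demo_progname")));

/* Write through one name, then read back through the other. */
static char *write_then_read(char *value) {
    demo_progname = value;
    return demo_short_name;
}
```

Because both names refer to one object, a write through demo_progname is visible through demo_short_name. A linker has to preserve that relationship when it resolves references from other modules, which is the kind of bookkeeping where a subtle mistake could plausibly lead to symptoms like those reported here.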

rui314 commented 1 year ago

This was a bad bug, thank you again for reporting. I believe the above commit fixed the issue. Can you try again with the git head?

aabacchus commented 1 year ago

da3f5dd

It seems to be fixed, thank you!

LinuxUserGD commented 1 year ago

The mimalloc segfault is fixed by da3f5dd as well, thanks!