opensvc / multipath-tools

multipathd segmentation fault (-Bsymbolic-functions in LDFLAGS) #26

Closed athos-ribeiro closed 2 years ago

athos-ribeiro commented 2 years ago

Commit b29b5fd notes that the config struct should be initialized with init_config().

With b29b5fd applied to Ubuntu's multipath-tools, the multipathd service no longer starts at boot and crashes with a segmentation fault.

Applying 6236b5a improves the situation: the service can start again. However, the segmentation fault still occurs in specific cases, such as during autopkgtest runs of debian/tests/tgtbasedmpaths.

The issue seems to be triggered when multipathd reads its configuration in specific situations. I am still working on a minimal reproducer.

The following patch fixes these occurrences by calling init_config() before load_config(), as suggested in b29b5fd:

--- a/multipathd/main.c
+++ b/multipathd/main.c
@@ -3188,6 +3188,8 @@

    if (verbosity)
        libmp_verbosity = verbosity;
+   if (init_config(DEFAULT_CONFIGFILE))
+       return -1;
    conf = load_config(DEFAULT_CONFIGFILE);
    if (verbosity)
        libmp_verbosity = verbosity;

Finally, here is a backtrace of the code path that leads to the segfault:

Forwarding 1 uevents

Thread 7 "multipathd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6e76600 (LWP 9550)]
0x00007ffff7f8617a in ?? () from /lib/libmultipath.so.0
#0  0x00007ffff7f8617a in ?? () from /lib/libmultipath.so.0
#1  0x00007ffff7f8633e in uevent_dispatch () from /lib/libmultipath.so.0
#2  0x000055555555d464 in uevqloop (ap=0x55555559f270) at ./multipathd/main.c:1575
#3  0x00007ffff7bfcb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4  0x00007ffff7c8eb80 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 7 (Thread 0x7ffff6e76600 (LWP 9550) "multipathd"):
#0  0x00007ffff7f8617a in ?? () from /lib/libmultipath.so.0
No symbol table info available.
#1  0x00007ffff7f8633e in uevent_dispatch () from /lib/libmultipath.so.0
No symbol table info available.
#2  0x000055555555d464 in uevqloop (ap=0x55555559f270) at ./multipathd/main.c:1575
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {140737335748096, -2300306751447238166, 140737335748096, 0, 140737349929040, 140737488346512, 2300291561074588138, 5384336589863647722}, __mask_was_saved = 0}}, __pad = {0x7ffff6e75ab0, 0x0, 0x0, 0x0}}
        __cancel_routine = 0x55555555d170 <rcu_unregister>
        __cancel_arg = 0x0
        __not_first_call = <optimized out>
#3  0x00007ffff7bfcb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737488346160, -2300306751447238166, 140737335748096, 0, 140737349929040, 140737488346512, 2300291561059908074, 2300288611059235306}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#4  0x00007ffff7c8eb80 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.

Thread 6 (Thread 0x7ffff6e87600 (LWP 9549) "multipathd"):
#0  0x00007ffff7c4d868 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7ffff6e867d0, rem=rem@entry=0x7ffff6e867d0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
        r = <optimized out>
#1  0x00007ffff7c526e7 in __GI___nanosleep (req=req@entry=0x7ffff6e867d0, rem=rem@entry=0x7ffff6e867d0) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
        ret = <optimized out>
#2  0x00007ffff7c5261e in __sleep (seconds=0, seconds@entry=1) at ../sysdeps/posix/sleep.c:55
        save_errno = 0
        max = 4294967295
        ts = {tv_sec = 0, tv_nsec = 582993006}
#3  0x00005555555671ad in checkerloop (ap=0x55555559f270) at ./multipathd/main.c:2518
        diff_time = {tv_sec = 0, tv_nsec = 17023}
        start_time = {tv_sec = 2298, tv_nsec = 60196451}
        num_paths = <optimized out>
        max_checkint = <optimized out>
        end_time = {tv_sec = 2298, tv_nsec = 60213474}
        strict_timing = 0
        rc = <optimized out>
        ticks = <optimized out>
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {140737335817728, -2300306751447238166, 140737335817728, 0, 140737349929040, 140737488346512, 2300291448935676394, 5384336610387425770}, __mask_was_saved = 0}}, __pad = {0x7ffff6e86ab0, 0x0, 0x0, 0x0}}
        __cancel_routine = 0x55555555d170 <rcu_unregister>
        __cancel_arg = 0x0
        __not_first_call = <optimized out>
        vecs = 0x55555559f270
        pp = <optimized out>
        count = <optimized out>
        i = <optimized out>
        last_time = {tv_sec = 2298, tv_nsec = 60196451}
        conf = <optimized out>
        foreign_tick = 5
        use_watchdog = false
#4  0x00007ffff7bfcb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737488346160, -2300306751447238166, 140737335817728, 0, 140737349929040, 140737488346512, 2300291448853887466, 2300288611059235306}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#5  0x00007ffff7c8eb80 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.

Thread 5 (Thread 0x7ffff6ec8600 (LWP 9548) "multipathd"):
#0  0x00007ffff7c80d7f in __GI___poll (fds=0x7ffff6ec78b8, nfds=1, timeout=30000) at ../sysdeps/unix/sysv/linux/poll.c:29
        sc_ret = -516
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
#1  0x00007ffff7f8110f in uevent_listen () from /lib/libmultipath.so.0
No symbol table info available.
#2  0x000055555555d39d in ueventloop (ap=0x555555575940) at ./multipathd/main.c:1564
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {140737336083968, -2300306751447238166, 140737336083968, 0, 140737349929040, 140737488346512, 2300291482691434986, 5384336589687486954}, __mask_was_saved = 0}}, __pad = {0x7ffff6ec7ab0, 0x0, 0x0, 0x0}}
        __cancel_routine = 0x55555555d170 <rcu_unregister>
        __cancel_arg = 0x0
        __not_first_call = <optimized out>
        udev = 0x555555575940
#3  0x00007ffff7bfcb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737488346160, -2300306751447238166, 140737336083968, 0, 140737349929040, 140737488346512, 2300291482676754922, 2300288611059235306}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#4  0x00007ffff7c8eb80 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.

Thread 4 (Thread 0x7ffff6ed9600 (LWP 9547) "multipathd"):
#0  0x00007ffff7c80d7f in __GI___poll (fds=fds@entry=0x7ffff6ed8828, nfds=nfds@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
        sc_ret = -516
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
#1  0x000055555556214d in poll (__timeout=-1, __nfds=1, __fds=0x7ffff6ed8828) at /usr/include/x86_64-linux-gnu/bits/poll2.h:39
No locals.
#2  dmevent_loop () at ./multipathd/dmevents.c:299
        r = <optimized out>
        i = <optimized out>
        pfd = {fd = 6, events = 1, revents = 0}
        dev_evt = <optimized out>
#3  0x000055555556276a in wait_dmevents (unused=<optimized out>) at ./multipathd/dmevents.c:397
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {140737336153600, -2300306751447238166, 140737336153600, 0, 140737349929040, 140737488346512, 2300291477861693930, 5384336608620575210}, __mask_was_saved = 0}}, __pad = {0x7ffff6ed8ab0, 0x0, 0x0, 0x0}}
        __cancel_routine = 0x55555555d170 <rcu_unregister>
        __cancel_arg = <optimized out>
        __not_first_call = <optimized out>
        r = <optimized out>
#4  0x00007ffff7bfcb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737488346160, -2300306751447238166, 140737336153600, 0, 140737349929040, 140737488346512, 2300291477844916714, 2300288611059235306}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#5  0x00007ffff7c8eb80 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.

Thread 3 (Thread 0x7ffff6eea600 (LWP 9546) "multipathd"):
#0  0x00007ffff7c80e7e in __ppoll (fds=0x7ffff0005140, nfds=nfds@entry=2, timeout=<optimized out>, timeout@entry=0x555555574040 <sleep_time>, sigmask=sigmask@entry=0x7ffff6ee94d0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
        sc_ret = -514
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
        tval = {tv_sec = 4, tv_nsec = 985922914}
#1  0x000055555556aba3 in ppoll (__ss=0x7ffff6ee94d0, __timeout=0x555555574040 <sleep_time>, __nfds=2, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:64
No locals.
#2  uxsock_listen.constprop.0.isra.0 (ux_sock=ux_sock@entry=5, trigger_data=trigger_data@entry=0x55555559f270, uxsock_trigger=<optimized out>) at ./multipathd/uxlsnr.c:358
        c = <optimized out>
        tmp = <optimized out>
        i = 2
        n_pfds = 2
        num_clients = 0
        poll_count = <optimized out>
        rlen = 0
        inbuf = 0x0
        reply = 0x0
        mask = {__val = {18446744067267083772, 0 <repeats 15 times>}}
        old_clients = 0
        sequence_nr = 1
        wds = {conf_wd = 1, dir_wd = -1}
#3  0x00005555555691a7 in uxlsnrloop (ap=0x55555559f270) at ./multipathd/main.c:1663
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {140737336223232, -2300306751447238166, 140737336223232, 0, 140737349929040, 140737488346512, 2300291503178512874, 5384336613721635306}, __mask_was_saved = 0}}, __pad = {0x7ffff6ee99b0, 0x0, 0x0, 0x0}}
        __cancel_routine = 0x5555555681e0 <uxsock_cleanup>
        __cancel_arg = 0x5
        __not_first_call = <optimized out>
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {140737336223232, -2300306751447238166, 140737336223232, 0, 140737349929040, 140737488346512, 2300291503178512874, 5384336613730810346}, __mask_was_saved = 0}}, __pad = {0x7ffff6ee9ab0, 0x0, 0x0, 0x0}}
        __cancel_routine = 0x55555555d170 <rcu_unregister>
        __cancel_arg = 0x0
        __not_first_call = <optimized out>
        ux_sock = 5
#4  0x00007ffff7bfcb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737488346160, -2300306751447238166, 140737336223232, 0, 140737349929040, 140737488346512, 2300291503077849578, 2300288611059235306}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#5  0x00007ffff7c8eb80 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.

Thread 2 (Thread 0x7ffff76f0600 (LWP 9545) "multipathd"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
No locals.
#1  0x00007ffff7f190e2 in ?? () from /lib/x86_64-linux-gnu/liburcu.so.8
No symbol table info available.
#2  0x00007ffff7bfcb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737488346064, -2300306751447238166, 140737344636416, 2, 140737349929040, 140737488346416, 2300290404639963626, 2300288611059235306}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#3  0x00007ffff7c8eb80 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals.

Thread 1 (Thread 0x7ffff76f1a80 (LWP 9541) "multipathd"):
#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x555555574248 <config_cond.lto_priv+40>) at ./nptl/futex-internal.c:57
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
        resultvar = <optimized out>
        __arg6 = <optimized out>
        __arg5 = <optimized out>
        __arg4 = <optimized out>
        __arg3 = <optimized out>
        __arg2 = <optimized out>
        __arg1 = <optimized out>
        _a6 = <optimized out>
        _a5 = <optimized out>
        _a4 = <optimized out>
        _a3 = <optimized out>
        _a2 = <optimized out>
        _a1 = <optimized out>
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x555555574248 <config_cond.lto_priv+40>) at ./nptl/futex-internal.c:87
        err = <optimized out>
        clockbit = 256
        op = 393
        err = <optimized out>
        clockbit = <optimized out>
        op = <optimized out>
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x555555574248 <config_cond.lto_priv+40>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
No locals.
#3  0x00007ffff7bfbac1 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x555555574260 <config_lock.lto_priv>, cond=0x555555574220 <config_cond.lto_priv>) at ./nptl/pthread_cond_wait.c:503
        spin = 0
        buffer = {__routine = 0x7ffff7bfb7a0 <__condvar_cleanup_waiting>, __arg = 0x7fffffffde00, __canceltype = 1431958128, __prev = 0x0}
        cbuffer = {wseq = 36, cond = 0x555555574220 <config_cond.lto_priv>, mutex = 0x555555574260 <config_lock.lto_priv>, private = 0}
        err = <optimized out>
        g = 0
        flags = <optimized out>
        g1_start = <optimized out>
        maxspin = 0
        signals = <optimized out>
        result = 0
        wseq = 36
        seq = 18
        private = 0
        maxspin = <optimized out>
        err = <optimized out>
        result = <optimized out>
        wseq = <optimized out>
        g = <optimized out>
        seq = <optimized out>
        flags = <optimized out>
        private = <optimized out>
        signals = <optimized out>
        done = <optimized out>
        g1_start = <optimized out>
        spin = <optimized out>
        buffer = <optimized out>
        cbuffer = <optimized out>
        s = <optimized out>
#4  ___pthread_cond_wait (cond=0x555555574220 <config_cond.lto_priv>, mutex=0x555555574260 <config_lock.lto_priv>) at ./nptl/pthread_cond_wait.c:627
No locals.
#5  0x000055555556bf5e in child.constprop.0 (param=0x0) at ./multipathd/main.c:3214
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {93824992539248, 5384336485313804778, 93824992369488, 4294966296, 93824992343255, 93824992339584, 2300306752503937514, 5384336612103813610}, __mask_was_saved = 0}}, __pad = {0x7fffffffe3f0, 0x0, 0x7ffff7fdf670 <_dl_audit_preinit>, 0xfffffffffffffff8}}
        __cancel_routine = <optimized out>
        __cancel_arg = <optimized out>
        __not_first_call = 0
        log_attr = {__size = "\001\000\000\000\377\177\000\000\026\000\000\000\000\000\000\000\t\000\000\000\000\000\000\000P\340\377\377\377\177", '\000' <repeats 25 times>, __align = 140733193388033}
        misc_attr = {__size = '\000' <repeats 17 times>, "\020", '\000' <repeats 16 times>, "\001", '\000' <repeats 20 times>, __align = 0}
        uevent_attr = {__size = '\000' <repeats 17 times>, "\020", '\000' <repeats 16 times>, "\004", '\000' <repeats 20 times>, __align = 0}
        vecs = 0x55555559f270
        rc = <optimized out>
        conf = <optimized out>
        envp = <optimized out>
        state = <optimized out>
        exit_code = 1
#6  0x000055555555c87f in main (argc=5, argv=<optimized out>) at ./multipathd/main.c:3421
        arg = <optimized out>
        err = 0
        foreground = 1
        conf = <optimized out>

athos-ribeiro commented 2 years ago

The issue is caused by Ubuntu injecting -Bsymbolic-functions into LDFLAGS, so this is definitely something to discuss on the packaging side. Sorry for the noise :)

mwilck commented 2 years ago

Right, b29b5fd relies on the linker's default behavior, in which a symbol defined in the executable overrides the library-provided symbol of the same name.

Using -Bsymbolic-functions is probably meant as an optimization that avoids unnecessary symbol lookups at startup. I wonder whether we (as upstream) have some way to ensure that symbols like get_multipath_config are still treated with the default behavior, even if the rest of the library is linked with -Bsymbolic-functions.

Hints appreciated.

mwilck commented 2 years ago

get_multipath_config() and put_multipath_config() are defined as weak symbols in the source code. I expected weak symbols to be unaffected by -Bsymbolic-functions, but that's not the case: the linker directly replaces the library's references to get_multipath_config() with libmp_get_multipath_config(), so multipathd's own definition is bypassed, which is of course wrong.

I have found no way to annotate these functions so that they are exempt from -Bsymbolic-functions. There are recent proposals to add a -Bsymbolic-non-weak-functions option, which might be exactly what we need here, but it doesn't seem to be available yet.

mwilck commented 2 years ago

@athos-ribeiro, after a discussion with our toolchain experts: could you try this and see whether the error goes away?

LDFLAGS='-Wl,-Bsymbolic-functions -Wl,--export-dynamic-symbol=get_multipath_config -Wl,--export-dynamic-symbol=put_multipath_config'

The --export-dynamic-symbol option should override the effect of -Bsymbolic-functions for the listed symbols. It worked in my testing, but I'd like you to confirm.
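
The behaviors discussed in this thread can be condensed into a toy model of how a call made inside the shared library to one of its own exported functions gets bound. This is not linker code and glosses over many details (data symbols, symbol versioning, how the executable's definition ends up in its dynamic symbol table). All names here (bind_library_call, struct link_options, the symbol tables) are hypothetical; the --export-dynamic-symbol rule simply encodes the claim above that it overrides -Bsymbolic-functions for the listed names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: where a call from inside the library ends up. */
enum binding {
	BOUND_TO_EXECUTABLE,	/* the executable's definition interposes */
	BOUND_TO_LIBRARY,	/* the library's own definition is used */
	UNRESOLVED,
};

struct link_options {
	bool symbolic_functions;		/* models -Bsymbolic-functions */
	const char *const *export_dynamic;	/* --export-dynamic-symbol names, NULL-terminated */
};

/* Hypothetical symbol sets for the scenario in this issue. */
static const char *const exe_defines[] = {
	"get_multipath_config", "put_multipath_config", NULL
};
static const char *const lib_defines[] = {
	"get_multipath_config", "put_multipath_config",
	"libmp_get_multipath_config", NULL
};

static bool in_list(const char *const *list, const char *name)
{
	for (; list && *list; list++)
		if (strcmp(*list, name) == 0)
			return true;
	return false;
}

enum binding bind_library_call(const char *name, const struct link_options *opts)
{
	bool lib_has = in_list(lib_defines, name);

	/* Link time: with -Bsymbolic-functions the call is bound to the
	 * library's own definition, weak or not, unless the name is
	 * exempted via --export-dynamic-symbol. */
	if (opts->symbolic_functions && lib_has &&
	    !in_list(opts->export_dynamic, name))
		return BOUND_TO_LIBRARY;

	/* Run time: the dynamic linker searches the executable first; the
	 * first definition found wins, whether weak or strong. Assumes the
	 * executable exports its definition. */
	if (in_list(exe_defines, name))
		return BOUND_TO_EXECUTABLE;
	return lib_has ? BOUND_TO_LIBRARY : UNRESOLVED;
}
```

With the default options get_multipath_config binds to the executable's definition; with symbolic_functions set it binds to the library's; adding the name to export_dynamic restores the default behavior.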

mwilck commented 2 years ago

I've pushed a tentative patch to https://github.com/mwilck/multipath-tools/tree/dynamic-list. Please check whether you can compile and run that code with LDFLAGS='-Wl,-Bsymbolic-functions'.

mwilck commented 2 years ago

I'm not going to make an official patch with -Wl,--dynamic-list. In the meantime, I had a discussion with one of the toolchain maintainers at SUSE, and he provided a detailed explanation of -Bsymbolic-functions and why it shouldn't be used. See this Stack Overflow post.

mwilck commented 2 years ago

Closing per previous comment.