radvd-project / radvd

radvd | Official repository: https://github.com/radvd-project/radvd
https://radvd.litech.org/

radvd encounters segmentation fault on boot #174

Open johnkisch opened 2 years ago

johnkisch commented 2 years ago

Hello,

I'm having an issue where the radvd daemon (version 2.19) encounters a segmentation fault when the daemon is started at boot time.

The system with this issue is running Alpine Linux 3.15:

hydra:~# cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.15.0
PRETTY_NAME="Alpine Linux v3.15"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://bugs.alpinelinux.org/"
hydra:~#

NOTE: Alpine Linux is based on the musl C Standard Library. Alpine Linux uses the OpenRC init system.

I was able to capture a core dump; the backtrace reads as follows:

Reading symbols from /usr/sbin/radvd...
(No debugging symbols found in /usr/sbin/radvd)
[New LWP 2596]
Core was generated by `/usr/sbin/radvd -C /etc/radvd.conf -p /run/radvd/radvd.pid -u radvd'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007ffb03ac45fc in strcmp (l=0x7ffb03b0b140 "eth1", r=0x1495ba68 <error: Cannot access memory at address 0x1495ba68>) at src/string/strcmp.c:5
5       src/string/strcmp.c: No such file or directory.
(gdb) backtrace
#0  0x00007ffb03ac45fc in strcmp (l=0x7ffb03b0b140 "eth1", r=0x1495ba68 <error: Cannot access memory at address 0x1495ba68>) at src/string/strcmp.c:5
#1  0x0000561756851e04 in ?? ()
#2  0x0000561756858125 in ?? ()
#3  0x00005617568531c5 in ?? ()
#4  0x0000561756850108 in ?? ()
#5  0x00007ffb03a90a03 in libc_start_main_stage2 (main=0x56175684f610, argc=7, argv=0x7ffe1495d458) at src/env/__libc_start_main.c:94
#6  0x00005617568501c9 in ?? ()
#7  0x0000000000000007 in ?? ()
#8  0x00007ffe1495eebe in ?? ()
#9  0x00007ffe1495eece in ?? ()
#10 0x00007ffe1495eed1 in ?? ()
#11 0x00007ffe1495eee1 in ?? ()
#12 0x00007ffe1495eee4 in ?? ()
#13 0x00007ffe1495eef9 in ?? ()
#14 0x00007ffe1495eefc in ?? ()
#15 0x0000000000000000 in ?? ()
(gdb) 

After the system is finished booting, rc-service radvd stop; rc-service radvd start results in the daemon starting successfully. Perhaps the radvd daemon is being started before the eth1 device is up? This would make sense as to why the daemon starts successfully after the system finishes coming up.
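
One way to test the "started before eth1 exists" hypothesis is to delay the daemon until the interface shows up. The sketch below is only an illustration: the function name `wait_for_iface`, the 30 second default, and the optional directory argument are assumptions, not part of radvd or OpenRC. It checks only that the interface exists in sysfs, not that the link is up.

```shell
#!/bin/sh
# wait_for_iface IFACE [TIMEOUT_SECONDS] [NET_DIR]
# Hypothetical helper: returns 0 once IFACE appears under NET_DIR
# (default /sys/class/net), or 1 if TIMEOUT_SECONDS pass first.
wait_for_iface() {
    iface=$1
    timeout=${2:-30}
    netdir=${3:-/sys/class/net}
    elapsed=0
    while [ ! -e "${netdir}/${iface}" ]; do
        # Give up once the timeout has been reached.
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

Calling `wait_for_iface eth1 && rc-service radvd start` from a late boot script would show whether ordering alone makes the crash go away.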

I've also opened an issue with the folks at Alpine Linux, as seen here:

https://gitlab.alpinelinux.org/alpine/aports/-/issues/13570

Please let me know if there's any further information that I can provide.

Thanks.

stappersg commented 2 years ago

On Tue, Mar 01, 2022 at 11:19:36PM -0800, John Kisch wrote:

I'm having an issue where the radvd daemon encounters a segmentation fault when the daemon is started at boot time. The system with this issue is running Alpine Linux 3.15

Alpine Linux uses the OpenRC init system.

After the system is finished booting, rc-service radvd stop; rc-service radvd start results in the daemon starting successfully. Perhaps the radvd daemon is being started before the eth1 device is up? This would make sense as to why the daemon starts successfully after the system finishes coming up.

I've also opened an issue with the folks at Alpine Linux, as seen here:

https://gitlab.alpinelinux.org/alpine/aports/-/issues/13570

Please let me know if there's any further information that I can provide.

See if the OpenRC init system has something like

start this process after the network interfaces are up

Regards,
Geert Stappers
--
Silence is hard to parse

johnkisch commented 2 years ago

Geert Stappers wrote:

See if the OpenRC init system has something like "start this process after the network interfaces are up".

OpenRC has a parameter that can be set in /etc/rc.conf called rc_depend_strict which essentially will not allow services that depend on net to start before all interfaces are up. I have the following configured in my /etc/rc.conf:

# Do we allow any started service in the runlevel to satisfy the dependency
# or do we want all of them regardless of state? For example, if net.eth0
# and net.eth1 are in the default runlevel then with rc_depend_strict="NO"
# both will be started, but services that depend on 'net' will work if either
# one comes up. With rc_depend_strict="YES" we would require them both to
# come up.
rc_depend_strict="YES"

I still receive a segfault for radvd on boot with this set in /etc/rc.conf.

robbat2 commented 2 years ago

johnkisch commented 2 years ago

Hi Robin,

Whoops, missed that! This is radvd version 2.19.

hydra:~# radvd --version
Version: 2.19

Compiled in settings:
  default config file           "/etc/radvd.conf"
  default pidfile               "/run/radvd/radvd.pid"
  default logfile               "/var/log/radvd.log"
  default syslog facility       24
Please send bug reports or suggestions to Reuben Hawkins <reubenhwk@gmail.com>.
hydra:~#

I'm using ifupdown-ng for interface configuration. Unfortunately, neither rc_after=net.eth1 nor rc_need=net.eth1 in /etc/conf.d/radvd resolves the issue.

I'll go ahead and give building directly from the repo a shot here and post an update once I do so.

Thanks!

robbat2 commented 2 years ago

@johnkisch did the latest version work for you?

nopeno commented 2 years ago

My Alpine box has the same problem. My config is:

--------- /etc/network/interfaces ---------

auto lo
iface lo inet loopback

allow-hotplug wan0
auto wan0
iface wan0 inet static
    address 192.168.1.33
    netmask 255.255.255.0
    broadcast 192.168.1.255
    pre-up /sbin/ip link set wan0 up
    up ifup ppp0=telecom
    down ifdown ppp0=telecom
    post-down /sbin/ip link set wan0 up

auto ppp0
iface ppp0 inet ppp
    provider telecom

auto br0
iface br0 inet static
    bridge-ports fib1
    bridge-stp 0
    address 192.168.2.253
    netmask 255.255.255.0

iface fib0 inet manual
iface fib0 inet6 manual
iface fib1 inet manual
iface fib1 inet6 manual

--------- /etc/radvd.conf ---------

interface br0 {
    AdvSendAdvert on;
    AdvManagedFlag off;
    AdvOtherConfigFlag on;
    AdvLinkMTU 1480;
    prefix ::/64 {
        AdvOnLink on;
        AdvRouterAddr on;
    };
};

--------- and the following is dmesg ---------

[ 12.144768] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[ 12.147709] br0: port 1(fib1) entered blocking state
[ 12.147714] br0: port 1(fib1) entered disabled state
[ 12.147778] device fib1 entered promiscuous mode
[ 12.159114] br0: port 1(fib1) entered blocking state
[ 12.159118] br0: port 1(fib1) entered forwarding state
[ 12.685266] 8021q: 802.1Q VLAN Support v1.8
[ 12.685290] 8021q: adding VLAN 0 to HW filter on device fib1
[ 12.686407] 8021q: adding VLAN 0 to HW filter on device wan0
[ 12.717712] cfg80211: Loading compiled-in X.509 certificates for regulatory database
[ 12.720037] cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
[ 12.720501] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
[ 12.720506] cfg80211: failed to load regulatory.db
[ 13.704173] igb 0000:01:00.0 wan0: igb: wan0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[ 13.704478] IPv6: ADDRCONF(NETDEV_CHANGE): wan0: link becomes ready
[ 51.753464] mlx4_core 0000:04:00.0: VPD access failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update
[ 52.643188] Bridge firewalling registered
[ 52.667737] Initializing XFRM netlink socket
[ 66.521759] radvd[3472]: segfault at 4420dbfc ip 00007f3d36818f2e sp 00007ffd4420d2a8 error 4 in ld-musl-x86_64.so.1[7f3d367de000+48000]
[ 66.521794] Code: ff fe fe fe fe fe fe fe 49 bb 80 80 80 80 80 80 80 80 4c 0f af c0 eb ae 5b c3 40 0f b6 f6 48 89 f8 a8 07 74 14 48 85 d2 74 7c <0f> b6 08 39 f1 74 3f 48 ff c0 48 ff ca eb e8 48 85 d2 74 68 0f b6
[ 224.471533] radvd[3932]: segfault at 27b5ed0c ip 00007fcc17451f2e sp 00007ffd27b5e3b8 error 4 in ld-musl-x86_64.so.1[7fcc17417000+48000]
[ 224.471569] Code: ff fe fe fe fe fe fe fe 49 bb 80 80 80 80 80 80 80 80 4c 0f af c0 eb ae 5b c3 40 0f b6 f6 48 89 f8 a8 07 74 14 48 85 d2 74 7c <0f> b6 08 39 f1 74 3f 48 ff c0 48 ff ca eb e8 48 85 d2 74 68 0f b6
[ 650.240911] mlx4_core 0000:04:00.0: VPD access failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update
[ 729.496981] radvd[4300]: segfault at 6b95a33c ip 00007f392ac9bf2e sp 00007fff6b9599e8 error 4 in ld-musl-x86_64.so.1[7f392ac61000+48000]
[ 729.497018] Code: ff fe fe fe fe fe fe fe 49 bb 80 80 80 80 80 80 80 80 4c 0f af c0 eb ae 5b c3 40 0f b6 f6 48 89 f8 a8 07 74 14 48 85 d2 74 7c <0f> b6 08 39 f1 74 3f 48 ff c0 48 ff ca eb e8 48 85 d2 74 68 0f b6
[ 1260.259938] mlx4_core 0000:04:00.0: VPD access failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update
[ 1282.004566] mlx4_core 0000:04:00.0: VPD access failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update
[ 1412.341256] radvd[5736]: segfault at ffffffffdaa413bc ip 00007f1975b68f2e sp 00007ffcdaa40a68 error 5 in ld-musl-x86_64.so.1[7f1975b2e000+48000]
[ 1412.341293] Code: ff fe fe fe fe fe fe fe 49 bb 80 80 80 80 80 80 80 80 4c 0f af c0 eb ae 5b c3 40 0f b6 f6 48 89 f8 a8 07 74 14 48 85 d2 74 7c <0f> b6 08 39 f1 74 3f 48 ff c0 48 ff ca eb e8 48 85 d2 74 68 0f b6
[11363.172526] mlx4_en: fib1: Link Down
[11363.172898] br0: port 1(fib1) entered disabled state
[11432.424962] mlx4_en: fib1: Link Up
[11432.426525] br0: port 1(fib1) entered blocking state
[11432.426540] br0: port 1(fib1) entered forwarding state
[29611.854959] mlx4_core 0000:04:00.0: VPD access failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update
[29726.325641] radvd[9015]: segfault at ffffffffc9763f7c ip 00007fa38a432f2e sp 00007ffdc9763628 error 5 in ld-musl-x86_64.so.1[7fa38a3f8000+48000]
[29726.325678] Code: ff fe fe fe fe fe fe fe 49 bb 80 80 80 80 80 80 80 80 4c 0f af c0 eb ae 5b c3 40 0f b6 f6 48 89 f8 a8 07 74 14 48 85 d2 74 7c <0f> b6 08 39 f1 74 3f 48 ff c0 48 ff ca eb e8 48 85 d2 74 68 0f b6
[30199.993401] IPv4: martian source 255.255.255.255 from 192.168.88.1, on dev br0
[30199.993422] ll header: 00000000: ff ff ff ff ff ff 2c c8 1b a9 45 6d 08 00
[30224.546276] mlx4_core 0000:04:00.0: VPD access failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update
[30237.972864] radvd[9605]: segfault at ffffffffdb22d75c ip 00007f2145672f2e sp 00007ffddb22ce08 error 5 in ld-musl-x86_64.so.1[7f2145638000+48000]
[30237.972900] Code: ff fe fe fe fe fe fe fe 49 bb 80 80 80 80 80 80 80 80 4c 0f af c0 eb ae 5b c3 40 0f b6 f6 48 89 f8 a8 07 74 14 48 85 d2 74 7c <0f> b6 08 39 f1 74 3f 48 ff c0 48 ff ca eb e8 48 85 d2 74 68 0f b6

nopeno commented 2 years ago

Just now, it crashed again. I found these in dmesg:

[48965.408094] radvd[10207]: segfault at fffffffffd7f8018 ip 00007fdac539e5fc sp 00007ffdfd7f7f78 error 5 in ld-musl-x86_64.so.1[7fdac5363000+48000]
[48965.408115] Code: 48 09 c8 4c 85 c8 75 0d 49 83 c4 08 eb d4 39 f0 74 0c 49 ff c4 41 0f b6 04 24 84 c0 75 f0 4c 89 e0 41 5c c3 31 c9 0f b6 04 0f <0f> b6 14 0e 38 d0 75 07 48 ff c1 84 c0 75 ed 29 d0 c3 41 54 49 89
[48965.409217] br0: port 2(tap0) entered blocking state
[48965.409223] br0: port 2(tap0) entered disabled state
[48965.409377] device tap0 entered promiscuous mode
[48965.409694] br0: port 2(tap0) entered blocking state
[48965.409698] br0: port 2(tap0) entered forwarding state

It seems that radvd crashes when I change the bridge settings.
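
To compare crashes across runs, it can help to convert each kernel "segfault at" line into an offset inside ld-musl, since the absolute addresses change with every start. The following is a rough sketch; the function name `segv_offset` is made up for illustration, and it assumes the line has the same shape as the dmesg entries quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the offset of the faulting instruction inside
# the mapped object, given one kernel "segfault at ..." line.
segv_offset() {
    line=$1
    # Instruction pointer: the hex value following " ip ".
    ip=$(printf '%s\n' "$line" | sed -n 's/.* ip \([0-9a-f]*\) .*/\1/p')
    # Load base: the hex value between "[" and "+" at the end of the line.
    base=$(printf '%s\n' "$line" | sed -n 's/.*\[\([0-9a-f]*\)+[0-9a-f]*]$/\1/p')
    # Bail out if the line did not match the expected shape.
    [ -n "$ip" ] && [ -n "$base" ] || return 1
    printf '0x%x\n' $(( 0x$ip - 0x$base ))
}

segv_offset '[ 66.521759] radvd[3472]: segfault at 4420dbfc ip 00007f3d36818f2e sp 00007ffd4420d2a8 error 4 in ld-musl-x86_64.so.1[7f3d367de000+48000]'
# prints 0x3af2e
```

Identical offsets across several crashes suggest the same libc function is faulting each time, while a different offset points at a different function.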

PaulosV commented 2 years ago

I'd say it's more general and it crashes whenever there is a change in network interfaces (adding, removing, changing settings...). One time I was removing some interfaces on the side, doing nothing to our bridges and yet, radvd still crashed.

robbat2 commented 2 years ago

@PaulosV @nopeno were you using the latest master, or what specific version?

PaulosV commented 2 years ago

In my case, Alpine Linux v3.15 with radvd version 2.19. I will attempt running with master.

PaulosV commented 2 years ago

Running with master seems to handle things fine.

Version 2.19 compiled from source crashes as well. With the 2.19 version from the packaging system, I now have a command and a config file that reliably crash radvd:

/usr/sbin/radvd -C /etc/radvd.conf -p /run/radvd/radvd.pid -u radvd -d 3 -n

Debug levels 2 and above trigger the crash.

This is the minimal file that triggers the crash:

interface br_lan.10 {
    AdvSendAdvert on;
    prefix fd54:2e24:1f9b:a::/64 {
    };
};
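
For anyone wanting to repeat this on another machine, the reproduction can be wrapped in a small script. This is only a sketch: the function name `min_conf` and the `RUN_REPRO` and `IFACE` variables are invented here, and the radvd call uses just the -C, -d and -n flags already quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: emit the minimal radvd.conf from this report for a
# given interface name (defaults to the one in the original report).
min_conf() {
    iface=${1:-br_lan.10}
    cat <<EOF
interface ${iface} {
    AdvSendAdvert on;
    prefix fd54:2e24:1f9b:a::/64 {
    };
};
EOF
}

# Attempt the reproduction only when explicitly requested and radvd exists.
if [ "${RUN_REPRO:-0}" = 1 ] && command -v radvd >/dev/null 2>&1; then
    conf=$(mktemp)
    min_conf "${IFACE:-br_lan.10}" > "$conf"
    # Foreground (-n) with debug level 3 (-d 3), as in the command above.
    radvd -C "$conf" -d 3 -n
fi
```

Running it as `RUN_REPRO=1 IFACE=eth0 sh repro.sh` (the script name is arbitrary) would exercise the same path with a different interface.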

robbat2 commented 2 years ago

@PaulosV thanks for that. I don't see why that config should crash on v2.19 and not in the latest master. Most of the changes in there were build systems or new features.

If you use a dummy interface on Alpine, does it also crash? Or is it some interaction between musl and VLANs or bridges? A couple of the configs in this thread had bridges, which makes me wonder whether, for example, the bridge being in a non-forwarding state due to STP plays a role.

If you can spare the time to run git bisect between v2.19 & master, that would be hugely appreciated, bonus if you know your way around gdb.
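
Because the goal here is to find the commit that fixed the crash, the usual good/bad bisect terms are inverted. A possible way to drive this is sketched below; the term names follow the example in the git-bisect documentation, while the script name `bisect-step.sh`, the `timeout` wrapper, and the mapping of exit statuses are assumptions that may need adjusting on Alpine.

```shell
#!/bin/sh
# Sketch only. Intended use (not executed here):
#
#   git bisect start --term-old broken --term-new fixed
#   git bisect broken v2.19
#   git bisect fixed master
#   git bisect run ./bisect-step.sh
#
# bisect-step.sh (hypothetical) would build radvd, run it briefly in the
# foreground, for example
#   timeout 5 ./radvd -C /etc/radvd.conf -n -d 3
# and then exit with the code printed by: bisect_code "$?"

# Translate the exit status of that short run into a `git bisect run` code:
#   0   -> commit shows the old (broken) behaviour
#   1   -> commit shows the new (fixed) behaviour
#   125 -> cannot tell, ask git bisect to skip this commit
bisect_code() {
    case $1 in
        139)   echo 0 ;;    # 128 + SIGSEGV (11): crash reproduced
        0|124) echo 1 ;;    # clean exit, or still running when timeout fired
        *)     echo 125 ;;  # build failure or unrelated error
    esac
}
```

The short timeout relies on the crash happening during startup when a high debug level is set; if it only appears later, the run time would need to be longer.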

Mostly I think this builds confidence to say we're good to have a v2.20 release soon.

nopeno commented 2 years ago

Alpine Linux v3.15 with radvd version 2.19.

Me too.

PaulosV commented 2 years ago

@robbat2 I'll try to do the bisect later. I'm not very comfortable in gdb but I can probably do a core dump or extract some vars/registers if needed. I'll also try the dummy interfaces.

Also, I should have probably been clearer - when running in the foreground with high enough debug level (-d2 or -d3), radvd did not need any further convincing and crashed (SIGSEGV) instantly during startup.

robbat2 commented 2 years ago

@PaulosV After you bisect to narrow it down, here's the easy way to drive gdb to convert the core to a backtrace:

gdb-trace.sh:

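#!/bin/sh
# Usage: ./gdb-trace.sh <executable> <core-file>
exe=$1
core=$2

# Run gdb non-interactively and print a full backtrace for every thread.
gdb "${exe}" \
        --core "${core}" \
        --batch \
        --quiet \
        -ex "thread apply all bt full" \
        -ex "quit"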

tee the output to a file, and it'll be good enough.

PaulosV commented 2 years ago

OK, so the issue was fixed by commit 06689f8c06f44c7e87f7ff1d814428f88375b53f (issue #158, PR #161). This time I tested inside an LXC container (images:alpine/3.15) that had an eth0 interface and no bridge involved in the system. I was unable to reproduce the crash with lo.

bash-5.1# ./radvd -n -d3                                                 
[Apr 05 21:31:29] radvd (4750): version 2.19 started
[Apr 05 21:31:29] radvd (4750): config file, /etc/radvd.conf, syntax ok
[Apr 05 21:31:29] radvd (4750): IPv6 forwarding setting is: 0, should be 1 or 2
[Apr 05 21:31:29] radvd (4750): IPv6 forwarding seems to be disabled, but continuing anyway
[Apr 05 21:31:29] radvd (4750): radvd startup PID is 4750
[Apr 05 21:31:29] radvd (4750): radvd PID is 4750
[Apr 05 21:31:29] radvd (4750): initializing privsep
[Apr 05 21:31:29] radvd (4750): radvd privsep PID is 4751
[Apr 05 21:31:29] radvd (4750): eth0 mtu: 1500
[Apr 05 21:31:29] radvd (4750): eth0 hardware type: ARPHRD_ETHER
[Apr 05 21:31:29] radvd (4750): eth0 hardware address: 00:16:3e:5e:83:6f
[Apr 05 21:31:29] radvd (4750): eth0 link layer token length: 48
[Apr 05 21:31:29] radvd (4750): eth0 prefix length: 64
[Apr 05 21:31:29] radvd (4750): IPv6 forwarding on interface seems to be disabled, but continuing anyway
[Apr 05 21:31:29] radvd (4750): polling for 16 second(s), next iface is eth0
[Apr 05 21:31:29] radvd (4751): Freeing Interfaces
[Apr 05 21:31:29] radvd (4751): Exiting, privsep_read_loop had readn return 0 bytes
[Apr 05 21:31:29] radvd (4751): Exiting, privsep_read_loop is complete.
Segmentation fault (core dumped)

The backtrace from tag v2.19: radvd-bt-2.19.log

johnkisch commented 2 years ago

Apologies for not getting back to this - life got in the way, etc.

I cloned latest on April 11th and rolled it into an apk package and installed. After giving a week of bake time, radvd has successfully started at boot time as expected every time I've tested. I think this issue has been resolved at this point. I think it would be helpful if a new release was cut so that distro maintainers can update their packages.

PaulosV commented 2 years ago

That is good to hear. I think, because the issue has quite a big impact and makes radvd downright unusable in some circumstances, it might make sense to backport that specific patch (06689f8c06f44c7e87f7ff1d814428f88375b53f) for Alpine and include it in the aports, so they can rebuild the package with the fix.

johnkisch commented 2 years ago

Here's the MR in the aports repo for this:

https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/33358

stappersg commented 2 years ago

I saw https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/33358/diffs and understand the request: please do more than just git releases.

robbat2 commented 11 months ago

@johnkisch can you please confirm 2.20 rc resolves the issue for you?