CRCinAU opened this issue 2 years ago
Looking further into this, it seems like NetworkManager was taking control of docker0 and the associated interfaces, causing Docker to crash.
The fix is to add the following to `/etc/NetworkManager/NetworkManager.conf`:

```ini
[keyfile]
unmanaged-devices=interface-name:docker*;interface-name:br-*;interface-name:vmnet*;interface-name:vboxnet*;interface-name:veth*
```
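After editing, NetworkManager needs to pick up the change - e.g. via `systemctl reload NetworkManager` - before the unmanaged-devices setting takes effect.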
While this resolves the crash, Docker really should handle this situation gracefully rather than crashing outright.
EDIT: This was an incorrect diagnosis due to a number of factors in testing. See the comment below for the full problem.
It seems that changing the NetworkManager config doesn't actually fix this issue. The key to reproducing it seems to be:
1) Start a set of containers via docker-compose
2) Stop the same set of containers via docker-compose
3) Restart Docker via `systemctl restart docker`
From that point on, `dockerd` will always crash.
Running `dockerd -D` gives the following:
```
# dockerd -D
INFO[2021-11-19T06:22:05.568776286Z] Starting up
DEBU[2021-11-19T06:22:05.569679285Z] Listener created for HTTP on unix (/var/run/docker.sock)
INFO[2021-11-19T06:22:05.570408035Z] detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf
DEBU[2021-11-19T06:22:05.571308118Z] Golang's threads limit set to 13950
INFO[2021-11-19T06:22:05.572328951Z] parsed scheme: "unix" module=grpc
INFO[2021-11-19T06:22:05.572403034Z] scheme "unix" not registered, fallback to default scheme module=grpc
INFO[2021-11-19T06:22:05.572507743Z] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>} module=grpc
INFO[2021-11-19T06:22:05.572560243Z] ClientConn switching balancer to "pick_first" module=grpc
DEBU[2021-11-19T06:22:05.572404784Z] metrics API listening on /var/run/docker/metrics.sock
INFO[2021-11-19T06:22:05.576421324Z] parsed scheme: "unix" module=grpc
INFO[2021-11-19T06:22:05.576517283Z] scheme "unix" not registered, fallback to default scheme module=grpc
INFO[2021-11-19T06:22:05.576597199Z] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>} module=grpc
INFO[2021-11-19T06:22:05.576639783Z] ClientConn switching balancer to "pick_first" module=grpc
DEBU[2021-11-19T06:22:05.578702282Z] Using default logging driver journald
DEBU[2021-11-19T06:22:05.578851907Z] [graphdriver] priority list: [btrfs zfs overlay2 fuse-overlayfs aufs overlay devicemapper vfs]
DEBU[2021-11-19T06:22:05.579263906Z] processing event stream module=libcontainerd namespace=plugins.moby
DEBU[2021-11-19T06:22:05.589925319Z] backingFs=extfs, projectQuotaSupported=false, indexOff="", userxattr="" storage-driver=overlay2
INFO[2021-11-19T06:22:05.589989777Z] [graphdriver] using prior storage driver: overlay2
DEBU[2021-11-19T06:22:05.590016319Z] Initialized graph driver overlay2
DEBU[2021-11-19T06:22:05.590264527Z] No quota support for local volumes in /userdata/docker/volumes: Filesystem does not support, or has not enabled quotas
DEBU[2021-11-19T06:22:05.594526650Z] Max Concurrent Downloads: 3
DEBU[2021-11-19T06:22:05.594574192Z] Max Concurrent Uploads: 5
DEBU[2021-11-19T06:22:05.594589942Z] Max Download Attempts: 5
INFO[2021-11-19T06:22:05.594636900Z] Loading containers: start.
DEBU[2021-11-19T06:22:05.594764942Z] Option Experimental: false
DEBU[2021-11-19T06:22:05.594793233Z] Option DefaultDriver: bridge
DEBU[2021-11-19T06:22:05.594825900Z] Option DefaultNetwork: bridge
DEBU[2021-11-19T06:22:05.594841650Z] Network Control Plane MTU: 1500
DEBU[2021-11-19T06:22:05.595069025Z] processing event stream module=libcontainerd namespace=moby
DEBU[2021-11-19T06:22:05.607310270Z] /sbin/iptables, [--wait -t filter -C FORWARD -j DOCKER-ISOLATION]
DEBU[2021-11-19T06:22:05.609321019Z] /sbin/iptables, [--wait -t nat -D PREROUTING -m addrtype --dst-type LOCAL -j DOCKER]
DEBU[2021-11-19T06:22:05.611598351Z] /sbin/iptables, [--wait -t nat -D OUTPUT -m addrtype --dst-type LOCAL ! --dst 127.0.0.0/8 -j DOCKER]
DEBU[2021-11-19T06:22:05.613968142Z] /sbin/iptables, [--wait -t nat -D OUTPUT -m addrtype --dst-type LOCAL -j DOCKER]
DEBU[2021-11-19T06:22:05.616226933Z] /sbin/iptables, [--wait -t nat -D PREROUTING]
DEBU[2021-11-19T06:22:05.618132682Z] /sbin/iptables, [--wait -t nat -D OUTPUT]
DEBU[2021-11-19T06:22:05.620018306Z] /sbin/iptables, [--wait -t nat -F DOCKER]
DEBU[2021-11-19T06:22:05.621818180Z] /sbin/iptables, [--wait -t nat -X DOCKER]
DEBU[2021-11-19T06:22:05.623571971Z] /sbin/iptables, [--wait -t filter -F DOCKER]
DEBU[2021-11-19T06:22:05.625395179Z] /sbin/iptables, [--wait -t filter -X DOCKER]
DEBU[2021-11-19T06:22:05.627186303Z] /sbin/iptables, [--wait -t filter -F DOCKER-ISOLATION-STAGE-1]
DEBU[2021-11-19T06:22:05.628969261Z] /sbin/iptables, [--wait -t filter -X DOCKER-ISOLATION-STAGE-1]
DEBU[2021-11-19T06:22:05.630782260Z] /sbin/iptables, [--wait -t filter -F DOCKER-ISOLATION-STAGE-2]
DEBU[2021-11-19T06:22:05.632570176Z] /sbin/iptables, [--wait -t filter -X DOCKER-ISOLATION-STAGE-2]
DEBU[2021-11-19T06:22:05.634374133Z] /sbin/iptables, [--wait -t filter -F DOCKER-ISOLATION]
DEBU[2021-11-19T06:22:05.636136674Z] /sbin/iptables, [--wait -t filter -X DOCKER-ISOLATION]
DEBU[2021-11-19T06:22:05.637961923Z] /sbin/iptables, [--wait -t nat -n -L DOCKER]
DEBU[2021-11-19T06:22:05.639802048Z] /sbin/iptables, [--wait -t nat -N DOCKER]
DEBU[2021-11-19T06:22:05.641635755Z] /sbin/iptables, [--wait -t filter -n -L DOCKER]
DEBU[2021-11-19T06:22:05.643506796Z] /sbin/iptables, [--wait -t filter -n -L DOCKER-ISOLATION-STAGE-1]
DEBU[2021-11-19T06:22:05.645337587Z] /sbin/iptables, [--wait -t filter -n -L DOCKER-ISOLATION-STAGE-2]
DEBU[2021-11-19T06:22:05.647138503Z] /sbin/iptables, [--wait -t filter -N DOCKER-ISOLATION-STAGE-2]
DEBU[2021-11-19T06:22:05.649196210Z] /sbin/iptables, [--wait -t filter -C DOCKER-ISOLATION-STAGE-1 -j RETURN]
DEBU[2021-11-19T06:22:05.651249543Z] /sbin/iptables, [--wait -A DOCKER-ISOLATION-STAGE-1 -j RETURN]
DEBU[2021-11-19T06:22:05.653251542Z] /sbin/iptables, [--wait -t filter -C DOCKER-ISOLATION-STAGE-2 -j RETURN]
DEBU[2021-11-19T06:22:05.655404333Z] /sbin/iptables, [--wait -A DOCKER-ISOLATION-STAGE-2 -j RETURN]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x55912cbd84]
goroutine 1 [running]:
github.com/docker/docker/vendor/github.com/vishvananda/netlink.parseAddr(0x4000b840b4, 0x40, 0x40, 0x0, 0x400073adf4, 0x4, 0x280, 0x0, 0x0, 0x4000b840c8, ...)
/go/src/github.com/docker/docker/vendor/github.com/vishvananda/netlink/addr_linux.go:274 +0x174
github.com/docker/docker/vendor/github.com/vishvananda/netlink.(*Handle).AddrList(0x4000735440, 0x5592984250, 0x40007497a0, 0x2, 0x40007497a0, 0x0, 0x0, 0x4000735300, 0x1)
/go/src/github.com/docker/docker/vendor/github.com/vishvananda/netlink/addr_linux.go:199 +0x1a0
github.com/docker/docker/vendor/github.com/docker/libnetwork/netutils.ElectInterfaceAddresses(0x5591fbfd14, 0x7, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/netutils/utils_linux.go:81 +0xe4
github.com/docker/docker/daemon.initBridgeDriver(0x55929e1188, 0x4000866100, 0x40004b6b00, 0x0, 0x0)
/go/src/github.com/docker/docker/daemon/daemon_unix.go:948 +0x2bc
github.com/docker/docker/daemon.(*Daemon).initNetworkController(0x40004f01e0, 0x40004b6b00, 0x4000a95860, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/docker/docker/daemon/daemon_unix.go:891 +0x2ec
github.com/docker/docker/daemon.(*Daemon).restore(0x40004f01e0, 0x4000170580, 0x400015c000)
/go/src/github.com/docker/docker/daemon/daemon.go:490 +0x3d8
github.com/docker/docker/daemon.NewDaemon(0x55929a5db0, 0x4000170580, 0x40004b6b00, 0x40004a5e90, 0x0, 0x0, 0x0)
/go/src/github.com/docker/docker/daemon/daemon.go:1150 +0x20d8
main.(*DaemonCli).start(0x40004a5260, 0x40000a4780, 0x0, 0x0)
/go/src/github.com/docker/docker/cmd/dockerd/daemon.go:195 +0x588
main.runDaemon(...)
/go/src/github.com/docker/docker/cmd/dockerd/docker_unix.go:13
main.newDaemonCommand.func1(0x400023eb00, 0x40003f0e80, 0x0, 0x1, 0x0, 0x0)
/go/src/github.com/docker/docker/cmd/dockerd/docker.go:34 +0x78
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).execute(0x400023eb00, 0x40001c2010, 0x1, 0x1, 0x400023eb00, 0x40001c2010)
/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:850 +0x320
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x400023eb00, 0x0, 0x0, 0x7)
/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:958 +0x258
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:895
main.main()
/go/src/github.com/docker/docker/cmd/dockerd/docker.go:97 +0x188
```
This seems to be related to: https://github.com/vishvananda/netlink/issues/664
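For context on where it dies: `initBridgeDriver` → `ElectInterfaceAddresses` → `AddrList` asks netlink for a dump of all addresses and parses every record, so even though Docker is only inspecting its own bridge, an address record that omits the IFA_ADDRESS attribute still trips the nil dereference. A minimal sketch of that failure mode and of the nil-guard/fallback that (if I'm reading the linked issue right) the eventual fix applies - the types here are made up for illustration, not the real netlink internals:

```go
// Minimal sketch of the failure mode - these types are made up for
// illustration and are NOT the real netlink internals.
package main

import (
	"fmt"
	"net"
)

// addrAttrs stands in for the parsed rtnetlink attributes of one address
// record. On a point-to-point link the kernel may omit IFA_ADDRESS,
// leaving that field nil, while IFA_LOCAL is still present.
type addrAttrs struct {
	ifaAddress *net.IPNet // nil when the kernel omits IFA_ADDRESS
	ifaLocal   *net.IPNet
}

// parseAddr mirrors the shape of the crashing code path: the pre-fix
// version dereferenced ifaAddress unconditionally, which is the SIGSEGV
// in the trace above. Guarding and falling back to IFA_LOCAL avoids it.
func parseAddr(a addrAttrs) (*net.IPNet, error) {
	if a.ifaAddress == nil {
		if a.ifaLocal == nil {
			return nil, fmt.Errorf("address record carries neither IFA_ADDRESS nor IFA_LOCAL")
		}
		return a.ifaLocal, nil
	}
	return a.ifaAddress, nil
}

func main() {
	// A ppp-style record: local address only, no IFA_ADDRESS.
	ppp := addrAttrs{ifaLocal: &net.IPNet{IP: net.IPv4(10, 64, 64, 64), Mask: net.CIDRMask(32, 32)}}
	addr, err := parseAddr(ppp)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("resolved:", addr)
}
```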
We have a 4G uplink via an mPCIe LTE card, which brings up a ppp0 interface managed by NetworkManager.
We can hit this error by doing:
```
root@faceway:~# systemctl restart docker
(works fine)
root@faceway:~# systemctl restart docker
(works fine)
root@faceway:~# systemctl restart docker
(works fine)
root@faceway:~# nmcli c u 4G
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/3)
(ppp connection is now active)
root@faceway:~# systemctl restart docker
Job for docker.service failed because the control process exited with error code.
See "systemctl status docker.service" and "journalctl -xe" for details.
(dockerd crashes)
```
Thanks for reporting; also looks related / similar to docker/for-linux#1281
Yes, this does look like the same issue.
@thaJeztah - Out of interest, how long does it normally take for an issue like this to go through the process and become available in an updated community package etc?
Just trying to get a grasp of the process / procedures, to help us plan how to work around or fix the effect this issue has on our usage...
It looks like the issue has been fixed in vishvananda/netlink#665, but no new release has been made since then.
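In the meantime, picking up the fix would presumably mean bumping the netlink revision pinned in moby's own vendoring config (vendor.conf on the 20.10 branch) and rebuilding the daemon, rather than waiting for an upstream netlink tag.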
@CRCinAU I tried to set up a pptp connection on a VM to have a real ppp interface, but with no luck; I can't reproduce this issue. As I'm not familiar with this type of interface, I'd probably have to set up a nlmon interface (which requires compiling the appropriate kernel module) and debug how pptpd/pppd create the ppp interface, to hopefully create my own dummy interface that reproduces the original bug.
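For what it's worth, one possible starting point for a synthetic repro is sketched below - a sketch only, since the kernel will most likely still emit IFA_ADDRESS for an address added this way (pppd-created interfaces seem to be what ends up without it), and the interface name and addresses are made up:

```go
package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

func main() {
	// Throwaway dummy link standing in for ppp0 (needs root / CAP_NET_ADMIN).
	la := netlink.NewLinkAttrs()
	la.Name = "pppdummy0"
	link := &netlink.Dummy{LinkAttrs: la}
	if err := netlink.LinkAdd(link); err != nil {
		fmt.Println("LinkAdd:", err)
		return
	}
	defer netlink.LinkDel(link)

	// Add a /32 with an explicit peer, mimicking a point-to-point address.
	addr, err := netlink.ParseAddr("10.64.64.64/32")
	if err != nil {
		fmt.Println("ParseAddr:", err)
		return
	}
	peer, err := netlink.ParseIPNet("10.64.64.65/32")
	if err != nil {
		fmt.Println("ParseIPNet:", err)
		return
	}
	addr.Peer = peer
	if err := netlink.AddrAdd(link, addr); err != nil {
		fmt.Println("AddrAdd:", err)
		return
	}

	// Walk all IPv4 addresses the same way dockerd ends up doing.
	addrs, err := netlink.AddrList(nil, netlink.FAMILY_V4)
	if err != nil {
		fmt.Println("AddrList:", err)
		return
	}
	for _, a := range addrs {
		fmt.Println("-", a)
	}
}
```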
Unfortunately, I believe those debugging steps are required, as the linked netlink PR states:

> It was discovered that this does resolve a potential panic but there is other elements in the code-base that assume IFA_ADDRESS will be present. Maybe a larger fix to remove that assumption is necessary?
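That wider assumption would mean callers of `AddrList` (such as libnetwork's `ElectInterfaceAddresses`) also need to tolerate partially populated results. A hedged sketch of the kind of guard that implies - `filterAddrs` is a hypothetical helper, not actual libnetwork code:

```go
package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

// filterAddrs drops entries whose embedded IPNet is nil - the defensive
// guard callers would need if netlink can hand back partially populated
// Addr values. Hypothetical helper, not actual libnetwork code.
func filterAddrs(addrs []netlink.Addr) []netlink.Addr {
	out := addrs[:0]
	for _, a := range addrs {
		if a.IPNet == nil || a.IPNet.IP == nil {
			continue
		}
		out = append(out, a)
	}
	return out
}

func main() {
	addrs, err := netlink.AddrList(nil, netlink.FAMILY_V4)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, a := range filterAddrs(addrs) {
		fmt.Println("-", a.IPNet)
	}
}
```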
So, I have two questions for you: could you share the output of `ip link show` and `ip addr show`, please?

@akerouanton - It may be possible for me to do a test build - however we use Ubuntu 18.04 on an embedded ARM system, and I'm not sure whether that complicates things.
Ideally, if I can test it, I'd like to get it pushed via the docker.com site to at least give us the option to install updated packages from a repo instead of trying to build / package / maintain it myself...
I'll have to do a bit of hardware mangling - my test unit had the LTE modem removed due to this crash, so I'll have to reinstall it and make sure it works again (even if it does crash Docker) to be able to grab the other info...
Thanks for your investigation @CRCinAU. I have faced the same issue on a similar setup with a USB 4G modem (ARM64, Ubuntu 18.04). It seems that this PR: https://github.com/moby/moby/pull/43718 updated the vendored netlink to a version that includes the proposed fix (https://github.com/vishvananda/netlink/pull/665), but the crash still occurs with Docker version 20.10.18 (the latest version as of today).
Actually, I have just figured out that PR https://github.com/moby/moby/pull/43718 is not included in the 20.10 branch, only in the master and 22.06 branches.
I have tried a simple app with this code:

```go
package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

func main() {
	fmt.Println("Version netlink: 8715fe718dfdf487a919acb6df7da109346bbfd6")

	// List every IPv4 address on the host; passing a nil link asks
	// netlink to dump addresses for all interfaces.
	addrs, err := netlink.AddrList(nil, netlink.FAMILY_V4)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, v := range addrs {
		fmt.Println("-", v)
	}
}
```
And it does not crash, while the same code built against the latest released version of netlink (v1.1.0) crashes just like Docker does.
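For anyone who wants to repeat that comparison, Go modules can pin either revision of the library, e.g. `go get github.com/vishvananda/netlink@8715fe718dfdf487a919acb6df7da109346bbfd6` for the fixed commit above versus `go get github.com/vishvananda/netlink@v1.1.0` for the released version.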
Yeah - I'm a bit disappointed overall that this still doesn't seem to have been addressed within the last year - especially as it's a fatal error, meaning Docker breaks completely.
Description

Describe the results you received:

Describe the results you expected:

Additional information you deem important (e.g. issue happens only occasionally): This seems to be an ongoing issue - and we can only get things running by rebooting the machine... Trying to do a `systemctl restart docker` will continue to crash.

Docker is installed on ARM via:

Output of `docker version`:

Output of `docker info`:

Additional environment details (AWS, VirtualBox, physical, etc.):