moby/moby

Swarm memory issues after lots of one-off containers #35261

Closed: fasmide closed this issue 6 years ago

fasmide commented 7 years ago

Description

I'm creating this issue because we are seeing weird behavior with Docker in swarm mode, and I'm finally able to reproduce it, so I thought I'd share my findings. I do realize this may have nothing to do with Docker, but please bear with me...

Run Docker in swarm mode with, say, 6 nodes (3 managers, 3 workers) and an overlay network that's encrypted and attachable.

Run LOTS of one-off, short-lived containers that attach to the overlay network and do whatever task.

Wait a few hours and the overlay network begins to break down (some containers cannot reach each other, some containers miss DNS records for services, running swarm services suddenly lose the ability to talk to other services... it seems somewhat random).

Wait a few more minutes and syslog will be full of swarm nodes leaving and joining again, plus lots of "cannot allocate memory" messages from failed iptables commands, exec docker-init, exec docker-runc, and so on...

This makes me think it's a Docker memory leak, but I'm not sure whether it's Docker, the kernel, or something else.

Steps to reproduce the issue:

  1. Buy six Ubuntu 16.04.3 machines with 1 GB of memory somewhere; swarm init and join them.
  2. Create an overlay network: docker network create --driver overlay --attachable --subnet 10.51.192.0/20 --opt encrypted my-overlay.
  3. Run whatever hello-world test image, just to have something to probe: docker service create --name web --network my-overlay --mode=global --publish 80:80 strm/helloworld-http.
  4. On every machine, run lots of short-lived containers requesting the web service: while true; do docker run --rm --network my-overlay byrnedo/alpine-curl -s http://web/ || break; sleep 0.1; done.

Describe the results you received:

With the above setup the swarm will break down in about 6 hours. And by "break down" I mean:

A few samples:

Oct 20 06:09:05 manager03 dockerd[16361]: time="2017-10-20T08:09:05.480969667+02:00" level=warning msg="failed to retrieve docker-runc version: fork/exec /usr/bin/docker-runc: cannot allocate memory"
Oct 20 06:09:05 manager03 dockerd[16361]: time="2017-10-20T08:09:05.482457668+02:00" level=warning msg="failed to retrieve docker-init version: fork/exec /usr/bin/docker-init: cannot allocate memory"
Oct 20 06:06:45 manager03 dockerd[16361]: time="2017-10-20T08:06:45.714209213+02:00" level=error msg="Failed to add firewall mark rule in sbox fd0c630 (1b71abf): reexec failed: fork/exec /proc/self/exe: cannot allocate memory"
Oct 20 06:06:45 manager03 dockerd[16361]: time="2017-10-20T08:06:45.692902514+02:00" level=warning msg="Failed to disable IPv6 on all interfaces on network namespace \"/var/run/docker/netns/fd0c630ec6c6\": reexec to set IPv6 failed: fork/exec /proc/self/exe: cannot allocate memory"
Oct 20 06:06:45 manager03 dockerd[16361]: time="2017-10-20T08:06:45.696093341+02:00" level=error msg="Resolver Start failed for container 1b71abfe97440e56b29b7fae7ae2458eef01f3a68503fedc5a390cd70e4537da, \"setting up IP table rules failed: reexec failed: fork/exec /proc/self/exe: cannot allocate memory\""
Oct 20 06:07:32 manager03 dockerd[16361]: time="2017-10-20T08:07:32.131978009+02:00" level=error msg="could not add input rule:  (iptables failed: iptables --wait -t filter -A INPUT -m policy --dir in --pol ipsec -p udp --dport 4789 -m u32 --u32 0>>22&0x3C@12&0xFFFFFF00=1048832 -j ACCEPT:  (fork/exec /sbin/iptables: cannot allocate memory)). Please do it manually."
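Those fork/exec failures line up with the daemon's own footprint: fork(2) has to be able to commit a copy of the parent's address space, and on a swapless 1 GB box where dockerd holds 60-70% of memory the kernel can refuse that, even though the child would exec a tiny binary immediately afterwards. A minimal Go sketch of that failure mode (a hypothetical standalone demo, not taken from this setup; whether the fork actually fails depends on the kernel's overcommit settings and how tight memory really is):

    package main

    import (
        "fmt"
        "os/exec"
        "runtime"
    )

    func main() {
        // Pin ~700 MiB so this process dominates a 1 GiB machine with no
        // swap, roughly like dockerd at 60-70% of available memory.
        ballast := make([]byte, 700<<20)
        for i := range ballast {
            ballast[i] = 1 // touch every page so it is actually resident
        }

        // os/exec performs fork/exec under the hood -- the same pair that
        // fails in the logs above ("fork/exec /sbin/iptables: cannot
        // allocate memory"). Under memory pressure the fork itself can be
        // refused with ENOMEM even though /bin/true is tiny.
        if err := exec.Command("/bin/true").Run(); err != nil {
            fmt.Println(err) // e.g. "fork/exec /bin/true: cannot allocate memory"
        }

        runtime.KeepAlive(ballast) // keep the ballast live until after the fork
    }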

Describe the results you expected:

I did not expect it to break down. One may ask: why the **** would you create new containers all the time? Well, as it happens we do have a few swarms running, and these swarms are monitored with Nagios doing these "docker run --network blarh" checks to ensure services are running and responding with something useful. I do realize Nagios doesn't check services every 100 ms, but rather every 5 minutes or so. We are able to keep our production swarm running for about two months before it breaks down :)

Additional information you deem important (e.g. issue happens only occasionally):

All nodes are Ubuntu 16.04.3 LTS

I have attached 3 heap profile PDFs taken from the swarm leader: one from the beginning, one a few hours into the run, and one from the next day when the swarm was unstable. (I'm not that schooled in reading these, but I cannot spot anything indicating a memory leak in dockerd. However, OS tools such as top and ps reveal dockerd using 60-70% of the available memory, on all swarm nodes.)

https://www.dropbox.com/s/4u8nzsezuxy1mbv/miniswarm_hour_0.pdf?dl=0
https://www.dropbox.com/s/qmat8rlhcoakbyt/miniswarm_evening.pdf?dl=0
https://www.dropbox.com/s/kknrzym1dbkby4k/miniswarm_next_day.pdf?dl=0

Output of docker version:

Client:
 Version:      17.09.0-ce
 API version:  1.32
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:42:18 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.0-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:40:56 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 2
 Running: 1
 Paused: 0
 Stopped: 1
Images: 2
Server Version: 17.09.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: v7t7nn8ltqfpjrrioniubdkl0
 Is Manager: false
 Node Address: 138.197.186.183
 Manager Addresses:
  138.197.177.236:2377
  138.197.177.42:2377
  138.197.186.231:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-93-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 992.3MiB
Name: worker03.microswarm
ID: KVRR:NOW3:7YWC:4TZP:3YDT:SKAI:375T:EUDN:MW2M:55KP:SHCW:Z2LG
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.): My testing has been with droplets from DigitalOcean in their German location.

Note: these machines have no swap.

fasmide commented 7 years ago

I've just tried out a few other combinations:

Same results as the above issue

thaJeztah commented 7 years ago

ping @cpuguy83 ptal

thaJeztah commented 7 years ago

ping @fcrisciani as well, as you were profiling things recently if I'm not mistaken

rverpillot commented 6 years ago

Same issue here.

Memory keeps growing, and thousands of threads belong to the Docker daemon. I dumped a stack trace and noticed thousands of goroutines like this:

syscall.Syscall6(0x2d, 0x29, 0xc42115d000, 0x1000, 0x0, 0xc420618ca8, 0xc420618c9c, 0x28, 0x1000, 0x1)
        /usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
syscall.recvfrom(0x29, 0xc42115d000, 0x1000, 0x1000, 0x0, 0xc420618ca8, 0xc420618c9c, 0x0, 0x1, 0xc4201fe800)
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:1712 +0x99
syscall.Recvfrom(0x29, 0xc42115d000, 0x1000, 0x1000, 0x0, 0x1000, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/syscall/syscall_unix.go:252 +0xaf
github.com/docker/docker/vendor/github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0xc42091ac90, 0xc421127e90, 0x1, 0x1, 0x0, 0x0)
        /go/src/github.com/docker/docker/vendor/github.com/vishvananda/netlink/nl/nl_linux.go:613 +0x9b
github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay.(*network).watchMiss(0xc42092b7c0, 0xc42091ac90)
        /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay/ov_network.go:718 +0x6c
created by github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay.(*network).initSandbox
        /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay/ov_network.go:706 +0x581

So the daemon launches a new watchMiss() goroutine each time a container attached to an overlay network is started, but the exit condition is never reached, because closing a socket does not interrupt a blocked recvfrom() call. You have to set up the socket with a receive timeout (or use select()).
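That kernel behavior is easy to demonstrate outside of Docker. A hypothetical standalone Go program (my own illustration using golang.org/x/sys/unix, not libnetwork code):

    package main

    import (
        "fmt"
        "time"

        "golang.org/x/sys/unix"
    )

    func main() {
        // A netlink socket subscribed to neighbor-table events, like the
        // one watchMiss() reads from.
        fd, err := unix.Socket(unix.AF_NETLINK, unix.SOCK_RAW, unix.NETLINK_ROUTE)
        if err != nil {
            panic(err)
        }
        sa := &unix.SockaddrNetlink{Family: unix.AF_NETLINK, Groups: unix.RTMGRP_NEIGH}
        if err := unix.Bind(fd, sa); err != nil {
            panic(err)
        }

        done := make(chan struct{})
        go func() {
            buf := make([]byte, 4096)
            // Blocks in recvfrom(2), just like NetlinkSocket.Receive().
            n, _, err := unix.Recvfrom(fd, buf, 0)
            fmt.Println("recvfrom returned:", n, err)
            close(done)
        }()

        time.Sleep(time.Second)
        unix.Close(fd) // closing the fd does NOT wake the blocked goroutine

        select {
        case <-done:
            fmt.Println("goroutine exited")
        case <-time.After(3 * time.Second):
            fmt.Println("goroutine still blocked after close -- this is the leak")
        }
    }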

I temporarily fixed it by taking the SetReceiveTimeout() method from nl_linux.go in the current version of the github.com/vishvananda/netlink package and calling it from ov_network.go.
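The upstream method boils down to a SO_RCVTIMEO setsockopt (upstream it is roughly unix.SetsockoptTimeval(int(s.fd), unix.SOL_SOCKET, unix.SO_RCVTIMEO, timeout)). Here is a runnable distillation of the pattern, continuing the demo above (again a hypothetical sketch; the real fix lives in nl_linux.go and ov_network.go, and the "stopped" flag stands in for the real teardown check):

    package main

    import (
        "fmt"
        "sync/atomic"
        "time"

        "golang.org/x/sys/unix"
    )

    func main() {
        // Same netlink socket as in the previous sketch.
        fd, err := unix.Socket(unix.AF_NETLINK, unix.SOCK_RAW, unix.NETLINK_ROUTE)
        if err != nil {
            panic(err)
        }
        sa := &unix.SockaddrNetlink{Family: unix.AF_NETLINK, Groups: unix.RTMGRP_NEIGH}
        if err := unix.Bind(fd, sa); err != nil {
            panic(err)
        }

        // The essence of SetReceiveTimeout(): with SO_RCVTIMEO set, a
        // blocked recvfrom(2) returns EAGAIN when the timeout expires
        // instead of sleeping forever.
        tv := unix.NsecToTimeval((3 * time.Second).Nanoseconds())
        if err := unix.SetsockoptTimeval(fd, unix.SOL_SOCKET, unix.SO_RCVTIMEO, &tv); err != nil {
            panic(err)
        }

        var stopped int32 // set when the "sandbox" is torn down

        done := make(chan struct{})
        go func() {
            defer close(done)
            buf := make([]byte, 4096)
            for {
                n, _, err := unix.Recvfrom(fd, buf, 0)
                if err != nil {
                    if atomic.LoadInt32(&stopped) == 1 {
                        return // socket owner went away: exit, don't leak
                    }
                    if err == unix.EAGAIN {
                        continue // timeout expired, nothing to read
                    }
                    continue // transient error; keep watching
                }
                fmt.Println("neighbor-table message,", n, "bytes")
            }
        }()

        time.Sleep(time.Second)
        atomic.StoreInt32(&stopped, 1)
        unix.Close(fd)

        <-done // the goroutine now exits within one timeout period
        fmt.Println("watcher exited cleanly")
    }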

fcrisciani commented 6 years ago

@rverpillot this is the fix: https://github.com/docker/libnetwork/pull/1976

thaJeztah commented 6 years ago

https://github.com/docker/libnetwork/pull/1976 was vendored in this repo through https://github.com/moby/moby/pull/35677, and is part of Docker 17.12 and up (also included in Docker EE 17.06.2-ee-7)