I've just tried out a few other combinations:
Same results as the above issue.
ping @cpuguy83 ptal
ping @fcrisciani as well; you were profiling things recently, if I'm not mistaken
Same issue here.
Memory keeps growing and thousands of threads belong to the docker daemon. I dumped a stack trace and noticed thousands of goroutines like this:
syscall.Syscall6(0x2d, 0x29, 0xc42115d000, 0x1000, 0x0, 0xc420618ca8, 0xc420618c9c, 0x28, 0x1000, 0x1)
/usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
syscall.recvfrom(0x29, 0xc42115d000, 0x1000, 0x1000, 0x0, 0xc420618ca8, 0xc420618c9c, 0x0, 0x1, 0xc4201fe800)
/usr/local/go/src/syscall/zsyscall_linux_amd64.go:1712 +0x99
syscall.Recvfrom(0x29, 0xc42115d000, 0x1000, 0x1000, 0x0, 0x1000, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/syscall/syscall_unix.go:252 +0xaf
github.com/docker/docker/vendor/github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0xc42091ac90, 0xc421127e90, 0x1, 0x1, 0x0, 0x0)
/go/src/github.com/docker/docker/vendor/github.com/vishvananda/netlink/nl/nl_linux.go:613 +0x9b
github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay.(*network).watchMiss(0xc42092b7c0, 0xc42091ac90)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay/ov_network.go:718 +0x6c
created by github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay.(*network).initSandbox
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay/ov_network.go:706 +0x581
So the daemon launches a new watchMiss() goroutine each time a container attached to an overlay network is started, but the exit condition is never reached because closing the socket does not interrupt the call to recvfrom(). You have to set up the socket with a receive timeout (or use select()).
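To illustrate the idea, here is a stand-alone sketch (not libnetwork's actual code; the netlink group, timeout value, and loop structure are just examples): with SO_RCVTIMEO set on the socket, the blocking Recvfrom wakes up periodically with EAGAIN, so the reading goroutine gets a chance to notice that it should exit.

```go
package main

import (
	"log"
	"time"

	"golang.org/x/sys/unix"
)

// watchNeighMisses is a stand-in for a watchMiss()-style loop: it reads
// neighbor-table events from a netlink socket until told to stop.
func watchNeighMisses(stop <-chan struct{}) error {
	fd, err := unix.Socket(unix.AF_NETLINK, unix.SOCK_RAW, unix.NETLINK_ROUTE)
	if err != nil {
		return err
	}
	defer unix.Close(fd)

	// Subscribe to neighbor (ARP/ND) events, roughly what watchMiss() consumes.
	if err := unix.Bind(fd, &unix.SockaddrNetlink{
		Family: unix.AF_NETLINK,
		Groups: unix.RTMGRP_NEIGH,
	}); err != nil {
		return err
	}

	// The crucial part: a receive timeout. Without it, Recvfrom blocks
	// forever, and closing the fd from another goroutine does not wake it up.
	tv := unix.NsecToTimeval(int64(3 * time.Second))
	if err := unix.SetsockoptTimeval(fd, unix.SOL_SOCKET, unix.SO_RCVTIMEO, &tv); err != nil {
		return err
	}

	buf := make([]byte, 4096)
	for {
		select {
		case <-stop:
			return nil // the exit condition is now actually reachable
		default:
		}
		n, _, err := unix.Recvfrom(fd, buf, 0)
		if err == unix.EAGAIN || err == unix.EWOULDBLOCK {
			continue // timeout expired; loop around and re-check stop
		}
		if err != nil {
			return err
		}
		log.Printf("got %d bytes of netlink neighbor messages", n)
	}
}

func main() {
	stop := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Second)
		close(stop) // with the timeout in place, the loop exits shortly after this
	}()
	if err := watchNeighMisses(stop); err != nil {
		log.Fatal(err)
	}
}
```

This is the same approach the next comment describes taking with SetReceiveTimeout().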
I temporarily fixed it by taking the SetReceiveTimeout() method from nl_linux.go in the current version of the github.com/vishvananda/netlink package and calling it from ov_network.go.
@rverpillot this is the fix: https://github.com/docker/libnetwork/pull/1976
https://github.com/docker/libnetwork/pull/1976 was vendored in this repo through https://github.com/moby/moby/pull/35677, and is part of Docker 17.12 and up (also included in Docker EE 17.06.2-ee-7)
Description
I'm creating this issue because we are seeing weird behavior with docker in swarm mode and I'm finally able to reproduce it, so I thought I'd share my findings. I do realize this may have nothing to do with docker, but please bear with me...
Run docker in swarm mode with, say, 6 nodes (3 managers, 3 workers) and an overlay network that's encrypted and attachable.
Run LOTS of one-off, short-lived containers that attach to the overlay network and do whatever task.
Wait a few hours and the overlay network begins to break down (some containers cannot reach each other, some containers miss DNS records for services, running swarm services suddenly lose the ability to talk to other services... it seems somewhat random).
Wait a few more minutes and syslog will be full of swarm nodes leaving and joining again, and lots of "cannot allocate memory" messages from failed iptables commands, exec docker-init, exec docker-runc and so on...
This makes me think it's a docker memory leak, but I'm not sure if it's docker, the kernel, or something else.
Steps to reproduce the issue:
1. docker network create --driver overlay --attachable --subnet 10.51.192.0/20 --opt encrypted my-overlay
2. docker service create --name web --network my-overlay --mode=global --publish 80:80 strm/helloworld-http
3. while true; do docker run --rm --network my-overlay byrnedo/alpine-curl -s http://web/ || break; sleep 0.1; done
Describe the results you received:
With the above setup the swarm will break down in about 6 hours. And by breakdown I mean:
A few samples:
Describe the results you expected:
I did not expect it to break down. One may ask: why the **** would you create new containers all the time? Well, as it happens we do have a few swarms running, and these swarms are monitored with Nagios doing these "docker run --network blarh" checks to ensure services are running and responding with something useful. I do realize Nagios doesn't check services every 100 ms, but rather every 5 minutes or so. We are able to make our production swarm run for about two months before it breaks down :)
Additional information you deem important (e.g. issue happens only occasionally):
All nodes are Ubuntu 16.04.3 LTS.
I have attached 3 heap PDF dumps taken from the swarm leader: one from the beginning, one after a few hours of run time, and one from the next day when the swarm was unstable. (I'm not that schooled in reading these, but I cannot spot anything indicating a memory leak in dockerd; however, OS tools such as top and ps reveal dockerd using 60-70% of the available memory, on all swarm nodes.)
https://www.dropbox.com/s/4u8nzsezuxy1mbv/miniswarm_hour_0.pdf?dl=0
https://www.dropbox.com/s/qmat8rlhcoakbyt/miniswarm_evening.pdf?dl=0
https://www.dropbox.com/s/kknrzym1dbkby4k/miniswarm_next_day.pdf?dl=0
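For anyone who wants to collect the same kind of data, here is a minimal sketch of pulling a goroutine dump from the daemon over its API socket. It assumes dockerd is running with debug enabled (which, as far as I know, exposes Go's pprof handlers on the API socket); the socket path, URL host, and output file name are just placeholders.

```go
package main

import (
	"context"
	"io"
	"log"
	"net"
	"net/http"
	"os"
)

func main() {
	// Talk HTTP to the daemon over its unix socket.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
			},
		},
	}

	// debug=1 aggregates identical stacks with a count, which makes a few
	// thousand copies of the same watchMiss goroutine easy to spot.
	// The host part of the URL is ignored because of the custom dialer.
	resp, err := client.Get("http://docker/debug/pprof/goroutine?debug=1")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("dockerd-goroutines.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote dockerd-goroutines.txt")
}
```

The heap profile at /debug/pprof/heap can be saved the same way and then rendered with go tool pprof, which is typically how graphs like the attached PDFs are produced.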
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.):
My testing has been with droplets from DigitalOcean in their German location.
Note that these machines have no swap.