moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems
https://mobyproject.org/
Apache License 2.0

Docker Daemon becomes unresponsive #33710

Closed diegito closed 1 year ago

diegito commented 7 years ago

As suggested in #26768, I'm creating a new issue for this. Unfortunately I was not able to reproduce it locally to get more information.

I ran into the blocking issue a couple of times, but I can't seem to reproduce it reliably. docker info and docker version still respond, but the docker CLI becomes unresponsive when I run docker ps. The version is 1.13.1.

Steps to reproduce: I noticed that one of my services was not getting a response from an InfluxDB instance running inside a Docker container, so I went to check that container's logs. I was able to curl the container and get some sort of response. docker ps and other commands worked fine, and I restarted the container successfully with docker restart. I then ran docker logs, but there were so many logs that I decided to ctrl+c and use the docker logs --tail option instead. That hung indefinitely until I ctrl+c'd it as well. The next thing I knew, curling the container no longer worked and not even docker ps was responding. The container I was exploring was the influxdb container. The Docker daemon logs don't show anything interesting.
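
In case it helps others hitting this: a quick way to see whether a container's json-file log has grown very large is to check the log files on disk. A rough sketch, assuming the default json-file driver and the default /var/lib/docker data root:

# List the json-file logs of all containers, largest last
sudo du -h /var/lib/docker/containers/*/*-json.log | sort -h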

Docker info

Containers: 7
 Running: 5
 Paused: 0
 Stopped: 2
Images: 38
Server Version: 1.13.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 140
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1
runc version: 9df8b306d01f59d3a8029be411de015b7304dd8f
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-57-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.31 GiB
Name: sm5
ID: LFC3:5L6T:CP3I:HD3Q:KHEO:2SNM:QZQA:PS7S:I45N:HFQT:UPKN:7HMT
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Docker version

Client:
 Version:      1.13.1
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 06:50:14 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.1
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 06:50:14 2017
 OS/Arch:      linux/amd64
 Experimental: false

service docker status

● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2017-03-30 11:49:13 PDT; 2 months 14 days ago
     Docs: https://docs.docker.com
 Main PID: 849 (dockerd)
    Tasks: 221
   Memory: 5.8G
      CPU: 7h 29min 239ms
   CGroup: /system.slice/docker.service
           ├─  849 /usr/bin/dockerd -H fd://
           ├─  860 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-shim --runtime do
           ├─ 7908 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9301 -container-ip 172.17.0.6 -container-port 9300
           ├─ 7919 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9201 -container-ip 172.17.0.6 -container-port 9200
           ├─ 7925 docker-containerd-shim 93d605a05786020648c3da93119eb71f4be38a23eaa2c8358d5b259f68ebf347 /var/run/docker/libcontainerd/93d605a05786020648c3da93119eb71f4be38a23eaa2c8358d5b259f68ebf347 docker-runc
           ├─15060 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9600 -container-ip 172.17.0.4 -container-port 9600
           ├─15071 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 5044 -container-ip 172.17.0.4 -container-port 5044
           ├─15082 /usr/bin/docker-proxy -proto udp -host-ip 0.0.0.0 -host-port 5044 -container-ip 172.17.0.4 -container-port 5044
           ├─15088 docker-containerd-shim ac9b5d003f31116ac14d1bf49ddff722b155a158f13ab2539e0038372f03ad27 /var/run/docker/libcontainerd/ac9b5d003f31116ac14d1bf49ddff722b155a158f13ab2539e0038372f03ad27 docker-runc
           ├─15415 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 8086 -container-ip 172.17.0.2 -container-port 8086
           ├─15427 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 8083 -container-ip 172.17.0.2 -container-port 8083
           ├─16423 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 5601 -container-ip 172.17.0.5 -container-port 5601
           ├─16428 docker-containerd-shim 3c32277d4698674fa5b42e432eaf21378e3bd2d66abc6260d7a9667a4606bd5b /var/run/docker/libcontainerd/3c32277d4698674fa5b42e432eaf21378e3bd2d66abc6260d7a9667a4606bd5b docker-runc
           ├─26539 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 7002 -container-ip 172.17.0.3 -container-port 3000
           └─26544 docker-containerd-shim 013b16ffd22b7d15fdfaf46abe3283546e561400327d7d44ef29fcd0b2d30d23 /var/run/docker/libcontainerd/013b16ffd22b7d15fdfaf46abe3283546e561400327d7d44ef29fcd0b2d30d23 docker-runc

Jun 03 07:47:33 sm5 dockerd[849]: time="2017-06-03T07:47:33.905126201-07:00" level=warning msg="memberlist: Failed to send gossip to 192.168.51.109:7946: write udp [::]:7946->192.168.51.109:7946: use of closed network connection
Jun 03 07:47:33 sm5 dockerd[849]: time="2017-06-03T07:47:33-07:00" level=info msg="Firewalld running: false"
Jun 03 07:47:34 sm5 dockerd[849]: time="2017-06-03T07:47:34-07:00" level=info msg="Firewalld running: false"
Jun 03 07:47:34 sm5 dockerd[849]: time="2017-06-03T07:47:34-07:00" level=info msg="Firewalld running: false"
Jun 03 07:47:34 sm5 dockerd[849]: time="2017-06-03T07:47:34-07:00" level=info msg="Firewalld running: false"
Jun 11 22:17:14 sm5 dockerd[849]: time="2017-06-11T22:17:14.246266682-07:00" level=error msg="Handler for POST /v1.26/containers/create returned error: No such image: portainer/portainer:latest"
Jun 13 15:36:03 sm5 dockerd[849]: time="2017-06-13T15:36:03.706378940-07:00" level=info msg="Container 87bf5e95b8e9b57aaaa7beb6998246569b741b46ce4db8352ab90dc40f255546 failed to exit within 10 seconds of signal 15 - using the fo
Jun 13 15:36:03 sm5 dockerd[849]: time="2017-06-13T15:36:03.806396000-07:00" level=warning msg="Logger didn't exit in time: logs may be truncated"
Jun 13 15:51:17 sm5 dockerd[849]: time="2017-06-13T15:51:17.015564406-07:00" level=error msg="Handler for GET /v1.26/containers/all/top returned error: No such container: all"
Jun 13 15:51:19 sm5 dockerd[849]: time="2017-06-13T15:51:19.898349773-07:00" level=error msg="Handler for GET /v1.26/containers/alternatives.log/top returned error: No such container: alternatives.log"

Restarting the Docker service (service docker restart) worked to unblock the daemon.

diegito commented 7 years ago

After some time, I managed to reproduce it, although not in a consistent way. It seems that running docker logs on a container with very long logs causes the Docker daemon to freeze (docker ps does not work; docker info still works, with the output above). What I have on this Docker daemon is an ELK stack (Elasticsearch, Logstash and Kibana), with 3 logspout instances pushing logs from 3 swarm machines into Logstash, which does the filtering. Logstash had been crashing a few times, so to inspect it I ran docker logs logstash and an endless amount of logs started pouring out, such that I had to kill docker logs with ctrl+c. I also tried docker logs logstash --tail 100, but at that point the Docker daemon simply hung without giving me a response.

I'll leave it there for a couple more hours in case anybody reads this message and wishes me to run more commands on it before I restart it.
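
If it helps gather data while the daemon is still hung: sending SIGUSR1 to dockerd should make it dump its goroutine stack traces without restarting it, which is usually the first thing asked for on hangs like this. A sketch, assuming dockerd is managed by systemd as shown above:

# Ask the hung daemon to dump its goroutine stacks (it keeps running)
sudo kill -SIGUSR1 $(pidof dockerd)
# On 1.13 the dump should land in the daemon log; newer versions write a
# goroutine-stacks-*.log file under the daemon's run directory instead
sudo journalctl -u docker.service --since "5 minutes ago"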

thebalaa commented 7 years ago

I am also seeing this behavior in 1.13.1, and as the OP notes, the issue appears to be related to large log files, but I'm not 100% sure. I was investigating an issue where logs were no longer being written to the log file in /var/lib/docker/container//.log. I am using the default json logging driver.

Would appreciate some advice on how I could work around this issue!
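
One possible workaround for json-file logs growing without bound (not a fix for the hang itself) is to enable log rotation in the daemon configuration. A sketch only; the size and file-count values are examples, and the tee below overwrites /etc/docker/daemon.json, so merge by hand if you already have options there:

# Enable rotation for the default json-file log driver
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
# Restart the daemon; rotation only applies to containers created afterwards
sudo systemctl restart docker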

thaJeztah commented 7 years ago

Some recommendations off the top of my head:

Tezraine commented 6 years ago

I'm not sure if my issue is the exact same bug, but it is the same symptoms, and I have a way to reliably reproduce it. Hopefully this information will be of some help.

Steps to reproduce:

  1. (Optional catalyst) Change the Docker Advanced settings to the minimum (CPU: 1 | Memory: 1024 | Swap: 512).
  2. (Optional) In a side terminal, run docker stats (good for seeing right when problems start; you will see stats start to blink out and back in as you get near your limit).
  3. Run docker run -d neo4j (the actual image doesn't matter too much; replace neo4j with your image of choice, as long as it stays running in the background and uses a fair amount of memory).
  4. Repeat step 3 until catastrophic failure, e.g. with the loop sketched after this list. (The more containers you get up before triggering the problem, the worse the daemon will hang, and the less likely it is to recover on its own. Since the daemon takes time to spin up containers, the faster you spin them up, the worse the issue becomes.)
  5. The Docker daemon is now unresponsive.
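
A minimal sketch of steps 3-4 in a bash shell (the image and loop count are just examples; on Docker for Windows the equivalent loop works in PowerShell):

# Keep starting background containers until the daemon chokes;
# watch docker stats in the second terminal from step 2
for i in $(seq 1 15); do
  docker run -d neo4j
done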

Extra notes:

If I open the Hyper-V Manager, the MobyLinuxVM is only using ~6-15% CPU while the daemon is unresponsive, and the host is running at about 10-20% CPU. Host disk usage runs up to 100% and is probably the main bottleneck causing the problem. If your Docker is unresponsive, I recommend checking the host disk utilization.

The problem starts when the sum of MEM % across the containers exceeds 100. If you exceed 100% by far enough, a hard reboot will be required.

If you push the daemon to its limit, but not too far over it, it seems it can recover when/if containers start to fail.
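
To check the "sum of MEM %" condition without eyeballing docker stats, something like this should work (a sketch using the MemPerc placeholder of docker stats --format):

# Sum the memory percentage across all running containers;
# per the note above, trouble starts once this passes ~100
docker stats --no-stream --format "{{.MemPerc}}" | tr -d '%' | awk '{s += $1} END {print s "%"}'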

docker version

Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:21:34 2018
 OS/Arch:           windows/amd64
 Experimental:      false
Error response from daemon: A blocking operation was interrupted by a call to WSACancelBlockingCall.
PS C:\Users\ryanculp\workspace\Stampede\Docker Images\target\stampede\tak> docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:21:34 2018
 OS/Arch:           windows/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:29:02 2018
  OS/Arch:          linux/amd64
  Experimental:     false

docker info

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.93-linuxkit-aufs
Operating System: Docker for Windows
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.934GiB
Name: linuxkit-00155dec1142
ID: ZVKP:OHOH:RVAK:5KM5:2NGV:BF33:E5VP:CCOJ:QOCE:4PYQ:U3SY:ZAIH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 22
 Goroutines: 46
 System Time: 2018-09-05T17:07:33.6394803Z
 EventsListeners: 1
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

thaJeztah commented 1 year ago

Let me close this ticket for now, as it looks like it went stale.