Closed thaJeztah closed 6 years ago
Briefly spoke with @crosbymichael on Slack, and he suspects that it's probably something in the dockerd code that is not restoring things correctly, and the SIGHUP is fixing that
/cc @stevvooe @mlaventure
We are facing a similar issue; the difference is in the steps to reproduce. When we run out of memory on our builders, containerd is killed by the oom-killer and restarted. The result is the same.
@zmlpjuran thanks for adding that; yes I anticipated that if containerd was OOM-killed, the same would happen (see my top description)
I think @caomania and I may have experienced this in Docker for Mac today (17.12 mac49). Is it plausible that this also exists in a hyperkit/linuxkit-based VM?
I did some more testing, and it looks like it's not always possible to recover by sending a SIGHUP to dockerd.
Steps to reproduce;
docker run -it --rm --privileged -v /var/lib/docker docker:18.01 dockerd --debug --iptables=false
Then open a docker exec session in the container, and kill docker-containerd;
docker exec -it $(docker ps -q -n1) sh
/ # killall -9 docker-containerd
/ # docker run --rm hello-world
docker: Error response from daemon: connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused".
ERRO[0002] error waiting for container: context canceled
ERRO[2018-01-30T00:30:53.660577429Z] a5b5dade85229266867c72d6411f3b3222b74715abb62822e6b39462b95cc7c2 cleanup: failed to delete container from containerd: no such container
ERRO[2018-01-30T00:30:53.680504829Z] Handler for POST /v1.35/containers/a5b5dade85229266867c72d6411f3b3222b74715abb62822e6b39462b95cc7c2/start returned error: connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused"
But even after sending a SIGHUP to dockerd, the connection with containerd is still lost (and something is consuming resources);
/ # killall -HUP dockerd
/ # docker run --rm hello-world
docker: Error response from daemon: connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused".
ERRO[0000] error waiting for container: context canceled
INFO[2018-01-27T05:33:45.611135136Z] Got signal to reload configuration, reloading from: /etc/docker/daemon.json
DEBU[2018-01-27T05:33:45.611409299Z] Reset Max Concurrent Downloads: 3
DEBU[2018-01-27T05:33:45.611551128Z] Reset Max Concurrent Uploads: 5
WARN[2018-01-27T05:33:45.642215587Z] failed to retrieve containerd version: rpc error: code = Internal desc = connection error: desc = "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused"
Also, CPU usage goes up after killing docker-containerd;
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
12 1 root S 269m 13% 1 88% dockerd --debug --iptables=false
106 12 root S 278m 14% 1 0% docker-containerd --config /var/ru
56 0 root S 1584 0% 0 0% sh
1 0 root S 1576 0% 1 0% sh -c dockerd --debug --iptables=f
129 56 root R 1520 0% 0 0% top
We are also seeing the same or a similar problem. Here is one example that I happen to still have logs for.
A process pushed the system into out-of-memory territory, and before the OOM killer decided to kill the process that should be killed, docker-containerd got restarted.
Jan 29 17:45:19 3rtzbx1 dockerd: time="2018-01-29T17:45:09.352788352-05:00" level=info msg="killing and restarting containerd" module=libcontainerd pid=27069
The kernel oom-killer message comes after that line. I suspect containerd hit OOM and killed itself.
Regarding recovery, I can't recover the system by sending SIGHUP to dockerd; it never worked for me in the few instances I saw. Restarting docker sometimes worked, but in at least one case a reboot was the only way to fix it.
In addition, docker appears to be in a loop, rebuilding the various networks. This could be a cause of the high CPU utilization. For us, only kill -9 or a reboot resolves the issue. See for example the attached log, which keeps cycling: messages.log
ping @mlaventure I can track the CPU usage down to https://github.com/moby/moby/blob/master/libcontainerd/client_daemon.go#L706
I'm working on tracing down where the client is used/should be refreshed... but some help may save a lot of time.
Basically, containerd exits and docker restarts it, but the code using the client gets connection refused when trying to connect to containerd, as if it were using an old handle or something.
@cpuguy83 I think you're right. containerd deletes the old socket upon start; I think that would make the old inode unreachable.
ATM, the remote client is only created once here: https://github.com/moby/moby/blob/master/libcontainerd/remote_daemon.go#L130
This either needs to be put behind a lock so we can update all clients after containerd gets restarted, or we can just create a new ephemeral one every time it is needed.
The second case may be easier, since it'd also take care of cases where containerd is not started by the daemon. Not sure what the impact would be on performance though.
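The second option (dialing a new connection whenever one is needed) can be illustrated with a minimal Python sketch. This is not moby's libcontainerd code: EphemeralClient, start_listener, demo and the daemon.sock path are hypothetical names, and a plain Unix socket stands in for the gRPC connection to containerd. It only shows that a socket path whose listener died refuses connections, while a fresh dial succeeds once a new listener has replaced the socket file at the same path.

```python
import os
import socket
import tempfile


def start_listener(path):
    """Bind a fresh listening socket at path, deleting any old socket file
    first (the behaviour the comment above describes for containerd)."""
    if os.path.exists(path):
        os.unlink(path)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(1)
    return srv


class EphemeralClient:
    """Hypothetical client that dials the socket path on every call."""

    def __init__(self, path):
        self.path = path

    def ping(self):
        # A new connection per call: nothing is cached across daemon restarts.
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as conn:
            conn.connect(self.path)
        return True


def demo():
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "daemon.sock")
        client = EphemeralClient(path)

        srv = start_listener(path)
        results.append(client.ping())   # stand-in daemon is up

        srv.close()                     # daemon dies; the socket file remains
        try:
            results.append(client.ping())
        except ConnectionRefusedError:
            results.append(False)       # nobody is listening on the old inode

        srv = start_listener(path)      # daemon restarts and recreates the socket
        results.append(client.ping())   # a fresh dial reaches the new listener
        srv.close()
    return results
```

On Linux, demo() should return [True, False, True]. The trade-off raised in the comment still applies: dialing per call avoids stale handles, but adds connection setup cost to every request.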
We are also having this problem with version 17.12.0-ce
Tested on Debian stretch;
docker-containerd --version
containerd github.com/containerd/containerd v1.0.0 89623f28b87a6004d4b785663257362d1658a729
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:11:19 2017
OS/Arch: linux/amd64
Server:
Engine:
Version: 17.12.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:09:54 2017
OS/Arch: linux/amd64
Experimental: false
Containers: 15
Running: 15
Paused: 0
Stopped: 0
Images: 53
Server Version: 17.12.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: N/A (expected: 89623f28b87a6004d4b785663257362d1658a729)
runc version: b2567b37d7b75eb4cf325b77297b140ea686ce8f
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.0-1-amd64
Operating System: Debian GNU/Linux 9 (stretch)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 23.54GiB
Name: cameleon29
ID: SGKP:EZJJ:5R3A:4KHO:I3KS:JOG2:ZWCW:ZWHV:OJEK:CJSN:6IJC:W37E
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
FYI we had a very similar problem on Ubuntu 16.04.3 LTS, but pkill -HUP dockerd (or kill -9 on every process in ps aux | grep docker) didn't help.
Docker version:
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:11:19 2017
OS/Arch: linux/amd64
Server:
Engine:
Version: 17.12.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:09:53 2017
OS/Arch: linux/amd64
Experimental: false
We had a Datadog container that showed unhealthy, so we tried to sudo docker kill it, but this hung for 10+ hours.
We then tried to restart the Docker daemon, but it wouldn't come back up again, because:
Failed to connect to containerd: failed to dial "/var/run/docker/containerd/docker-containerd.sock": dial unix:///var/run/docker/containerd/docker-containerd.s[...]
We tried many different things that didn't help, but in the end, this "solved" our problem:
sudo apt-get update -q && sudo apt-get upgrade -qy
Hope this can help someone else in distress who lands on these pages.
I have a very similar problem too. Maybe this will be helpful for you.
My environment details (I updated my testing env to the latest edge build):
Containers: 16
Running: 16
Paused: 0
Stopped: 0
Images: 12
Server Version: 18.02.0-ce
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-693.17.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 3
Name: rbcentos.from.sh
ID: HQGR:BEDG:5NWR:H2SB:DE5Y:I2QP:XPFQ:374R:LAHE:HZCG:BRLF:MO6M
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: true
Total Memory: 3.734GiB
Starting from a clean docker start, then creating a container:
root 22420 2.0 1.5 984720 58992 ? Ssl 17:28 0:12 /usr/bin/dockerd
root 22426 0.7 0.6 840912 25728 ? Ssl 17:28 0:04 \_ docker-containerd --config /var/run/docker/containerd/containerd.toml
root 31926 0.0 0.0 7508 3188 ? Sl 17:37 0:00 \_ docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.l
root 31951 0.3 0.1 121232 4144 ? Ss 17:37 0:00 | \_ /usr/sbin/httpd -f /etc/httpd/apache-platform/httpd24-shared.conf
And after systemctl restart docker
root 31926 0.0 0.0 9972 3452 ? Sl 17:37 0:00 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/mob
root 31951 0.1 0.1 121232 4144 ? Ss 17:37 0:00 \_ /usr/sbin/httpd -f /etc/httpd/apache-platform/httpd24-shared.conf
root 5728 5.6 0.8 487924 33608 ? Ssl 17:40 0:00 /usr/bin/dockerd
root 5734 2.0 0.5 378924 23104 ? Ssl 17:40 0:00 \_ docker-containerd --config /var/run/docker/containerd/containerd.toml
grep -i ppid /proc/31926/status
PPid: 1
But I have no problems with docker interaction (kill/rm/stop/start etc.) with such containers.
killall -9 dockerd didn't help. Now I can only fix it by recreating the containers.
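The PPid check from this report (grep -i ppid /proc/31926/status) can be scripted when several shims need checking. A minimal sketch, assuming a Linux /proc filesystem; parse_ppid and is_reparented_to_init are hypothetical helper names, not part of any docker tooling.

```python
def parse_ppid(status_text):
    """Return the PPid value from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "ppid":
            return int(value.strip())
    raise ValueError("no PPid field found")


def is_reparented_to_init(pid):
    """True if the process's parent is PID 1 (Linux only, reads /proc)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_ppid(fh.read()) == 1
```

For the shim in the listing above, is_reparented_to_init(31926) would correspond to the PPid: 1 result shown.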
@MitRandi Thanks for the report. This is fixed in containerd 1.0.2 (currently in release candidate phase). Once this is released we can include it in a dockerd patch release.... this would be a problem for all versions of docker from 17.11 and up... but note the containerd patch would only be included in 17.12 and 18.03 (assuming the containerd patch is released soon).
I killed every process one at a time. The above two steps worked for me.
@cpuguy83 I see that containerd 1.0.2 was merged in master. Will it be released in 17.12.1?
@cberner Hopefully. Working on it anyway.
Fixed my issue with a renegade container by restarting docker on the Preferences Reset page.
@cpuguy83 I couldn't see mention of this issue in the 17.12.1 release notes, did it make it in?
@lox looks like it’s described as;
Fix dockerd not being able to reconnect to containerd when it is restarted moby/moby#36173
Thanks @thaJeztah!
@thaJeztah was it fixed in 17.12.1? This PR seems like it wasn't merged: https://github.com/docker/docker-ce/pull/434
@cberner IIRC, containerd 1.0.2 adds some additional improvements, but https://github.com/moby/moby/pull/36173 was included in 17.12.1 (through https://github.com/docker/docker-ce/pull/417)
Ah got it, thanks!
FYI I still have an unkillable container with 17.12.1-ce:
$ sudo docker version
Client:
Version: 17.12.1-ce
API version: 1.35
Go version: go1.9.4
Git commit: 7390fc6
Built: Tue Feb 27 22:17:40 2018
OS/Arch: linux/amd64
Server:
Engine:
Version: 17.12.1-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.4
Git commit: 7390fc6
Built: Tue Feb 27 22:16:13 2018
OS/Arch: linux/amd64
Experimental: false
$ sudo docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
cd7d9365d53a datadog/docker-dd-agent:latest "/entrypoint.sh supe…" 3 weeks ago Up 7 days (unhealthy) 8125/udp, 8126/tcp dd-agent
$ sudo docker kill cd7d9365d53a
# ... nothing happens for 8+ hours ...
Note: This issue happens with Datadog containers specifically, and was originally filed as https://github.com/DataDog/docker-dd-agent/issues/284
EDIT: Maybe it's a different bug, e.g. #35933.
@tgropper The problem is with Docker in this case, not containerd.
This happened after recently upgrading docker to 17.12.0-ce. I restarted docker and it started working fine. OS: MacOS 10.13.3
Hello guys, does containerd have logs? I'm trying to figure out why it was not responding to calls from docker. Docker tried several times over 20 minutes but then killed it.
Some log lines are below. Docker server version is 17.12.0-ce.
time="2018-03-26T20:20:36.605022254+13:00" level=debug msg="daemon is not responding" binary=docker-containerd error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=libcontainerd
time="2018-03-26T20:22:15.993570366+13:00" level=info msg="killing and restarting containerd" module=libcontainerd pid=2942
@mkokho Upgrade to 17.12.1
Containerd logs are piped into the dockerd logs.
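When digging through dockerd logs for these events, the key=value lines quoted in this thread can be parsed mechanically. A minimal sketch, assuming the log format matches the lines shown above; parse_logfmt and containerd_restarts are hypothetical helper names, and the match is on the exact msg text "killing and restarting containerd" as it appears in those lines.

```python
import shlex


def parse_logfmt(line):
    """Split a key=value / key="quoted value" log line into a dict.

    Tokens without '=' (for example a syslog prefix) are ignored.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def containerd_restarts(lines):
    """Yield (time, pid) for each 'killing and restarting containerd' entry."""
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("msg") == "killing and restarting containerd":
            yield fields.get("time"), fields.get("pid")
```

Note that shlex.split raises ValueError on lines with unbalanced quotes, so truncated log lines would need to be skipped or handled separately.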
Upgrade your docker version to the latest using the commands below, then you should be fine:
apt-get update
apt-get remove docker docker-engine docker.io
apt-get install docker-ce
When killing docker-containerd, interacting with containers (docker exec, docker stop, docker kill) fails:
But killing dockerd (either by killall -9 dockerd or a SIGHUP; killall -HUP dockerd) restores functionality.
This problem could explain some reports about "unkillable" containers, where everything appears to be running, but interaction is not possible (possibly after containerd was OOM killed, but could have different causes).
Steps to reproduce / information
Have docker running, start a container, and check output of ps auxf:
docker-containerd and docker-containerd-shim are child-processes of dockerd:
Now, kill docker-containerd (killall -9 docker-containerd).
docker-containerd is restarted (by dockerd); observe that docker-containerd-shim and the container process(es) are reparented (I haven't checked what the new parent process is, and if this is relevant). The docker-containerd-shim processes are no longer child-processes of docker-containerd;
At this point, interacting with containers is now broken.
Containers still show up as running:
Inspecting the container still works, and shows the pid of the container;
But any interaction with the containers is broken;
When directly connecting to containerd, containers still show:
And can be inspected;
Shims are still up:
And the container is still functional, when using docker-runc;
restore functionality
Kill dockerd (killall -9 dockerd) or SIGHUP (killall -HUP dockerd).
Observe that shims are not re-parented (which is probably expected);
But now it's possible again to interact with them:
Version of docker and containerd
Tested on Ubuntu 16.04 on DigitalOcean;