Could it fix itself if I kill the problematic shim process, or will that make the main Docker service go haywire? I have seen a few people write "just reboot the machine" or "just restart the Docker service", but there are a few database containers running that would cause complete downtime if they went down.
Additional information, in case it is helpful: the shim process doesn't have a normal child process like the other containers, but there are a lot of docker-containe userland threads running under the shim process. The config.v2.json also says the state is "Running": true even though the nginx process is non-existent.
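To compare what is actually hanging off the shim with what Docker believes, a quick sketch (SHIM_PID is a placeholder for the docker-containerd-shim PID taken from ps; the container ID is the one shown below):

ps --ppid "$SHIM_PID" -o pid,comm   # what really runs under the shim
grep -o '"Running":[a-z]*' /var/lib/docker/containers/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67/config.v2.json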
Output of cat /var/lib/docker/containers/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67/config.v2.json:
{
"StreamConfig":{
},
"State":{
"Running":true,
"Paused":false,
"Restarting":false,
"OOMKilled":false,
"RemovalInProgress":false,
"Dead":false,
"Pid":9628,
"ExitCode":0,
"Error":"",
"StartedAt":"2017-02-10T08:19:37.822843356Z",
"FinishedAt":"0001-01-01T00:00:00Z",
"Health":null
},
"ID":"0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67",
"Created":"2017-02-10T08:19:37.415973424Z",
"Managed":false,
"Path":"nginx",
"Args":[
"-g",
"daemon off;"
],
"Config":{
"Hostname":"0589585b7d9e",
"Domainname":"",
"User":"",
"AttachStdin":false,
"AttachStdout":false,
"AttachStderr":false,
"ExposedPorts":{
"443/tcp":{
},
"80/tcp":{
}
},
"Tty":false,
"OpenStdin":false,
"StdinOnce":false,
"Env":[
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"NGINX_VERSION=1.11.8"
],
"Cmd":[
"nginx",
"-g",
"daemon off;"
],
"ArgsEscaped":true,
"Image":"nginx:1.11.8-alpine",
"Volumes":{
"/etc/letsencrypt":{
},
"/etc/nginx/nginx.conf":{
},
"/home/xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/web":{
},
"/var/www":{
}
},
"WorkingDir":"",
"Entrypoint":null,
"OnBuild":null,
"Labels":{
"com.docker.compose.config-hash":"07911b3e1f03e79c3ec63637fd9d1618d8f48483be05bc820dc287ea254038da",
"com.docker.compose.container-number":"1",
"com.docker.compose.oneoff":"False",
"com.docker.compose.project":"ckdc",
"com.docker.compose.service":"nginx",
"com.docker.compose.version":"1.10.0"
}
},
"Image":"sha256:f1fae62370b48b33cbe68e0c7f251194ff7566587cdcb406336484a9b35824ed",
"NetworkSettings":{
"Bridge":"",
"SandboxID":"e9e14f101ab1753ff481e86911aa332ee38e768c2681838003c1488c31955e08",
"HairpinMode":false,
"LinkLocalIPv6Address":"",
"LinkLocalIPv6PrefixLen":0,
"Networks":{
"ckdc_website":{
"IPAMConfig":null,
"Links":null,
"Aliases":[
"nginx",
"0589585b7d9e"
],
"NetworkID":"127c536f22632a3713a3b1790e758edf68e1851bc2d32a9e4b6370bfa1a5732f",
"EndpointID":"0ae734c2c97449b40c73891a3db085f2cd623a3b813dc39573586d2742f3669f",
"Gateway":"172.19.0.1",
"IPAddress":"172.19.0.7",
"IPPrefixLen":16,
"IPv6Gateway":"",
"GlobalIPv6Address":"",
"GlobalIPv6PrefixLen":0,
"MacAddress":"02:42:ac:13:00:07",
"IPAMOperational":false
}
},
"Service":null,
"Ports":{
"443/tcp":[
{
"HostIp":"0.0.0.0",
"HostPort":"443"
}
],
"80/tcp":[
{
"HostIp":"0.0.0.0",
"HostPort":"80"
}
]
},
"SandboxKey":"/var/run/docker/netns/e9e14f101ab1",
"SecondaryIPAddresses":null,
"SecondaryIPv6Addresses":null,
"IsAnonymousEndpoint":false,
"HasSwarmEndpoint":false
},
"LogPath":"/var/lib/docker/containers/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67-json.log",
"Name":"/ckdc_nginx_1",
"Driver":"aufs",
"MountLabel":"",
"ProcessLabel":"",
"RestartCount":0,
"HasBeenStartedBefore":true,
"HasBeenManuallyStopped":true,
"MountPoints":{
"/etc/letsencrypt":{
"Source":"/opt/letsencrypt",
"Destination":"/etc/letsencrypt",
"RW":true,
"Name":"",
"Driver":"",
"Type":"bind",
"Relabel":"rw",
"Spec":{
"Type":"bind",
"Source":"/opt/letsencrypt",
"Target":"/etc/letsencrypt"
}
},
"/etc/nginx/nginx.conf":{
"Source":"/opt/nginx/nginx.conf",
"Destination":"/etc/nginx/nginx.conf",
"RW":false,
"Name":"",
"Driver":"",
"Type":"bind",
"Relabel":"ro",
"Spec":{
"Type":"bind",
"Source":"/opt/nginx/nginx.conf",
"Target":"/etc/nginx/nginx.conf",
"ReadOnly":true
}
},
"/home/xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/web":{
"Source":"/home/xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/web",
"Destination":"/home/xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/web",
"RW":true,
"Name":"",
"Driver":"",
"Type":"bind",
"Relabel":"rw",
"Spec":{
"Type":"bind",
"Source":"/home/xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/web",
"Target":"/home/xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/web"
}
},
"/var/www":{
"Source":"/var/www",
"Destination":"/var/www",
"RW":true,
"Name":"",
"Driver":"",
"Type":"bind",
"Relabel":"rw",
"Spec":{
"Type":"bind",
"Source":"/var/www",
"Target":"/var/www"
}
}
},
"SecretReferences":null,
"AppArmorProfile":"",
"HostnamePath":"/var/lib/docker/containers/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67/hostname",
"HostsPath":"/var/lib/docker/containers/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67/hosts",
"ShmPath":"/var/lib/docker/containers/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67/shm",
"ResolvConfPath":"/var/lib/docker/containers/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67/resolv.conf",
"SeccompProfile":"",
"NoNewPrivileges":false
}
(The reason for all the mounted volumes is that we are transitioning to containerisation.)
I see now that "Pid" in the config.v2.json file does not correspond to a process on the machine... There is no process running with PID 9628. 🤔
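A quick way to double-check that (sketch, using the container ID and PID from this report):

docker inspect -f '{{.State.Pid}}' 0589585b7d9e   # reports 9628
ps -p 9628 -o pid,comm                            # no matching line if the process is gone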
Killing the shim process made the whole dockerd unresponsive. Went into a quick downtime window and restarted the Docker service. Everything was back within a minute and the ghost problem is now gone. I find this a dirty fix, but it works. Given the inability to reproduce the bug this can be closed; however, I still think the traceback from docker-compose top might be worth looking into.
This just happened again. Nginx again. Can't stop container. Will try using the non-alpine version to see if that changes anything.
This seems to be a duplicate of https://github.com/docker/docker/issues/10589 however that issue is closed and we are still experiencing the same problem.
The linked issue has some similarities, but is definitely not the same (the whole runtime for docker has been replaced since docker 1.5).
What seems to happen in your case is that the daemon's view of the container has gotten out of sync with its actual state. The docker compose traceback won't help debugging here, because that error is only a consequence of this situation. What could help narrow down what's happening:
- running docker-compose up in the foreground to watch the container's output
- checking whether the container is still responsive (can you reach the application, or do any actions?)
- docker logs <container id>; if that fails, there could be json-log logfiles inside the container's directory
/cc @mlaventure
Hello @thaJeztah,
This has happened twice today, on a production machine. In both cases I had to restart the Docker service, as this is the only known fix. Therefore I do not have any logfiles, and the docker logs command did not work, as explained below. These are the steps I have taken today:
nginx:
  depends_on:
    - portainer
    - phpmyadmin
    - artifactory
    - nodebb
  image: nginx:1.11.8-alpine
  volumes:
    - /var/www:/var/www
    - /opt/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    - /opt/letsencrypt:/etc/letsencrypt
    - /home/**********/**********/**********/web:/home/**********/**********/**********/web
  ports:
    - "80:80"
    - "443:443"
  restart: always
  networks:
    - website
1. Started the stack with docker-compose -f /opt/ckdc/docker-compose.yml -p ckdc up -d.
2. Ran docker-compose -f /opt/ckdc/docker-compose.yml -p ckdc logs nginx, however it got stuck at Attaching to ....
3. Ran docker logs ckdc_nginx_1, but it didn't print anything and I had to cancel with CTRL+C.
4. Neither docker-compose stop, docker-compose kill, docker stop, nor docker kill worked. Docker said the process was stopped after all 4 commands, but docker ps showed it still running. Before I ran those 4 commands the docker-containerd-shim process did have the nginx subprocess; after I ran them the shim process was still there but the nginx sub-process was gone. To me, this sounds like a "daemon did not get this signal" error.
5. Tried docker rm ckdc_nginx_1 and then with the -f flag. I do not remember the output of the first attempt, but with -f it printed an error message saying it failed to stop the container. It also deleted all files in /var/lib/docker/containers/containerid000000/ except the shm folder, because of Device or resource busy. The shim process and the docker-proxy process were still running and occupied the port.
6. Enabled live-restore according to the following post: https://docs.docker.com/engine/admin/live-restore/ and sent a SIGHUP to the dockerd process (a sketch of this step follows the list).
7. Did "service docker restart", but all containers restarted. I don't know if this is intentional or not; if it is not, then the dockerd process might not have gotten the SIGHUP signal.
Thanks for the help! Any comments or ideas on what could be causing these crashes?
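For reference, a minimal sketch of that live-restore step (merge the key into any existing /etc/docker/daemon.json rather than overwriting the file):

echo '{ "live-restore": true }' | sudo tee /etc/docker/daemon.json
sudo kill -SIGHUP "$(pidof dockerd)"   # dockerd re-reads daemon.json on SIGHUP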
@Foorack could you put your daemon into debug mode and provide the daemon logs when it happens? The daemon has to have been in debug mode before the issue occurs.
Usually, the only reason for a shim to stay stuck is that it is waiting for IOs to be consumed. It could be an issue with the communication with containerd though; sending a SIGUSR1 to docker-containerd could help diagnose this (the output will appear in the docker daemon logs).
The sub docker-container.. processes under the shim are quite unexpected though; would it be possible to get the full command line for those (cat /proc/$pid/cmdline)?
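A sketch of those two diagnostics (process names as shipped with Docker 1.13; SHIM_PID is a placeholder for the stuck shim's PID taken from ps):

sudo kill -SIGUSR1 "$(pidof docker-containerd)"   # stack dump lands in the daemon logs
for pid in $(pgrep -P "$SHIM_PID"); do            # children of the stuck shim
  tr '\0' ' ' < "/proc/$pid/cmdline"; echo        # /proc cmdline is NUL-separated
done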
Hi,
I have encountered the same problem. I was running mongodb and kafka images on Ubuntu 16.04 in swarm mode.
Please see below:

docker stop broker02.1.1rf7j2a6njxycwjbc6hc069k2
Error response from daemon: Cannot stop container broker02.1.1rf7j2a6njxycwjbc6hc069k2: Cannot kill container e1da77e2e13b681312d8bd6bab019c607cc16b88a0696826b2593dc093d34047: rpc error: code = 14 desc = grpc: the connection is unavailable

docker version
Client:
 Version:      1.13.0
 API version:  1.25
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Tue Jan 17 09:58:26 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.0
 API version:  1.25 (minimum version 1.12)
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Tue Jan 17 09:58:26 2017
 OS/Arch:      linux/amd64
 Experimental: false

Containers: 35
 Running: 6
 Paused: 0
 Stopped: 29
Images: 13
Server Version: 1.13.0
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 151
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: mvimcgosd3ajnul07i7s23puw
 Is Manager: false
 Node Address: 192.168.0.29
 Manager Addresses:
  192.168.0.27:2377
  192.168.0.32:2377
  192.168.0.33:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: N/A (expected: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e)
runc version: 2f7393a47307a16f8cee44a37b262e8b81021e3e
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-62-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.797 GiB
Name: worker01
ID: 3BUL:LI3A:ZTCA:RKJE:VGDN:FNVG:5ROB:ZPFV:UYIN:W5AU:HMLK:VOA2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Having the same issue:
$ sudo docker ps --no-trunc | grep 2a7
49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e <redacted> "<redacted>" 44 hours ago Up 44 hours 0.0.0.0:31239->8080/tcp mesos-1312479e-2636-44ff-94fa-5d62b4f87205-S0.2a74cafe-cbaf-44e7-83af-c79d967dcf4a
$ sudo docker inspect 49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e | grep Pid
"Pid": 24402,
"PidMode": "",
"PidsLimit": 0,
$ sudo ps -ef | grep 24402
$
$ sudo docker exec -it 49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e /bin/bash
rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped
$ sudo docker start 49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e
49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e
$ sudo docker exec -it 49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e /bin/bash
rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped
$ sudo docker kill 49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e
49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e
$ sudo docker ps --no-trunc | grep 49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e
49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e <redacted> "<redacted>" 45 hours ago Up 45 hours 0.0.0.0:31239->8080/tcp mesos-1312479e-2636-44ff-94fa-5d62b4f87205-S0.2a74cafe-cbaf-44e7-83af-c79d967dcf4a
Docker version 1.13.1, build 092cba3 on Mesos/Marathon
Some files do seem still open:
$ sudo lsof | grep 2a74cafe-cbaf-44e7-83af-c79d967dcf4a | wc -l
75
Edit: The base docker process seems to be hanging around:
$ sudo lsof | grep 2a74cafe-cbaf-44e7-83af-c79d967dcf4a
<snip>
docker 23383 23727 root 2w REG 202,1 64237922 1061162 <redacted>/stderr
$ ps -ef | grep 23383
root 23383 23023 0 Feb14 ? 00:00:09 docker -H unix:///var/run/docker.sock run --cpu-shares 102 --memory 134217728 --env-file /tmp/uBa2x8 <redacted>:/mnt/mesos/sandbox --net bridge -p 31239:8080/tcp --name mesos-1312479e-2636-44ff-94fa-5d62b4f87205-S0.2a74cafe-cbaf-44e7-83af-c79d967dcf4a <redacted>
Cont'd:
Below is strace output while trying to kill that process with docker kill:
$ sudo strace -f -p 23383
[pid 23383] futex(0x115f4d0, FUTEX_WAIT, 0, NULL <unfinished ...> [11/26]
[pid 23727] futex(0xc42043bd10, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23724] futex(0xc42043b110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23722] futex(0xc42043a910, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23392] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173906, 78132442}) = 0
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173906, 78167235}) = 0
[pid 23392] clock_gettime(CLOCK_REALTIME, {1487265231, 275782449}) = 0
[pid 23392] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173906, 78423604}) = 0
[pid 23392] futex(0x115e838, FUTEX_WAIT, 0, {60, 0} <unfinished ...>
[pid 23725] <... epoll_wait resumed> [{EPOLLIN|EPOLLOUT, {u32=3767856896, u64=139976752033536}}], 128, -1) = 1
[pid 23725] clock_gettime(CLOCK_MONOTONIC, {173956, 483853076}) = 0
[pid 23725] futex(0x115e838, FUTEX_WAKE, 1) = 1
[pid 23392] <... futex resumed> ) = 0
[pid 23725] read(3, <unfinished ...>
[pid 23392] sched_yield( <unfinished ...>
[pid 23725] <... read resumed> "1db\r\n{\"status\":\"kill\",\"id\":\"49b7"..., 4096) = 482
[pid 23392] <... sched_yield resumed> ) = 0
[pid 23725] futex(0xc42043bd10, FUTEX_WAKE, 1 <unfinished ...>
[pid 23392] futex(0x115e770, FUTEX_WAKE, 1 <unfinished ...>
[pid 23725] <... futex resumed> ) = 1
[pid 23727] <... futex resumed> ) = 0
[pid 23392] <... futex resumed> ) = 0
[pid 23727] epoll_wait(4, [], 128, 0) = 0
[pid 23727] select(0, NULL, NULL, NULL, {0, 3} <unfinished ...>
[pid 23725] read(3, <unfinished ...>
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173956, 484611405}) = 0
[pid 23392] clock_gettime(CLOCK_REALTIME, {1487265281, 682258878}) = 0
[pid 23392] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 23727] <... select resumed> ) = 0 (Timeout)
[pid 23727] epoll_wait(4, [], 128, 0) = 0
[pid 23727] epoll_wait(4, <unfinished ...>
[pid 23392] <... select resumed> ) = 0 (Timeout)
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173956, 484840215}) = 0
[pid 23392] clock_gettime(CLOCK_REALTIME, {1487265281, 682460501}) = 0
[pid 23392] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173956, 485092474}) = 0
[pid 23392] clock_gettime(CLOCK_REALTIME, {1487265281, 682712173}) = 0
[pid 23392] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 23725] <... read resumed> 0xc4203e8000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
[pid 23392] <... select resumed> ) = 0 (Timeout)
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173956, 485800021}) = 0
[pid 23392] clock_gettime(CLOCK_REALTIME, {1487265281, 683432263}) = 0
[pid 23392] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 23725] futex(0xc42043b510, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23392] <... select resumed> ) = 0 (Timeout)
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173956, 486084890}) = 0
[pid 23392] futex(0x115e838, FUTEX_WAIT, 0, {60, 0}^Cstrace: Process 23383 detached
sudo kill 23383 had no effect.
sudo kill -9 23383 correctly ended the container process.
@sgnn7 The daemon logs would be useful here; it would also help to know which files related to docker itself are still open.
When the daemon doesn't answer to SIGTERM, try sending a SIGUSR1 instead; it will dump its stacktrace so we can try and figure out where it is stuck.
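For example (a sketch, assuming a systemd host):

sudo kill -SIGUSR1 "$(pidof dockerd)"         # dockerd dumps its goroutine stacks
sudo journalctl -u docker -n 200 --no-pager   # the dump appears in the daemon logs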
@thinkhard-j-park your issue seems to be different; your error seems to indicate that containerd is not running or responding. Could you confirm this?
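One way to check (sketch; the socket path below is the Docker 1.13 default, adjust if your packaging differs):

pgrep -a -f docker-containerd          # is containerd (and its shims) still alive?
ls -l /var/run/docker/libcontainerd/   # dockerd talks to containerd over a grpc socket here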
@mlaventure Sadly I can't provide the daemon logs (security issues), and I can't do SIGUSR1 since I "fixed" it with a kill -9 to the parent process. The open files were mounted by Mesos as the stdout sandbox, and the problem seems very similar to this one, with an inability to properly clean up.
We're seeing a very similar error with jwilder/nginx-proxy: lots of logs, docker logs hanging on the container, and the container going into a ghost state when we try to stop or remove it. Here are some of our (censored) logs.
We first saw the issue when promoting to production from our Jenkins server; here are the relevant logs from that:
+ docker-compose down
Stopping nginx-proxy-backend ...
Stopping nginx-proxy-frontend ...
Stopping nginx-proxy-frontend ... done
Removing nginx-proxy-backend ...
Removing nginx-proxy-backend ... done
Network frontend is external, skipping
Network backend is external, skipping
+ docker-compose up -d --remove-orphans
Recreating nginx-proxy-frontend
Creating nginx-proxy-backend
ERROR: for nginx-proxy-frontend Error response from daemon: watch null for network "frontend"
Encountered errors while bringing up the project.
While debugging directly on the host, I couldn't kill or remove the container until the daemon was reloaded to turn on debugging:
[root@***host1 ~]# docker ps -a | grep frontend
ce7d9ba75846 1824b7cdcef5 "/app/docker-entry..." 5 weeks ago Up 2 weeks 0.0.0.0:80->80/tcp, 443/tcp nginx-proxy-frontend
[root@***host1 ~]# docker logs --since 5m ce7
^C
[root@***host1 ~]# cd /var/lib/docker/
[root@***host1 docker]# cd containers/
[root@***host1 containers]# cd ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30/
[root@***host1 ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30]# ls -al
total 209256
drwx------. 4 root root 4096 Feb 17 12:59 .
drwx------. 8 root root 4096 Feb 17 12:06 ..
-rw-r-----. 1 root root 214239548 Feb 3 14:05 ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30-json.log
drwx------. 2 root root 6 Jan 12 10:21 checkpoints
-rw-r--r--. 1 root root 3698 Feb 17 12:59 config.v2.json
-rw-r--r--. 1 root root 1210 Feb 17 12:59 hostconfig.json
-rw-r--r--. 1 root root 13 Jan 30 12:06 hostname
-rw-r--r--. 1 root root 172 Jan 30 12:06 hosts
-rw-r--r--. 1 root root 60 Jan 30 12:06 resolv.conf
-rw-r--r--. 1 root root 71 Jan 30 12:06 resolv.conf.hash
drwxrwxrwt. 2 root root 40 Jan 30 12:06 shm
[root@***host1 ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30]# cd
[root@***host1 ~]# docker kill ce7
ce7
[root@***host1 ~]# docker ps -a | grep frontend
ce7d9ba75846 1824b7cdcef5 "/app/docker-entry..." 5 weeks ago Up 2 weeks 0.0.0.0:80->80/tcp, 443/tcp nginx-proxy-frontend
[root@***host1 ~]# docker rm -f ce7
Error response from daemon: Unable to remove filesystem for ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30: remove /var/lib/docker/containers/ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30/shm: device or resource busy
[root@***host1 ~]# vi /etc/docker/daemon.json
[root@***host1 ~]# systemctl reload docker
[root@***host1 ~]# docker kill ce7
Error response from daemon: Cannot kill container ce7: No such container: ce7
[root@***host1 ~]# docker ps -a | grep frontend
[root@***host1 ~]#
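For reference, the daemon.json edit in that session was presumably just turning on debug, which is one of the options dockerd re-reads on systemctl reload (SIGHUP); a minimal sketch:

echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json   # merge with existing settings
sudo systemctl reload docker                                  # reload, no container restart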
Our logs from journalctl include the following; unfortunately, turning on debugging also eliminated the issue:
Feb 17 12:59:07 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:07.953293414-07:00" level=info msg="Container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30 failed to exit within 10 seconds of signal 15 - using the force"
Feb 17 12:59:07 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:07.953525238-07:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30: rpc error: code = 2 desc = no such process"
Feb 17 12:59:17 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:17.953757486-07:00" level=info msg="Container ce7d9ba75846 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Feb 17 12:59:18 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:18.169064177-07:00" level=error msg="Handler for POST /v1.25/containers/ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30/rename returned error: watch null for network \"frontend\""
Feb 17 12:59:58 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:58.014949896-07:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30: rpc error: code = 2 desc = no such process"
Feb 17 12:59:58 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:58.175071431-07:00" level=warning msg="cannot find target task in store" forwarder.id=qqmusvubg0z4ooqkysj8u5gta method="(*Dispatcher).UpdateTaskStatus" node.id=qqmusvubg0z4ooqkysj8u5gta node.session=ac7ejcds3chpr211pz5ybxqtu task.id=hgrk2zcwnyqt3bfmz63xiqqf4
Feb 17 12:59:58 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:58.175634006-07:00" level=warning msg="cannot find target task in store" forwarder.id=qqmusvubg0z4ooqkysj8u5gta method="(*Dispatcher).UpdateTaskStatus" node.id=qqmusvubg0z4ooqkysj8u5gta node.session=ac7ejcds3chpr211pz5ybxqtu task.id=hgrk2zcwnyqt3bfmz63xiqqf4
Feb 17 12:59:58 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:58.202512892-07:00" level=error msg="task unavailable" method="(*Dispatcher).processUpdates" module=dispatcher task.id=hgrk2zcwnyqt3bfmz63xiqqf4
Feb 17 13:00:08 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:00:08.015193075-07:00" level=info msg="Container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30 failed to exit within 10 seconds of signal 15 - using the force"
Feb 17 13:00:08 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:00:08.015862054-07:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30: rpc error: code = 2 desc = no such process"
Feb 17 13:00:18 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:00:18.016063087-07:00" level=info msg="Container ce7d9ba75846 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Feb 17 13:00:39 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:00:39.694613460-07:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30: rpc error: code = 2 desc = no such process"
Feb 17 13:00:49 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:00:49.694829630-07:00" level=info msg="Container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30 failed to exit within 10 seconds of signal 15 - using the force"
Feb 17 13:00:49 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:00:49.695054174-07:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30: rpc error: code = 2 desc = no such process"
Feb 17 13:00:59 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:00:59.695285744-07:00" level=info msg="Container ce7d9ba75846 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Feb 17 13:02:05 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:02:05.396756518-07:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30: rpc error: code = 2 desc = no such process"
....
Feb 17 13:05:13 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:13.510806160-07:00" level=info msg="Got signal to reload configuration, reloading from: /etc/docker/daemon.json"
Feb 17 13:05:13 ***host1.****.corp systemd[1]: Reloaded Docker Application Container Engine.
....
Feb 17 13:05:13 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:13.656533223-07:00" level=debug msg="***host1.****.corp-5b36d322a50d: Initiating bulk sync with nodes [***host1.****.corp-820f2ede70b5]"
Feb 17 13:05:13 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:13.656558292-07:00" level=debug msg="***host1.****.corp-5b36d322a50d: Initiating unsolicited bulk sync for networks [m94essemfcik8h9fbjqbsthnx tus35vbtw5zq155pwa90yl87a] with node ***host1.****.corp-820f2ede70b5"
Feb 17 13:05:13 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:13.656572447-07:00" level=debug msg="***host1.****.corp-5b36d322a50d: Initiating bulk sync with nodes [***host2.****.corp-2a899fe772fc]"
Feb 17 13:05:13 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:13.656581353-07:00" level=debug msg="***host1.****.corp-5b36d322a50d: Initiating unsolicited bulk sync for networks [m94essemfcik8h9fbjqbsthnx tus35vbtw5zq155pwa90yl87a un66divtohdyb0kwb7wfgagfh osghxmb47ofeackn2q2ndm3hz] with node ***host2.****.corp-2a899fe772fc"
Feb 17 13:05:13 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:13.658081017-07:00" level=debug msg="memberlist: TCP connection from=10.155.1.54:44614"
Feb 17 13:05:13 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:13.658687830-07:00" level=debug msg="***host1.****.corp-5b36d322a50d: Bulk sync to node ***host2.****.corp-2a899fe772fc took 1.449645ms"
Feb 17 13:05:17 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:17.043882197-07:00" level=debug msg="memberlist: TCP connection from=10.155.1.54:44616"
Feb 17 13:05:17 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:17.044355955-07:00" level=debug msg="***host1.****.corp-5b36d322a50d: Initiating bulk sync for networks [m94essemfcik8h9fbjqbsthnx un66divtohdyb0kwb7wfgagfh tus35vbtw5zq155pwa90yl87a osghxmb47ofeackn2q2ndm3hz] with node ***host2.****.corp-2a899fe772fc"
Feb 17 13:05:18 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:18.550826379-07:00" level=debug msg="memberlist: TCP connection from=10.155.1.54:44618"
Feb 17 13:05:21 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:21.188956135-07:00" level=debug msg="Calling GET /_ping"
Feb 17 13:05:21 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:21.189383775-07:00" level=debug msg="Calling POST /v1.25/containers/ce7/kill?signal=KILL"
Feb 17 13:05:21 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:21.189422271-07:00" level=error msg="Handler for POST /v1.25/containers/ce7/kill returned error: Cannot kill container ce7:No such container: ce7"
From the above logs, once we reloaded the daemon to turn on debugging, the ghost container finally died and we were able to deploy again using compose.
Our environment consists of two docker hosts (host1 and host2), set up with swarm mode to act as a k/v store for our attachable overlay networks (we'll finish the transition to services later), and classic swarm being used to spin up the containers on these hosts. There is a logspout container on host1 that ships logs off to another system.
One of my next steps will be to set up docker container log rotation to see if this prevents the issue (a sketch follows below). This container probably has more logs than any other due to frequent polling from a monitoring tool.
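For reference, a minimal sketch of json-file log rotation via /etc/docker/daemon.json (the sizes are illustrative, not a recommendation from this thread; the options only apply to containers created after a daemon restart):

cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}
EOF
sudo systemctl restart docker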
@bmitch3020 it looks like docker never received the exit event for that container, or failed to process it.
There should be a message like:
libcontainerd: received containerd event: &types.Event{Type:"exit", Id:"ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30",...
if containerd generated the event and docker received it.
Could you check if they are present in your logs (assuming the daemon was already in debug mode)?
@mlaventure I have some logs from after we got debugging turned on, but that's after the container failed to stop or rm:
[root@***host1 ~]# journalctl -u docker | grep containerd | grep exit | grep ce7d
Jan 30 12:06:38 ***host1.****.corp dockerd[186450]: time="2017-01-30T12:06:38.559080774-07:00" level=error msg="containerd: deleting container" error="exit status 1: \"container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30 does not exist\\none or more of the container deletions failed\\n\""
Feb 17 13:14:30 ***host1.****.corp dockerd[36555]: time="2017-02-17T13:14:30.743988657-07:00" level=debug msg="containerd: process exited" id=ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30 pid=init status=2 systemPid=8212
Feb 17 13:14:31 ***host1.****.corp dockerd[36555]: time="2017-02-17T13:14:31.728078304-07:00" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"exit\", Id:\"ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30\", Status:0x2, Pid:\"init\", Timestamp:(*timestamp.Timestamp)(0xc4201ee110)}"
If you are binding to ports like 80:80 and 443:443, it would be interesting to see whether these hangs are networking-related.
A quick test would be to run the container in host networking mode, so it does not go through NAT or the docker proxy; the container will have access to the host's network interfaces directly. This is just an idea, because you suggested this seems to happen with heavy network use. This test would be a way for us to rule networking out of the things that could be causing this.
If anyone is comfortable with running this test, that would help a lot. The docker run flag for doing this is docker run --net host; I don't know what the compose syntax would be (see the sketch below), and the container will bind to the default ports on your host interfaces. It's a little less secure, as the container would have access to your loopback interface, but if you trust the container you are good to go. Plus you will see a nice performance bump during this test for your networked applications ;)
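For what it's worth, the compose equivalent should be network_mode (a sketch based on the compose file earlier in this thread; the ports: and networks: sections are dropped because the container shares the host's network stack):

nginx:
  image: nginx:1.11.8-alpine
  network_mode: host
  restart: always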
@mlaventure I think we found a commonality between two occurrences of this error in our case: the container's logs had grown massive, so both times someone had run a docker logs on it that never exited before killing the container in the Marathon UI.
@sgnn7 ... and given #30135 (closed) your finding may be very good news for you!
Let me close this ticket for now, as it looks like it went stale.
Hello,
Basically got a "ghost" container, if that is the right word. It's completely unstoppable. Neither docker-compose stop nor docker-compose kill works; they both say "done" but docker-compose ps says the service is still up. The service is an nginx container binding to ports 80 and 443. docker stop also says it stopped the container, but docker ps says it is still running. Same with docker kill. docker-compose top throws a TypeError when it comes to the problematic container.
Steps to reproduce the issue: Unknown; I don't know what caused it. The container is a normal nginx:1.11.8-alpine that has been running for 2 days with a normal configuration.
Describe the results you received: A shim process with no child process (the shim process with id 0585...). All other shim processes have a child process except the problematic container's, which suggests it might be stuck in a frozen state.
Describe the results you expected: I expected the container to be stopped... 🤔
Additional information you deem important (e.g. issue happens only occasionally): Issue happens completely at random. I have read that something similar happened in 2015, and the common theme seems to be that both are network-heavy processes.
docker-compose logs and docker logs just get stuck without printing anything.

Output of docker version:

Output of docker-compose version:

Output of docker info:

Additional environment details (AWS, VirtualBox, physical, etc.): Physical box hosted at OVH (SoYouStart).