moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems
https://mobyproject.org/
Apache License 2.0

Die hard containers #33300

Closed lskbr closed 1 year ago

lskbr commented 7 years ago

Hi,

I have been using Docker for two years. I have a small cluster for development purposes where I use Docker Swarm to coordinate and distribute my containers. From time to time, I have problems killing containers. docker stop and docker kill do not report any errors, but the containers are always shown as running. I cannot stat these containers, but I can inspect them. If I restart the computer, they are still listed as running containers. They do not seem to consume any memory or CPU (at least no significant amount).
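One way to narrow down this kind of symptom is to compare what Docker reports with what the kernel sees. The following is a minimal sketch, not a confirmed diagnosis: it assumes the commands are run against the node's local engine rather than the swarm endpoint, and `pid_alive` is a hypothetical helper introduced here for illustration. `.State.Pid` and `.State.Running` are fields of the `docker inspect` output.

```shell
#!/usr/bin/env bash
# pid_alive PID [PROC_ROOT]: succeed if a process directory PROC_ROOT/PID exists.
# PROC_ROOT defaults to /proc; the optional argument is only there for testing.
pid_alive() {
    local pid=$1 proc_root=${2:-/proc}
    # Docker reports PID 0 for a container that has no running process.
    [ -n "$pid" ] && [ "$pid" != "0" ] && [ -d "$proc_root/$pid" ]
}

# Usage on a node, for one container ID stored in $id:
#   docker inspect -f '{{.State.Running}}' "$id"
#   pid=$(docker inspect -f '{{.State.Pid}}' "$id")
#   pid_alive "$pid" && echo "process $pid exists" || echo "no such process"
```

If Docker says the container is running but no matching process exists on the host, the stale state would be on Docker's side rather than a process that refuses to die.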

There is nothing special about the containers I create. I use a custom-built Python image in more than 100 containers. Most of the time I have no problems stopping and restarting them. When I cannot kill them, I log into each node with an undead container and execute the following procedure:

I get the IDs of the die-hard containers, then disable Docker and reboot:

docker ps -f label=project -q --no-trunc
sudo -s
systemctl disable docker
reboot

When the machine restarts:

sudo -s
cd /var/lib/docker/containers
rm -rf \
065e505db408f835fef8f79e46078f1b357f573366335471f64f51cf5e29e64d
sudo systemctl enable docker

This simple procedure needs to be executed on every machine with undead containers :-( It is time-consuming and it breaks my deployment scripts.

My current setup uses 7 machines running Linux.

The containers are started with the following template:

docker create --name X \
        --restart=unless-stopped -d -t -i --network=host \
        --add-host=postgres:${POSTGRES_IP} \
        --add-host=redis:${REDIS_IP} \
        --add-host=rabbitmq:${RABBITMQ_IP} \
        --add-host=cassandra:${CASSANDRA_IP} \
        -e PROJECT_ENVIRONMENT=${ENVIRONMENT} \
        -v /export/resources:/resources \
        -w ${DJANGO_HOME} -u service \
        ${IMAGE}

uname -a

Linux m1 4.4.0-78-generic #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

docker version
Client:
 Version:      17.04.0-ce
 API version:  1.24 (downgraded from 1.28)
 Go version:   go1.7.5
 Git commit:   4845c56
 Built:        Mon Apr  3 18:07:42 2017
 OS/Arch:      linux/amd64

Server:
 Version:      swarm/1.2.6
 API version:  1.22 (minimum version )
 Go version:   go1.7.1
 Git commit:   `git rev-parse --short HEAD`
 Built:        `date -u`
 OS/Arch:      linux/amd64
 Experimental: false

docker info

Containers: 12
 Running: 10
 Paused: 0
 Stopped: 2
Images: 113
Server Version: swarm/1.2.6
Role: primary
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint, whitelist
Nodes: 7
 m2: 192.168.1.2:2375
  └ ID: 74MO:2I2U:IHWY:SOLA:6NXF:WE5W:TKZF:ZVNB:IVC4:DXFN:ACGC:FCY6
  └ Status: Healthy
  └ Containers: 1 (1 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 12
  └ Reserved Memory: 0 B / 65.99 GiB
  └ Labels: kernelversion=4.4.0-72-generic, operatingsystem=Ubuntu 16.04.2 LTS, storagedriver=aufs
  └ UpdatedAt: 2017-05-19T08:43:55Z
  └ ServerVersion: 17.04.0-ce
 m3: 192.168.1.3:2375
  └ ID: ZTCL:O3JN:FPVN:IXBH:CVIJ:H5PX:YC6I:QUT5:LYIU:CIAC:RFR2:JQWI
  └ Status: Healthy
  └ Containers: 1 (1 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 8
  └ Reserved Memory: 0 B / 33.02 GiB
  └ Labels: kernelversion=4.4.0-75-generic, operatingsystem=Ubuntu 16.04.2 LTS, storagedriver=aufs
  └ UpdatedAt: 2017-05-19T08:43:37Z
  └ ServerVersion: 17.04.0-ce
 m4: 192.168.1.4:2375
  └ ID: XYCB:KIE3:YGUA:T5EX:SLC4:KQ4Y:HC2X:DNLH:WUDM:JDCP:S2HJ:BCTP
  └ Status: Healthy
  └ Containers: 2 (2 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 16
  └ Reserved Memory: 0 B / 99.05 GiB
  └ Labels: kernelversion=4.4.0-75-generic, operatingsystem=Ubuntu 16.04.2 LTS, storagedriver=aufs
  └ UpdatedAt: 2017-05-19T08:44:20Z
  └ ServerVersion: 17.04.0-ce
 m1: 192.168.1.1:2375
  └ ID: HPV4:JZPD:E43J:IUQF:FHCA:FAXU:O3MM:IUBH:DK2X:2W7G:EQEC:3UXJ
  └ Status: Healthy
  └ Containers: 4 (4 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 16
  └ Reserved Memory: 0 B / 99.05 GiB
  └ Labels: kernelversion=4.4.0-78-generic, operatingsystem=Ubuntu 16.04.2 LTS, storagedriver=aufs
  └ UpdatedAt: 2017-05-19T08:44:24Z
  └ ServerVersion: 17.04.0-ce
 m5: 192.168.1.5:2375
  └ ID: QBEK:YKU5:ZTER:SPUD:CAO3:KFNJ:QW22:NQUH:3AY5:MCQI:IQKA:YRF6
  └ Status: Healthy
  └ Containers: 1 (1 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 16
  └ Reserved Memory: 0 B / 99.05 GiB
  └ Labels: kernelversion=4.4.0-72-generic, operatingsystem=Ubuntu 16.04.2 LTS, storagedriver=aufs
  └ UpdatedAt: 2017-05-19T08:44:31Z
  └ ServerVersion: 17.04.0-ce
 m6: 192.168.1.6:2375
  └ ID: GINY:O4OX:IXRE:W3YU:ISJG:C36X:NNHN:5D44:3ZRE:4WZS:MR5B:75EO
  └ Status: Healthy
  └ Containers: 1 (1 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 21
  └ Reserved Memory: 0 B / 99.05 GiB
  └ Labels: kernelversion=4.4.0-78-generic, operatingsystem=Ubuntu 16.04.2 LTS, storagedriver=aufs
  └ UpdatedAt: 2017-05-19T08:44:10Z
  └ ServerVersion: 17.04.0-ce
 m7: 192.168.1.7:2375
  └ ID: 2YE6:7XTJ:IYCU:XXWD:S6F7:TPZQ:KY2O:3T7P:TXDT:ILKL:DSUI:QASK
  └ Status: Healthy
  └ Containers: 2 (0 Running, 0 Paused, 2 Stopped)
  └ Reserved CPUs: 0 / 21
  └ Reserved Memory: 0 B / 99.05 GiB
  └ Labels: kernelversion=4.4.0-78-generic, operatingsystem=Ubuntu 16.04.2 LTS, storagedriver=aufs
  └ UpdatedAt: 2017-05-19T08:44:14Z
  └ ServerVersion: 17.04.0-ce
Plugins: 
 Volume: 
 Network: 
Swarm: 
 NodeID: 
 Is Manager: false
 Node Address: 
Kernel Version: 4.4.0-78-generic
Operating System: linux
Architecture: amd64
CPUs: 110
Total Memory: 594.3GiB
Name: 0ba32ddaf9ef
Docker Root Dir: 
Debug Mode (client): false
Debug Mode (server): false
Experimental: false
Live Restore Enabled: false

WARNING: No kernel memory limit support

lskbr commented 7 years ago

I found a workaround that stops and restarts Docker without rebooting the machine:

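# Collect the full IDs of the containers listed as running
RUNNING=$(docker ps --no-trunc -q)
sudo systemctl stop docker
# Remove each container's state directory while the daemon is stopped
for c in $RUNNING; do
    sudo rm -rf "/var/lib/docker/containers/$c"
done
sudo systemctl start docker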

The containers disappear, but I got these error messages during the delete (with Docker stopped):

rm: cannot remove '/var/lib/docker/containers/ec946754ce2d2463f5b91fd9cbcbc76cbeec706c2d1409714e95bb7c07082d1c/shm': Device or resource busy
rm: cannot remove '/var/lib/docker/containers/522e7fc7af940d392d0d5c0354087579c78ac4424e16cf6308ccf5db8b577f6f/shm': Device or resource busy
rm: cannot remove '/var/lib/docker/containers/0c2e334268731214e54b49f519081f1ab0fac273d7e79d0451b661afe5803917/shm': Device or resource busy
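
rm reports "Device or resource busy" when the path is still a mount point, so these errors suggest that each container's shm directory was still mounted after Docker stopped. A minimal sketch for listing such leftover mounts, assuming the container directory shown in the errors above; `list_mounts_under` is a hypothetical helper that reads a mount table in /proc/mounts format:

```shell
#!/usr/bin/env bash
# list_mounts_under DIR [MOUNTS_FILE]: print every mount point located below DIR.
# MOUNTS_FILE defaults to /proc/mounts, where the second field is the mount point.
list_mounts_under() {
    local dir=${1%/} mounts=${2:-/proc/mounts}
    awk -v prefix="$dir/" 'index($2, prefix) == 1 { print $2 }' "$mounts"
}

# Usage (as root, with Docker stopped), before removing the directories:
#   list_mounts_under /var/lib/docker/containers | while read -r m; do
#       umount "$m"
#   done
```

Unmounting first should let the rm succeed, although whether the leftover mount is a cause or only a side effect of the stuck containers is not established here.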

I updated the kernel on all machines, but the error persists:

uname -a

Linux adam 4.8.0-52-generic #55~16.04.1-Ubuntu SMP Fri Apr 28 14:36:29 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

docker info

Containers: 14
 Running: 2
 Paused: 0
 Stopped: 12
Images: 3
Server Version: 17.05.0-ce
Storage Driver: aufs
 Root Dir: /export/docker/aufs
 Backing Filesystem: extfs
 Dirs: 85
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9048e5e50717ea4497b757314bad98ea3763c145
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.8.0-52-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 20
Total Memory: 94.33GiB
Name: eve
ID: GINY:O4OX:IXRE:W3YU:ISJG:C36X:NNHN:5D44:3ZRE:4WZS:MR5B:75EO
Docker Root Dir: /export/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 192.168.1.X:5000
 127.0.0.0/8
Live Restore Enabled: false

lskbr commented 7 years ago

This error seems to be related to high memory usage. It happens when Linux starts killing processes (OOM). The downside is that Docker has to be restarted and the container files removed manually.
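
To check whether the OOM killer was active around the time a container became stuck, the kernel log can be filtered. This is a sketch only: `filter_oom_events` is a hypothetical helper, and the patterns are an assumption based on common Linux kernel messages, so they may need adjusting for a given kernel version.

```shell
#!/usr/bin/env bash
# filter_oom_events: read kernel log text on stdin and print the lines that
# look like OOM-killer activity. Prints nothing if no line matches.
filter_oom_events() {
    grep -Ei 'out of memory|oom-killer|killed process' || true
}

# Usage on a node:
#   dmesg | filter_oom_events
#   journalctl -k | filter_oom_events
```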

fntlnz commented 7 years ago

Monitoring the oom_score would be helpful

 for f in /proc/[0-9]*/oom_score; do echo "$(cat "$f" 2>/dev/null) $f"; done | sort -n

Docker also has options for dealing with the OOM killer when you run a container:

docker run --help | grep oom
      --oom-kill-disable               Disable OOM Killer
      --oom-score-adj int              Tune host's OOM preferences (-1000 to 1000)

duglin commented 7 years ago

I love the title - can't wait for the sequel... Die harder containers

lskbr commented 7 years ago

Hi, I got the "die hard containers" again today. I ran

ls /proc/*/oom_score | awk '{print system("cat $1") " " $1 }' | sort

The output is 0 for all processes.
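
A possible explanation for the zeros: in awk, system() returns the exit status of the command it runs, and the $1 inside the double-quoted string is not expanded by awk, so the print statement shows 0 whenever the command succeeds and never the score itself. A minimal sketch that reads the scores directly; `list_oom_scores` is a hypothetical helper, and its optional argument exists only so that it can be pointed at a directory other than /proc:

```shell
#!/usr/bin/env bash
# list_oom_scores [PROC_ROOT]: print "score pid command" for every process,
# highest score first. PROC_ROOT defaults to /proc.
list_oom_scores() {
    local proc_root=${1:-/proc} f pid score comm
    for f in "$proc_root"/[0-9]*/oom_score; do
        [ -r "$f" ] || continue              # process exited, or the glob matched nothing
        pid=${f%/oom_score}; pid=${pid##*/}  # extract the PID from the path
        score=$(cat "$f" 2>/dev/null) || continue
        comm=$(cat "$proc_root/$pid/comm" 2>/dev/null)
        printf '%s %s %s\n' "$score" "$pid" "$comm"
    done | sort -rn
}

# Usage: list_oom_scores | head -n 20
```

Processes near the top of this list are the ones the kernel would prefer to kill first under memory pressure.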

These containers cannot be stopped or killed. They also restart when I boot the machine. The only way I found to stop them is to stop docker, wipe the container directory and start docker again.

thaJeztah commented 1 year ago

Let me close this ticket for now, as it looks like it went stale.