Open howardjohn opened 1 year ago
@AkihiroSuda do you think we can isolate the OOM killer for the nested containers?
We should probably consider setting oom_adj on the core components, for starters.
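As a rough illustration of the oom_adj idea above: on current kernels the per-process knob is /proc/<pid>/oom_score_adj (the older oom_adj file is deprecated), with values from -1000 (never selected by the OOM killer) to 1000 (preferred victim). The sketch below is not how kind does or would do this. The function name set_oom_score_adj and the proc_root parameter are hypothetical, the latter added only so the code can be exercised against a scratch directory. Lowering the value for a real process normally requires CAP_SYS_RESOURCE.

```python
from pathlib import Path

# Kernel-defined bounds for /proc/<pid>/oom_score_adj.
OOM_SCORE_ADJ_MIN = -1000  # process is never selected by the OOM killer
OOM_SCORE_ADJ_MAX = 1000   # process is the preferred victim


def set_oom_score_adj(pid: int, value: int, proc_root: str = "/proc") -> Path:
    """Write `value` to <proc_root>/<pid>/oom_score_adj and return that path.

    `proc_root` is a hypothetical parameter for illustration and testing;
    on a real system it would stay at "/proc".
    """
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj must be in [-1000, 1000], got {value}")
    path = Path(proc_root) / str(pid) / "oom_score_adj"
    path.write_text(f"{value}\n")
    return path
```

Which processes count as "core components" and which value to give them is left open here; the thread does not settle either.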
Testing with v0.12.0 (released March 7th, 2022, before the migration to the systemd cgroup driver) reproduces this, so at least it's not a recent regression, for what little that's worth.
What happened:
Whenever an OOM happens in any container in the cluster, the entire cluster crashes and cannot recover.
What you expected to happen:
The OOM killer kills only the affected container, which is then restarted by k8s, etc.
How to reproduce it (as minimally and precisely as possible):
Wait a minute or so and it will OOM and things will break.
Anything else we need to know?:
dmesg:
Note that here it's runc init, but the same has happened with kubelet, kindnet, and my own apps.
When this happens, the control plane's Docker container restarts. After the restart, kubelet cannot start:
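One way to see which cgroups have recorded OOM kills, besides reading dmesg, is the cgroup v2 memory.events file, which holds counters such as oom and oom_kill. The following is a minimal sketch, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup (both machines below report Cgroup Version: 2). The helper names are hypothetical. By default the counts in memory.events include descendant cgroups; memory.events.local holds the non-hierarchical counts.

```python
from pathlib import Path
from typing import Dict


def parse_memory_events(text: str) -> Dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    events: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def cgroups_with_oom_kills(cgroup_root: str = "/sys/fs/cgroup") -> Dict[str, int]:
    """Map each cgroup directory under `cgroup_root` to its oom_kill count,
    keeping only cgroups whose count is greater than zero."""
    hits: Dict[str, int] = {}
    for events_file in Path(cgroup_root).rglob("memory.events"):
        count = parse_memory_events(events_file.read_text()).get("oom_kill", 0)
        if count > 0:
            hits[str(events_file.parent)] = count
    return hits
```

Run inside the kind node container, this would show whether the kill was charged to a pod's cgroup or to one higher up the tree.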
Environment:
Tested on two machines; docker info for both is included below.
kind version: kind v0.18.0 go1.20 linux/amd64
docker info or podman info:

Server:
 Containers: 3
  Running: 3
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 20.10.23
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 31aa4358a36870b21a992d3ad2bef29e1d693bec
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 6.1.15-1rodete3-amd64
 Operating System: Debian GNU/Linux rodete
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.07GiB
 Name: howardjohn-glaptop
 ID: T75S:255Q:OFTL:ZM4Y:BRIG:KCXM:6SG6:FNTW:QN5L:BAQU:SRND:SC5F
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Client:
 Context: default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.10.0-docker)
  scan: Docker Scan (Docker Inc., v0.23.0)

Server:
 Containers: 4
  Running: 3
  Paused: 0
  Stopped: 1
 Images: 47
 Server Version: 20.10.23
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 31aa4358a36870b21a992d3ad2bef29e1d693bec
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 6.1.15-1rodete3-amd64
 Operating System: Debian GNU/Linux rodete
 OSType: linux
 Architecture: x86_64
 CPUs: 48
 Total Memory: 188.9GiB
 Name: howardjohn.c.googlers.com
 ID: HG5E:A5RL:NQMU:2QU2:JXFQ:LNV7:TVS3:CXBP:4UX2:EODZ:X62Q:L4U4
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: howardjohn
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false