opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0

docker-runc init failed on centos 7.6 xfs XFS: runc:[1:CHILD](3580) possible memory allocation deadlock in kmem_zone_alloc (mode:0x82d0) #2039

Open imatespl opened 5 years ago

imatespl commented 5 years ago

docker-runc init failed. The console prints the following message in a loop:

XFS: runc:[1:CHILD](3580) possible memory allocation deadlock in kmem_zone_alloc (mode:0x82d0)

cat /proc/3580/stack
[] congestion_wait+0x82/0x110
[] kmem_zone_alloc+0x8c/0x130 [xfs]
[] xfs_trans_alloc+0x6d/0x140 [xfs]
[] xfs_inactive_ifree+0x55/0x230 [xfs]
[] xfs_inactive+0x8b/0x130 [xfs]
[] xfs_fs_destroy_inode+0x95/0x190 [xfs]
[] destroy_inode+0x3b/0x60
[] evict+0x115/0x180
[] iput+0xfc/0x190
[] __dentry_kill+0x120/0x180
[] dput+0xb0/0x160
[] drop_mountpoint+0x16/0x30
[] pin_kill+0x7d/0x100
[] group_pin_kill+0x21/0x30
[] namespace_unlock+0x71/0x80
[] drop_collected_mounts+0x54/0x60
[] put_mnt_ns+0x24/0x30
[] create_new_namespaces+0x165/0x180
[] unshare_nsproxy_namespaces+0x5a/0xc0
[] SyS_unshare+0x173/0x2e0
[] system_call_fastpath+0x22/0x27
[] 0xffffffffffffffff

Memory usage is low:

Tasks: 240 total, 1 running, 239 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.1 us, 0.8 sy, 0.0 ni, 73.5 id, 24.5 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32780772 total, 12619844 free, 16482460 used, 3678468 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 10126064 avail Mem

ps -aux --forest
root 3558 0.0 0.0 7488 2804 ? Sl Apr02 0:21 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemo
root 3572 0.0 0.0 138832 7832 ? Sl Apr02 0:00 \_ docker-runc --root /var/run/docker/runtime-runc/moby --log /run/docker/conta
root 3579 0.0 0.0 18388 4348 ? S Apr02 0:00 \_ docker-runc init
root 3580 1.3 0.0 18388 2384 ? D Apr02 197:00 \_ docker-runc init

System: 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
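
The `ps` listing above shows PID 3580 stuck in state `D` (uninterruptible sleep). As a minimal sketch for spotting such tasks on an affected node, the following reads the per-process `stat` files under `/proc`. The function names `parse_stat` and `uninterruptible_tasks` are illustrative, not part of any tool mentioned in this thread, and the sketch only looks at each process's main thread.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain ')' or spaces
    # (for example "runc:[1:CHILD]"), so locate the LAST ')' instead of
    # splitting the whole line on whitespace.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren].strip())
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for processes whose main thread is in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # the process exited, or the line could not be parsed
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling `uninterruptible_tasks()` on the node returns a list of `(pid, comm)` pairs; a `runc:[1:CHILD]` entry that stays in the list for a long time would match the symptom reported here.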

imatespl commented 5 years ago

docker-runc -version
runc version 1.0.0-rc5+dev
commit: 69663f0bd4b60df09991c08812a60108003fa340
spec: 1.0.0

cyphar commented 5 years ago

That looks like an XFS bug to me, and I would suggest reporting it to CentOS. It's happening when we are creating a new mount namespace with unshare(CLONE_NEWNS).
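
For readers unfamiliar with the call, the `SyS_unshare` frame in the stack trace corresponds to the `unshare(2)` system call. The sketch below shows the same call made from Python through `ctypes`. It is only an illustration of the syscall, not runc's own code path, and the helper name `unshare_mount_namespace` is made up for this example. The call needs `CAP_SYS_ADMIN`, so it is expected to fail with `EPERM` for an unprivileged user.

```python
import ctypes
import os

CLONE_NEWNS = 0x00020000  # value from <sched.h>: request a new mount namespace


def unshare_mount_namespace():
    """Move the calling process into a private copy of its mount namespace."""
    # CDLL(None) resolves symbols from the already-loaded C library on Linux.
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(CLONE_NEWNS) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

On an affected kernel, this is the point at which the reported task would block, according to the stack trace in the original report.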

Aisuko commented 5 years ago

I hit the same issue. The machine is a Kubernetes worker node running Red Hat Enterprise Linux Server 7.5 (Maipo), kernel 3.10.0-1062.el7.x86_64, docker://19.3.2. This issue can cause PLEG and the kubelet to stop working.

runc version 1.0.0-rc8
commit: 425e105d5a03fabd737a126ad93d62a9eeede87f
spec: 1.0.1-dev

brian-arms commented 4 years ago

I've also run into this issue; similar to @Aisuko, it presented on my Kubernetes worker node, which also showed PLEG and Kubelet failures. Node is running RHEL 7.6, Docker 18.09.9.

strgrb commented 4 years ago

Has anyone found the cause? I have the same issue with Kubernetes 1.16.3, Docker 19.03.3, containerd 1.2.10, nvidia 1.0.0-rc8+dev, and docker-init 0.18.0.

ddl-rolandsugars commented 4 years ago

@strgrb I've run into this issue as well. It looks like it is fixed in newer kernel versions, and it may be related to https://github.com/opencontainers/runc/issues/1725 and https://bugzilla.redhat.com/show_bug.cgi?id=1507149

What OS and OS version are you running?

strgrb commented 4 years ago

@ddl-rolandsugars I use CentOS 7.6 with kernel 3.10.0-957. I don't think my problem is related to #1725, because I can't see kernel messages like 'SLUB: Unable to allocate memory on node'. I set vm.lowmem_reserve_ratio="1 256 32" to reserve more memory for the DMA zone, and I have not seen this error for several weeks. However, I don't know whether this is the correct solution.
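
To check whether a node already carries the setting described above, a small sketch like the following can compare the live value with the one from this comment. It assumes the sysctl is exposed at `/proc/sys/vm/lowmem_reserve_ratio`; the helper names are illustrative. As far as the kernel's sysctl documentation describes it, the values act as divisors, so a smaller number reserves more memory in that zone. Whether `1 256 32` is appropriate for a given machine is not established in this thread.

```python
def parse_ratio(text):
    """Parse the whitespace-separated integers of vm.lowmem_reserve_ratio."""
    return [int(field) for field in text.split()]


def read_lowmem_reserve_ratio(path="/proc/sys/vm/lowmem_reserve_ratio"):
    """Read the current value from procfs (the number of fields varies by kernel)."""
    with open(path) as f:
        return parse_ratio(f.read())


def matches(current, wanted="1 256 32"):
    """Return True if the current ratios equal the wanted setting."""
    return current == parse_ratio(wanted)
```

For example, `matches(read_lowmem_reserve_ratio())` reports whether the workaround value is in effect on the running kernel.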

ddl-rolandsugars commented 4 years ago

@strgrb What is the storage device you're using?

strgrb commented 4 years ago

@ddl-rolandsugars An SSD for / and, on some machines, another SSD for /var.

ddl-rolandsugars commented 4 years ago

@strgrb My bad, I meant the storage driver. If you run docker info, it should tell you. I think you're probably using devicemapper?

Example output:


$ docker info
Client:
 Debug Mode: false

Server:
 Containers: 2
  Running: 0
  Paused: 0
  Stopped: 2
 Images: 5
 Server Version: 19.03.13
 Storage Driver: overlay2                          <= this.
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.19.76-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 1.944GiB
 Name: docker-desktop
 ID: RMQE:67ZV:WKCO:PNIS:FD2M:ON2P:HVYC:DSLI:5S7R:NEBG:RVDX:XTG7
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: gateway.docker.internal:3128
 HTTPS Proxy: gateway.docker.internal:3129
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

strgrb commented 4 years ago

@ddl-rolandsugars My storage driver is overlay2
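
When collecting this information from many nodes, the relevant fields can be pulled out of saved `docker info` text with a short script. This is a minimal sketch that assumes the plain-text layout shown in the example output earlier in the thread; `storage_info` and the dictionary keys `driver` and `backing_fs` are names invented for the example.

```python
def storage_info(docker_info_text):
    """Extract the storage driver and backing filesystem from `docker info` text."""
    wanted = {"Storage Driver": "driver", "Backing Filesystem": "backing_fs"}
    result = {}
    for line in docker_info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in wanted:
            # Keep only the first token so trailing annotations are ignored.
            tokens = value.split()
            result[wanted[key]] = tokens[0] if tokens else ""
    return result
```

Feeding it the example output above would yield `overlay2` as the driver and `extfs` as the backing filesystem. Since this report concerns XFS, the backing filesystem is the field to compare across affected nodes.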

ThinkMo commented 3 years ago

Update the kernel to 3.10.0-1062.el7.x86_64 and disable kmem accounting by adding cgroup.memory=nokmem to the boot command line. See also https://access.redhat.com/solutions/532663
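
To confirm on a running node that both parts of this suggestion are in place, a sketch along these lines could be used. It assumes the boot parameters are readable from `/proc/cmdline` and that the kernel release string follows the `3.10.0-<build>` pattern seen in this thread; the function names are illustrative.

```python
import re


def has_boot_param(cmdline, param="cgroup.memory=nokmem"):
    """Return True if the exact parameter appears on the kernel command line."""
    return param in cmdline.split()


def el7_build(release):
    """Return the build number of a 3.10.0-<build> release string, else None."""
    match = re.match(r"3\.10\.0-(\d+)", release)
    return int(match.group(1)) if match else None


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()
```

A check such as `has_boot_param(read_cmdline())` together with `el7_build(os.uname().release) >= 1062` (after `import os`, and only when `el7_build` does not return `None`) would cover both conditions.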