Closed: orelmisan closed this issue 1 year ago
@xpivarc
Docker version: (use docker info): 20.10.21
Please add the actual docker info output.
Can we please create minimal configurations for reproducing? Most of the reproducing configuration looks unrelated and probably unnecessary. What is the minimum required to reproduce this?
Also yes, we need the rest of the docker info output.
Hey @stmcginnis @BenTheElder ,
The minimum configuration is to enable CPU manager.
I can reproduce with (docker info):
Server:
Containers: 9
Running: 3
Paused: 0
Stopped: 6
Images: 23
Server Version: 20.10.17
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
Default Runtime: runc
Init Binary: docker-init
containerd version: 0197261a30bf81f1ee8e6a4dd2dea0ef95d67ccb
runc version: v1.1.3-0-g6724737
init version: de40ad0
Security Options:
seccomp
Profile: default
selinux
cgroupns
Kernel Version: 5.18.17-200.fc36.x86_64
Operating System: Fedora Linux 36 (Workstation Edition)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.2GiB
Name: localhost.localdomain
Docker Root Dir: /home/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
"feature-gates": "CPUManager=true" "cpu-manager-policy": "static"
I'll try to get back to digging into this, but it's not surprising to me that static CPU allocation doesn't work with nested containers. These cluster nodes do not have exclusive access to the kernel and resource limits are better tested via some other solution (e.g. VMs).
@BenTheElder This worked fine for us until the change https://github.com/kubernetes-sigs/kind/pull/2737 where 1.24/25 switches to systemd cgroup driver. I was wondering if it would be possible to go back or provide an opt-out option.
Note: The problem is that devices are not accessible. I did not check whether CPU requests & limits are enforced.
We will try to find some time and debug what's going on in detail in the hope of fixing this.
@BenTheElder This worked fine for us until the change https://github.com/kubernetes-sigs/kind/pull/2737 where 1.24/25 switches to systemd cgroup driver.
Now that is surprising 👀
I was wondering if it would be possible to go back or provide an opt-out option.
It's possible to override this with config patches to containerd and kubeadm/kubelet. However, the ecosystem is moving towards cgroups v2 only (not sure when, I expect sometime next year), and on cgroups v2 I haven't found anyone running CI without the systemd backend, which is generally recommended.
If we've regressed versus the cgroupfs driver, we should fix that. Unfortunately I don't personally have much time at the moment :/
Hi @BenTheElder have you had a chance to look into this issue?
No I have not. Kubernetes Project Infrastructure sustainability and Steering Committee related things have eaten most of my time lately.
If and when I do I will comment here.
I am seeing the same issue:
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "fc487f320c6f37e3fa43ce201591370cee2e43567bf526ba3d15250955f84390": OCI runtime exec failed: exec failed: unable to start container process: open /dev/ptmx: operation not permitted: unknown
Here is some more info on my setup:
CPUManager is not enabled.
For CI with multiple Kubernetes versions in kind: versions < 1.24 work fine, 1.24 fails with this error.
It seems to affect all devices. We see errors like this for jobs running inside affected pods:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.9/site-packages/ansible/executor/process/worker.py", line 148, in run
sys.stdout = sys.stderr = open(os.devnull, 'w')
PermissionError: [Errno 1] Operation not permitted: '/dev/null'
stat looks normal
File: /dev/null
Size: 0 Blocks: 0 IO Block: 4096 character special file
Device: 50007ah/5243002d Inode: 6 Links: 1 Device type: 1,3
Access: (0666/crw-rw-rw-) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-12-22 06:26:20.656363572 +0000
Modify: 2022-12-22 06:26:20.656363572 +0000
Change: 2022-12-22 06:26:20.656363572 +0000
It doesn't happen immediately; it only appears around 20 minutes after the cluster is started.
I ran into the same issue with kind 0.18.0 (which I tried because it was the first kind release compatible with Kubernetes 1.26, which has CPUManager as GA), reproducing with the following minimal YAML:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |-
kind: KubeletConfiguration
cpuManagerPolicy: "static"
reservedSystemCPUs: "0,1,2,3,4,5,6"
nodes:
- role: control-plane
- role: worker
- role: worker
Pretty much any pod I scheduled had issues with permissions on /dev, sometimes /dev/null, sometimes /dev/ptmx. I only had these issues with /dev when I tried to set cpuManagerPolicy to static, as they don't appear with the default policy.
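For reference, a minimal probe we could use to watch for the breakage from inside a scheduled pod (a sketch; the pod name, image, and interval are arbitrary): keep writing to /dev/null and log when it starts failing.
# busybox sh prints "can't create /dev/null: Operation not permitted" once the
# device cgroup is broken; the fallback echo makes it easy to spot in the logs.
kubectl run devnull-probe --image=busybox --restart=Never -- \
  sh -c 'while true; do echo ok > /dev/null || echo "/dev/null broken"; sleep 30; done'
kubectl logs -f devnull-probe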
docker info contains:
Client:
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.10.2
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.16.0
Path: /usr/libexec/docker/cli-plugins/docker-compose
scan: Docker Scan (Docker Inc.)
Version: v0.23.0
Path: /usr/libexec/docker/cli-plugins/docker-scan
Server:
Containers: 5
Running: 5
Paused: 0
Stopped: 0
Images: 647
Server Version: 23.0.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 2456e983eb9e37e47538f59ea18f2043c9a73640
runc version: v1.1.4-0-g5fd4c4d
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.15.0-67-generic
Operating System: Ubuntu 22.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 64
Total Memory: 251.6GiB
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
It's a tricky bug: it seems like something is getting misconfigured, yet all pods etc. schedule as expected and only fail at runtime, so I am not even sure if there is any specific logging to look for in the first place.
/dev/null recently had a runc bug IIRC. kind v0.18 ships the latest runc release, but there's also the host environment to consider.
My co-maintainer Antonio has been out and would usually punt this particular type of issue my way.
I've been monitoring https://kubernetes.io/blog/2023/03/10/image-registry-redirect/ and Kubernetes is coming out on the other side now looking a lot more sustainable ...
I expect to be out for a breather next week, then a lot of the project including Antonio will be at KubeCon (not me unfortunately); after we're both around I'll be meeting with Antonio to review the backlog. We've worked on some other fixes for KIND since this issue was filed, but those were more clearly root-caused and in-scope (like the iptables incompatibility issue), and we've been getting them released.
cpuManagerPolicy: "static"
Seems to be the common thread. Kubernetes is not testing this with kind currently; SIG Node typically tests this with "real" cloud-based clusters. We'll have to do some digging. I'm not seeing this issue crop up without this configuration so far, so I'm a bit torn between the need to roll forward on what seems to be the supported and tested cgroups driver going forward, and switching back. Kubernetes CI is moving towards systemd + cgroup v2 and I'm not generally aware of any cgroup v2 CI without systemd cgroups.
Note: If you're doing configuration patching this advanced, you can patch to disable systemd cgroups in kubelet + containerd in the meantime.
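For anyone who needs that opt-out in the meantime, something like this should do it (an untested sketch; the containerd TOML section assumes the containerd 1.x CRI plugin layout used by current kind images):
cat <<EOF > kind-cgroupfs.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
# Switch containerd's runc runtime back to the cgroupfs driver.
containerdConfigPatches:
- |-
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = false
# Keep the kubelet's cgroup driver in sync with containerd.
kubeadmConfigPatches:
- |-
  kind: KubeletConfiguration
  cgroupDriver: cgroupfs
EOF
kind create cluster --config=kind-cgroupfs.yaml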
I got time to look at this, and here is what I found. First I created the pod and then ran kubectl exec -ti <pod_name> <command_that_doesnt_exist> in a loop. This allowed me to identify when the cgroup configuration gets ruined.
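Roughly the loop used (a sketch; the pod name and interval are arbitrary):
# A healthy pod fails with "executable file not found"; once the device cgroup
# is broken it fails with "open /dev/ptmx: operation not permitted" instead,
# which is what the loop waits for.
until kubectl exec qos-demo -- /command-that-does-not-exist 2>&1 | grep -q 'operation not permitted'; do
  date
  sleep 10
done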
It appears that once https://github.com/kubernetes/kubernetes/blob/64af1adaceba4db8d0efdb91453bce7073973771/pkg/kubelet/cm/cpumanager/cpu_manager.go#L513 is called, all devices become inaccessible to the container's cgroup.
In the case of kind I see this in the systemd log:
cri-containerd-6cbc6412df51daf51dc9922233b5b9b3e510b08f4df8a2dc9e9f8536b70fd4b9.scope: No devices matched by device filter.
whereas I don't see this on the working setup. Before diving into containerd/runc/systemd, I tried the latest image, since it has all these components up to date, and I can't reproduce the problem anymore.
So finally I just tried updating runc in the old image and it seems to be working.
Note: Not confident, but from a quick look I would say https://github.com/opencontainers/runc/commit/3b9582895b868561eb9260ac51b2ac6feb7798ae is the culprit. (This also explains the systemd log.)
So the only question left is whether we can update runc for 1.24+. @BenTheElder
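For reference, the runc swap can be done in place on a running node container, roughly like this (a sketch; the destination path is an assumption, check it with which runc inside the node first):
# Copy a newer runc (>= 1.1.5, which contains the suspected fix) into the node
# and restart containerd so newly created containers pick it up.
docker cp ./runc kind-control-plane:/usr/local/sbin/runc
docker exec kind-control-plane chmod +x /usr/local/sbin/runc
docker exec kind-control-plane systemctl restart containerd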
Can you try the release / images in https://github.com/kubernetes-sigs/kind/releases/tag/v0.18.0?
We're on runc 1.1.5 in the latest KIND release, which appears to contain https://github.com/opencontainers/runc/commit/3b9582895b868561eb9260ac51b2ac6feb7798ae
Can you try the release / images in https://github.com/kubernetes-sigs/kind/releases/tag/v0.18.0?
We're on runc 1.1.5 in the latest KIND release, which appears to contain opencontainers/runc@3b95828
Yes, that works just fine. Thank you. (Note to self: with the new release there are new images.)
Excellent! @orelmisan @Belpaire @smlx can you confirm if the latest release resolves this for you as well?
I'm attempting to minimally reproduce this being broken on v0.17 and confirm the runc upgrade solution in v0.18 without success so far:
I'm running this:
$HOME/kind-test.yaml:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |-
kind: InitConfiguration
nodeRegistration:
kubeletExtraArgs:
"feature-gates": "CPUManager=true"
"cpu-manager-policy": "static"
"kube-reserved": "cpu=500m"
"system-reserved": "cpu=500m"
kind create cluster --config=$HOME/kind-test.yaml
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: qos-demo
spec:
containers:
- name: qos-demo-ctr
image: nginx
resources:
limits:
memory: "200Mi"
cpu: "700m"
requests:
memory: "200Mi"
cpu: "700m"
EOF
kubectl exec -it qos-demo -- bash
Which works fine.
OK, you have to leave it running for a bit. I see this on the above-configured v0.17 cluster now, after trying again and waiting a few minutes before exec-ing again:
$ kubectl exec -it qos-demo -- bash
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "b0836e4a6b3a974ccf0ee2320a95d0c230a5cbfcfb5de41b07f19c820d2e0bf4": OCI runtime exec failed: exec failed: unable to start container process: open /dev/ptmx: operation not permitted: unknown
Whereas the same configuration on v0.18 does not have this even after a few minutes.
On v0.18 with @Belpaire's config from https://github.com/kubernetes-sigs/kind/issues/2999#issuecomment-1499076887, but brought down to a single node:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |-
kind: KubeletConfiguration
cpuManagerPolicy: "static"
reservedSystemCPUs: "0,1,2,3,4,5,6"
And using the test snippet above https://github.com/kubernetes-sigs/kind/issues/2999#issuecomment-1515356104
I'm not seeing the issue.
@Belpaire You mention:
I ran into the same issue with kind 0.18.0 (which I tried because it was the first kind release compatible with kubernetes 1.26 which has cpumanager as GA), reproducing with the following minimal yaml:
But we have 1.26 in https://github.com/kubernetes-sigs/kind/releases/tag/v0.17.0#new-features
Is there any chance you were using v0.17 and the 1.26.0 image? So far I can't reproduce, but v0.17 definitely has the issue described in https://github.com/kubernetes-sigs/kind/issues/2999#issue-1444163544, which appears to be fixed now in v0.18 images as outlined in https://github.com/kubernetes-sigs/kind/issues/2999#issuecomment-1501181087 / https://github.com/kubernetes-sigs/kind/issues/2999#issuecomment-1501218655.
@BenTheElder I retried it yesterday and it indeed seemed to bring up a cluster without issues; I must have gotten confused while testing different images and kind versions for our setup. I scheduled some pods etc. and didn't get any /dev/ptmx issues. The only difference was that my Ubuntu kernel went from 5.15.0-67-generic to 5.15.0-69-generic, but it seems very doubtful that had any impact. So I think I must have still been trying 0.17.0 or 0.18.0 with a wrong image somehow.
Thanks!
I believe we can close this now as fixed by the runc upgrade in v0.18+ images.
sorry this took so long 😅
Can confirm that this appears to have resolved my observed issues too. Thanks for the update!
What happened: Failed to exec into a Pod with QOS defined when CPU manager is enabled. After checking the cgroup configuration for the Pod, I see only
c 136:* rwm
is allowed.
What you expected to happen: I expected to be able to exec into the Pod, get a shell, and have the cgroup configuration set up correctly.
How to reproduce it (as minimally and precisely as possible):
Create a cluster with the following config file:
Create a Pod with QOS:
Try to exec into the Pod:
Any attempt to exec into other Pods fails from now on for the same reason.
Anything else we need to know?: SELinux is disabled.
This seems to be related to the change where kind uses systemd with 1.24/25 to manage cgroups.
I did not test whether this problem occurs without CPU manager.
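For context on the allowed rule above: c 136:* is the Unix98 pty slave major (/dev/pts/*), while /dev/ptmx is 5:2 and /dev/null is 1:3, so both fall outside the allow-list, matching the exec failures. A sketch of how that allow-list can be read on a cgroup v1 host (node name and cgroup layout are assumptions, adjust to where the pod's scope actually lands):
# Dump the device allow-lists of all pod cgroups inside the node container.
docker exec kind-control-plane sh -c \
  'find /sys/fs/cgroup/devices -path "*kubepods*" -name devices.list | xargs grep -H .'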
Environment:
kind version (use kind version): 0.17.0
Kubernetes version (use kubectl version): v1.25.3
Docker version (use docker info): 20.10.21
OS (e.g. from /etc/os-release): RHEL 8.5