kubernetes-sigs / kind

Kubernetes IN Docker - local clusters for testing Kubernetes
https://kind.sigs.k8s.io/
Apache License 2.0

cgroups misconfiguration #2999

Closed orelmisan closed 1 year ago

orelmisan commented 2 years ago

What happened: Failed to exec into a Pod with QoS defined when the CPU manager is enabled. After checking the cgroup configuration for the Pod, I see that only c 136:* rwm (the Unix98 PTY slave devices) is allowed.
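
One way to inspect these device rules (the node name and scope name below are examples) is to exec into the kind node and query the systemd scope of the affected container:

    # from the host: open a shell in the node container
    docker exec -it kind-worker bash
    # inside the node: find the container ID of the affected Pod
    crictl ps
    # show the device rules systemd applied to that container's scope
    systemctl show -p DeviceAllow cri-containerd-<container-id>.scope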

What you expected to happen: I expected to be able to exec into the Pod, get a shell, and have the cgroup configuration set up correctly.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster with the following config file:

    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    containerdConfigPatches:
    - |-
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry:5000"]
        endpoint = ["http://registry:5000"]
    nodes:
    - role: control-plane
    - role: worker
      kubeadmConfigPatches:
      - |-
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            "feature-gates": "CPUManager=true"
            "cpu-manager-policy": "static"
            "kube-reserved": "cpu=500m"
            "system-reserved": "cpu=500m"
      extraMounts:
      - containerPath: /var/log/audit
        hostPath: /var/log/audit
        readOnly: true
      - containerPath: /dev/vfio/
        hostPath: /dev/vfio/
    - role: worker
      kubeadmConfigPatches:
      - |-
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            "feature-gates": "CPUManager=true"
            "cpu-manager-policy": "static"
            "kube-reserved": "cpu=500m"
            "system-reserved": "cpu=500m"
      extraMounts:
      - containerPath: /var/log/audit
        hostPath: /var/log/audit
        readOnly: true
      - containerPath: /dev/vfio/
        hostPath: /dev/vfio/
    kubeadmConfigPatches:
    - |
      kind: ClusterConfiguration
      metadata:
        name: config
      etcd:
        local:
          dataDir: /tmp/kind-cluster-etcd
  2. Create a Pod with QOS:

    # cat << EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: qos-demo
    spec:
      containers:
      - name: qos-demo-ctr
        image: nginx
        resources:
          limits:
            memory: "200Mi"
            cpu: "700m"
          requests:
            memory: "200Mi"
            cpu: "700m"
    EOF
  3. Try to exec into the Pod:

    # kubectl exec -it qos-demo bash
    kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
    error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "b81e55425e38f6a88c79fe45269a07e12573c9589410dc7f4a220e6d9012bce7": OCI runtime exec failed: exec failed: unable to start container process: open /dev/ptmx: operation not permitted: unknown

From this point on, any attempt to exec into other Pods fails with the same error.

Anything else we need to know?: SELinux is disabled.

This seems to be related to the change where kind switched to systemd for cgroup management with 1.24/1.25.

This problem was not tested without CPU manager.

Environment:

orelmisan commented 2 years ago

@xpivarc

stmcginnis commented 2 years ago

Docker version: (use docker info): 20.10.21

Please add the actual docker info output.

BenTheElder commented 2 years ago

Can we please create minimal configurations for reproducing? Most of the reproducing configuration looks unrelated and probably unnecessary. What is the minimum required to reproduce this?

Also yes, we need the rest of docker info output.

xpivarc commented 2 years ago

Hey @stmcginnis @BenTheElder, the minimum configuration is to enable the CPU manager. I can reproduce with the following (docker info):

Server:
 Containers: 9
  Running: 3
  Paused: 0
  Stopped: 6
 Images: 23
 Server Version: 20.10.17
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 0197261a30bf81f1ee8e6a4dd2dea0ef95d67ccb
 runc version: v1.1.3-0-g6724737
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  selinux
  cgroupns
 Kernel Version: 5.18.17-200.fc36.x86_64
 Operating System: Fedora Linux 36 (Workstation Edition)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.2GiB
 Name: localhost.localdomain
 Docker Root Dir: /home/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

BenTheElder commented 2 years ago

    "feature-gates": "CPUManager=true"
    "cpu-manager-policy": "static"

I'll try to get back to digging into this, but it's not surprising to me that static CPU allocation doesn't work with nested containers. These cluster nodes do not have exclusive access to the kernel and resource limits are better tested via some other solution (e.g. VMs).

BenTheElder commented 2 years ago

See #1578 and https://github.com/kubernetes-sigs/kind/issues/2848

xpivarc commented 2 years ago

@BenTheElder This worked fine for us until the change https://github.com/kubernetes-sigs/kind/pull/2737 where 1.24/25 switches to systemd cgroup driver. I was wondering if it would be possible to go back or provide an opt-out option.

Note: the problem is that devices are not accessible. I did not check whether CPU requests & limits are enforced.
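
(For completeness, a quick way to check whether the CPU limit is applied at all on a cgroup v2 host like the one above would be to read the container's cpu.max; the pod name is the one from the example above:)

    # a 700m CPU limit should show up as "70000 100000"
    kubectl exec qos-demo -- cat /sys/fs/cgroup/cpu.max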

We will try to find some time and debug what's going on in detail in the hope of fixing this.

BenTheElder commented 2 years ago

@BenTheElder This worked fine for us until the change https://github.com/kubernetes-sigs/kind/pull/2737 where 1.24/25 switches to systemd cgroup driver.

Now that is surprising 👀

I was wondering if it would be possible to go back or provide an opt-out option.

It's possible to override this with config patches to containerd and kubeadm/kubelet. However, the ecosystem is moving towards cgroups v2 only (not sure when; I expect sometime next year), and on cgroups v2 I haven't found anyone running CI without the systemd cgroup backend, which is generally recommended.

If we've regressed versus the previous cgroupfs driver, we should fix that. Unfortunately I don't personally have much time at the moment :/

orelmisan commented 1 year ago

Hi @BenTheElder have you had a chance to look into this issue?

BenTheElder commented 1 year ago

No I have not. Kubernetes Project Infrastructure sustainability and Steering Committee related things have eaten most of my time lately.

If and when I do I will comment here.

smlx commented 1 year ago

I am seeing the same issue:

error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "fc487f320c6f37e3fa43ce201591370cee2e43567bf526ba3d15250955f84390": OCI runtime exec failed: exec failed: unable to start container process: open /dev/ptmx: operation not permitted: unknown

Here is some more info on my setup:

CPUManager is not enabled.

For CI we run multiple k8s versions in kind: versions < 1.24 work fine, 1.24 fails with this error.

It seems to affect all devices. We see errors like this for jobs running inside affected pods:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.9/site-packages/ansible/executor/process/worker.py", line 148, in run
    sys.stdout = sys.stderr = open(os.devnull, 'w')
PermissionError: [Errno 1] Operation not permitted: '/dev/null'

stat on /dev/null looks normal:

  File: /dev/null
  Size: 0           Blocks: 0          IO Block: 4096   character special file
Device: 50007ah/5243002d    Inode: 6           Links: 1     Device type: 1,3
Access: (0666/crw-rw-rw-)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-12-22 06:26:20.656363572 +0000
Modify: 2022-12-22 06:26:20.656363572 +0000
Change: 2022-12-22 06:26:20.656363572 +0000

It doesn't happen immediately; it only appears around 20 minutes after the cluster is started.

Belpaire commented 1 year ago

I ran into the same issue with kind 0.18.0 (which I tried because it was the first kind release compatible with Kubernetes 1.26, where CPUManager is GA), reproducing with the following minimal YAML:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
  - |-
    kind: KubeletConfiguration
    cpuManagerPolicy: "static"
    reservedSystemCPUs: "0,1,2,3,4,5,6"
nodes:
  - role: control-plane
  - role: worker
  - role: worker

Pretty much any pod I scheduled had issues with permissions on /dev, sometimes /dev/null, sometimes /dev/ptmx. I only had these issues with /dev when I set cpuManagerPolicy to static; they don't appear with the default policy.
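
A non-interactive way to probe for the breakage (since exec with a TTY already fails on /dev/ptmx) is to write to /dev/null from inside any running pod; the pod name below is a placeholder:

    # succeeds on a healthy pod, fails with "operation not permitted" once the device cgroup is broken
    kubectl exec <pod-name> -- sh -c 'echo test > /dev/null'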

docker info contains:

  Client:
Context:    default
Debug Mode: false
Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.16.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
Containers: 5
  Running: 5
  Paused: 0
  Stopped: 0
Images: 647
Server Version: 23.0.1
Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 2456e983eb9e37e47538f59ea18f2043c9a73640
runc version: v1.1.4-0-g5fd4c4d
init version: de40ad0
Security Options:
  apparmor
  seccomp
  Profile: builtin
  cgroupns
Kernel Version: 5.15.0-67-generic
Operating System: Ubuntu 22.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 64
Total Memory: 251.6GiB
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
  127.0.0.0/8
Live Restore Enabled: false

It's a tricky bug: it seems like something is getting misconfigured, yet all pods schedule as expected and only fail at runtime, so I am not even sure what specific logging to look for in the first place.

BenTheElder commented 1 year ago

/dev/null recently had a runc bug IIRC. kind v0.18 ships the latest runc release, but there's also the host environment to consider.

BenTheElder commented 1 year ago

My co-maintainer Antonio has been out and would usually punt this particular type of issue my way.

I've been monitoring https://kubernetes.io/blog/2023/03/10/image-registry-redirect/ and Kubernetes is coming out on the other side now looking a lot more sustainable ...

I expect to be out for a breather next week, then a lot of the project including Antonio will be at KubeCon (not me, unfortunately); after we're both around I'll be meeting with Antonio to review the backlog. We've worked on some other fixes for KIND since this issue was filed, but those were things that are more clearly root-caused and in-scope (like the iptables incompatibility issue), and we focused on getting those released.

cpuManagerPolicy: "static"

Seems to be the common thread. Kubernetes is not testing this with kind currently; SIG Node typically tests this with "real" cloud-based clusters. We'll have to do some digging. I'm not seeing this issue crop up without this configuration so far, so I'm a bit torn between the need to roll forward on what seems to be the supported and tested cgroups driver going forward, and switching back. Kubernetes CI is moving towards systemd + cgroup v2, and I'm not generally aware of any cgroup v2 CI without systemd cgroups.

Note: If you're doing configuration patching this advanced, you can patch kubelet + containerd to disable systemd cgroups in the meantime (a rough sketch follows).
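
An untested sketch of that opt-out, switching both containerd and kubelet back to the cgroupfs driver via a kind config (the keys used are the standard containerd CRI runc option and the KubeletConfiguration cgroupDriver field, but this combination is not something we test):

    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    containerdConfigPatches:
    - |-
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        SystemdCgroup = false
    kubeadmConfigPatches:
    - |-
      kind: KubeletConfiguration
      cgroupDriver: cgroupfs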

xpivarc commented 1 year ago

I got time to look at this and found the following. First I created the pod and then ran kubectl exec -ti <pod_name> <command_that_doesnt_exist> in a loop (roughly as sketched below). This allowed me to identify when the cgroup configuration gets broken. It appears that once https://github.com/kubernetes/kubernetes/blob/64af1adaceba4db8d0efdb91453bce7073973771/pkg/kubelet/cm/cpumanager/cpu_manager.go#L513 is called, all devices become inaccessible for the container cgroup.
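
(A loop along these lines; the pod name and interval are arbitrary:)

    # probe exec every 10s and report when the device cgroup breaks
    while true; do
      if kubectl exec -ti qos-demo -- /nonexistent-command 2>&1 | grep -q 'operation not permitted'; then
        echo "device cgroup broke at: $(date)"
        break
      fi
      sleep 10
    done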

In the case of kind I see (in the systemd log) cri-containerd-6cbc6412df51daf51dc9922233b5b9b3e510b08f4df8a2dc9e9f8536b70fd4b9.scope: No devices matched by device filter., whereas I don't see this on the working setup. Before diving into containerd/runc/systemd I tried the latest image, as it has all these components up to date, and I can't reproduce the problem anymore.
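
(That log line can be pulled from the host with something like the following; the node name is an example:)

    docker exec kind-worker journalctl --no-pager | grep -i 'no devices matched'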

So finally I just tried updating runc in the old image, and it seems to be working.
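
(A rough way to do the same swap on a running node, assuming a statically linked runc build in the current directory; the file and node names are placeholders:)

    # find where runc lives inside the node and overwrite it with the newer build
    RUNC_PATH=$(docker exec kind-worker sh -c 'command -v runc')
    docker cp ./runc.amd64 "kind-worker:${RUNC_PATH}"
    docker exec kind-worker chmod +x "${RUNC_PATH}"
    docker exec kind-worker runc --version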

Note: Not confident, but from a quick look I would say https://github.com/opencontainers/runc/commit/3b9582895b868561eb9260ac51b2ac6feb7798ae is the culprit. (This also explains the systemd log.)

So the only question left is whether we can update runc for 1.24+? @BenTheElder

BenTheElder commented 1 year ago

Can you try the release / images in https://github.com/kubernetes-sigs/kind/releases/tag/v0.18.0?

We're on runc 1.1.5 in the latest KIND release, which appears to contain https://github.com/opencontainers/runc/commit/3b9582895b868561eb9260ac51b2ac6feb7798ae
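
(This can be checked directly against a running node; the node container name is an example:)

    docker exec -it kind-control-plane runc --version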

xpivarc commented 1 year ago

Can you try the release / images in https://github.com/kubernetes-sigs/kind/releases/tag/v0.18.0?

We're on runc 1.1.5 in the latest KIND release, which appears to contain opencontainers/runc@3b95828

Yes, that works just fine. Thank you. (Note to self: with the new release there are new images.)

BenTheElder commented 1 year ago

Excellent! @orelmisan @Belpaire @smlx can you confirm if the latest release resolves this for you as well?

BenTheElder commented 1 year ago

I'm attempting a minimal reproduction of the breakage on v0.17, and confirmation of the runc upgrade fix in v0.18, without success so far:

I'm running with the following $HOME/kind-test.yaml:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
  - |-
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        "feature-gates": "CPUManager=true"
        "cpu-manager-policy": "static"
        "kube-reserved": "cpu=500m"
        "system-reserved": "cpu=500m"

kind create cluster --config=$HOME/kind-test.yaml

cat <<EOF  | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:                   
  containers:
  - name: qos-demo-ctr
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "700m"
      requests:
        memory: "200Mi"
        cpu: "700m"
EOF

kubectl exec -it qos-demo -- bash

Which works fine.

BenTheElder commented 1 year ago

OK, you have to leave it running for a bit. I see this on the v0.17 cluster configured above now, after trying again and waiting a few minutes before exec-ing again:

$ kubectl exec -it qos-demo -- bash
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "b0836e4a6b3a974ccf0ee2320a95d0c230a5cbfcfb5de41b07f19c820d2e0bf4": OCI runtime exec failed: exec failed: unable to start container process: open /dev/ptmx: operation not permitted: unknown

BenTheElder commented 1 year ago

Whereas the same configuration on v0.18 does not have this even after a few minutes.

BenTheElder commented 1 year ago

On v0.18 with @Belpaire's config from https://github.com/kubernetes-sigs/kind/issues/2999#issuecomment-1499076887, but brought down to a single node:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
  - |-
    kind: KubeletConfiguration
    cpuManagerPolicy: "static"
    reservedSystemCPUs: "0,1,2,3,4,5,6"

And using the test snippet above https://github.com/kubernetes-sigs/kind/issues/2999#issuecomment-1515356104

I'm not seeing the issue.

@Belpaire You mention:

I ran into the same issue with kind 0.18.0 (which I tried because it was the first kind release compatible with kubernetes 1.26 which has cpumanager as GA), reproducing with the following minimal yaml:

But we have 1.26 in https://github.com/kubernetes-sigs/kind/releases/tag/v0.17.0#new-features

Is there any chance you were using v0.17 and the 1.26.0 image? So far I can't reproduce, but v0.17 definitely has the issue described in https://github.com/kubernetes-sigs/kind/issues/2999#issue-1444163544, which appears to be fixed now in v0.18 images as outlined in https://github.com/kubernetes-sigs/kind/issues/2999#issuecomment-1501181087 / https://github.com/kubernetes-sigs/kind/issues/2999#issuecomment-1501218655.

Belpaire commented 1 year ago

@BenTheElder I retried it yesterday and it indeed seemed to bring up a cluster without issues; I must have gotten confused while testing different images and kind versions for our setup. I scheduled some pods and didn't get any /dev/ptmx issues. The only difference is that my Ubuntu kernel went from 5.15.0-67-generic to 5.15.0-69-generic, but it seems very doubtful that had any impact. So I think I must have still been trying with 0.17.0, or with 0.18.0 and the wrong image, somehow.

BenTheElder commented 1 year ago

Thanks!

I believe we can close this now as fixed by the runc upgrade in v0.18+ images.

sorry this took so long 😅

tobybellwood commented 1 year ago

Can confirm that this appears to have resolved my observed issues too. Thanks for the update!