kubernetes-sigs / kind

Kubernetes IN Docker - local clusters for testing Kubernetes
https://kind.sigs.k8s.io/
Apache License 2.0

How to use `cgroupfs` as the `cgroupDriver`? #3700

Open mbana opened 1 month ago

mbana commented 1 month ago

Info

$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"],
    "bip": "192.168.99.1/24",
    "default-shm-size": "1G",
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m",
        "max-file": "1"
    },
    "default-ulimits": {
        "memlock": {
            "hard": -1,
            "name": "memlock",
            "soft": -1
        },
        "stack": {
            "hard": 67108864,
            "name": "stack",
            "soft": 67108864
        }
    }
}
$ kind --version
kind version 0.23.0
$ docker --version
Docker version 26.1.4, build 5650f9b
$ docker info
Client: Docker Engine - Community
 Version:    26.1.4
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.14.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.27.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 81
  Running: 4
  Paused: 0
  Stopped: 77
 Images: 111
 Server Version: 26.1.4
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 nvidia
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: d2d58213f83a351ca8f528a95fbd145f5654e957
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.5.0-35-generic
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 31.33GiB
 Name: mbana-1
 ID: 26df3d83-eb15-4d8c-914e-4284e0aca1b6
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Config

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  # Control Plane
  - role: control-plane
    # Version list and SHA hashes available at https://github.com/kubernetes-sigs/kind/releases.
    image: &image kindest/node:v1.30.0@sha256:047357ac0cfea04663786a612ba1eaba9702bef25227a794b52890dd8bcd692e
    kubeadmConfigPatches:
      - |
        kind: KubeletConfiguration
        CgroupDriver: cgroupfs
        cgroupDriver: cgroupfs
        kubeletExtraArgs:
          CgroupDriver: cgroupfs
          cgroupDriver: cgroupfs
  # Misc worker node
  - role: worker
    image: *image
    kubeadmConfigPatches:
      - |
        kind: KubeletConfiguration
        CgroupDriver: cgroupfs
        cgroupDriver: cgroupfs
        kubeletExtraArgs:
          CgroupDriver: cgroupfs
          cgroupDriver: cgroupfs
  - &worker
    role: worker
    labels:
      kind.bana.io/nodes: e2e
    image: *image
    kubeadmConfigPatches:
      - |
        kind: KubeletConfiguration
        CgroupDriver: cgroupfs
        cgroupDriver: cgroupfs
        kubeletExtraArgs:
          CgroupDriver: cgroupfs
          cgroupDriver: cgroupfs
        ---
        kind: JoinConfiguration
        nodeRegistration:
          taints:
          - key: kind.bana.io/nodes
            effect: NoSchedule
  - *worker

Logs

These are the noteworthy logs:

...
---
CgroupDriver: cgroupfs
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: cgroupfs
cgroupRoot: /kubelet
evictionHard:
  imagefs.available: 0%
  nodefs.available: 0%
  nodefs.inodesFree: 0%
failSwapOn: false
imageGCHighThresholdPercent: 100
kind: KubeletConfiguration
kubeletExtraArgs:
  CgroupDriver: cgroupfs
  cgroupDriver: cgroupfs
---
...
This error is likely caused by:
    - The kubelet is not running
    - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
...

What gives? Why can't I use cgroupfs?

$ docker info -f {{.CgroupDriver}}
cgroupfs

stmcginnis commented 1 month ago

Is there something pointing to cgroupfs as the issue here?

I'm not 100% sure yaml anchors are supported. Or whether you need the config patches. I would start by simplifying things and just trying to create a single node cluster with default settings and see if there is an issue with your docker configuration or something else in your environment before adding multiple nodes and extra configuration.

If that fails, it would be useful to try again with kind create cluster --retain, kind export logs, then kind delete cluster. The exported logs should have a lot of detail that would help digging in to the actual root cause of the failure.
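
The sequence suggested above could be wrapped roughly as follows. This is only a sketch: the function name `collect_kind_logs`, the default cluster name `cgroup-debug`, and the default output directory are hypothetical; the `kind create cluster --retain`, `kind export logs`, and `kind delete cluster` commands are the ones named in the comment.

```shell
# Hypothetical helper: create a default cluster, keep the nodes if creation
# fails, export their logs, then clean up.
collect_kind_logs() {
  name="${1:-cgroup-debug}"   # cluster name (assumed)
  out="${2:-./kind-logs}"     # log export directory (assumed)

  # --retain keeps the node containers around even when creation fails,
  # so their logs can still be exported afterwards.
  kind create cluster --name "$name" --retain
  status=$?

  kind export logs "$out" --name "$name"
  kind delete cluster --name "$name"

  # Report whether cluster creation itself succeeded.
  return "$status"
}
```

Calling `collect_kind_logs cgroup-debug ./kind-logs` would leave the exported logs in `./kind-logs` for attaching to the issue.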

BenTheElder commented 1 month ago

What gives? Why can't I use cgroupfs?

The cgroup driver has to match in the CRI implementation (containerd here) and in kubelet.

Why are you using cgroupfs? KIND is pretty sensitive to cgroup configurations and we don't test with this.

mbana commented 1 month ago

Is there something pointing to cgroupfs as the issue here?

- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

I thought the line above indicated this.

I'm not 100% sure yaml anchors are supported. Or whether you need the config patches. I would start by simplifying things and just trying to create a single node cluster with default settings and see if there is an issue with your docker configuration or something else in your environment before adding multiple nodes and extra configuration.

There is nothing wrong with my Docker environment, I believe. I simply changed:

    "exec-opts": ["native.cgroupdriver=systemd"],

to

    "exec-opts": ["native.cgroupdriver=cgroupfs"],

If that fails, it would be useful to try again with kind create cluster --retain, kind export logs, then kind delete cluster. The exported logs should have a lot of detail that would help digging in to the actual root cause of the failure.

I can do that but I shared a log statement indicating that it thinks cgroups is disabled but it is not.

The cgroup driver has to match in the CRI implementation (containerd here) and in kubelet.

mmm ... I am using the nvidia-container-runtime. Its configuration is below:

$ cat /etc/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = true

#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"

Are you indicating that this is perhaps the source of the error?

Why are you using cgroupfs? KIND is pretty sensitive to cgroup configurations and we don't test with this.

I am deploying Slurm in Kubernetes and it uses cgroups as documented at https://slurm.schedmd.com/cgroups.html.

BenTheElder commented 1 month ago

The cgroup driver has to match in the CRI implementation (containerd here) and in kubelet.

mmm ... I am using the nvidia-container-runtime. Its configuration is below:

That's on your host. The configuration inside the kind nodes has to match for both containerd and kubelet; you're only patching kubelet in kind, and Docker on your host.
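
One way to see whether the two settings agree inside a node is sketched below. The helper names are hypothetical, and the file paths (`/etc/containerd/config.toml` and `/var/lib/kubelet/config.yaml`) and the default node name `kind-control-plane` are assumptions about a typical kind node, not something stated in this thread.

```shell
# Translate containerd's SystemdCgroup flag into the matching kubelet
# cgroupDriver value.
containerd_driver() {
  case "$1" in
    true)  echo systemd ;;
    false) echo cgroupfs ;;
    *)     echo unknown ;;
  esac
}

# Succeeds only if containerd's flag ($1) and kubelet's cgroupDriver ($2) agree.
drivers_match() {
  [ "$(containerd_driver "$1")" = "$2" ]
}

# Read both settings from a kind node container and compare them.
# Assumed paths: /etc/containerd/config.toml and /var/lib/kubelet/config.yaml.
check_node() {
  node="${1:-kind-control-plane}"
  flag=$(docker exec "$node" \
    sed -n 's/^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*//p' \
    /etc/containerd/config.toml | head -n 1)
  driver=$(docker exec "$node" \
    sed -n 's/^cgroupDriver:[[:space:]]*//p' \
    /var/lib/kubelet/config.yaml | head -n 1)
  if drivers_match "$flag" "$driver"; then
    echo "$node: drivers match ($driver)"
  else
    echo "$node: MISMATCH (containerd SystemdCgroup=$flag, kubelet cgroupDriver=$driver)"
    return 1
  fi
}
```

Running `check_node kind-control-plane` against a cluster created with `--retain` would print which side disagrees, assuming those files exist at the paths above.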

Re: nvidia-container-runtime, check out https://github.com/klueska/nvkind

We're looking into CDI, but there are some complications with kind (https://github.com/kubernetes-sigs/kind/pull/3290). With the nvkind guide you can already use GPUs with kind as-is.

I can do that but I shared a log statement indicating that it thinks cgroups is disabled but it is not.

That log statement is useless. It's a generic hint: kubeadm is just suggesting reasons why kubelet might not have started, and it says nothing about why it actually didn't start. We cannot debug this without the exported logs, but I can already tell you from your configuration that containerd is not being configured for cgroupfs while kubelet is, which will not work. kind uses systemd for the cgroup driver, as recommended by SIG Node.

I am deploying Slurm in Kubernetes and it uses cgroups as documented a

cgroups != cgroupfs. The systemd cgroup driver still uses cgroups.

I don't work with Slurm, but skimming that page I don't see why it couldn't work under systemd. I'd recommend enabling cgroup v2 (unified).
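
The `docker info` output earlier in the thread already reports cgroup version 2. For checking a host directly, a small sketch follows; the function name `cgroup_version` is hypothetical, and it assumes GNU `stat`, which reports the filesystem type mounted at `/sys/fs/cgroup`.

```shell
# Classify the filesystem type mounted at /sys/fs/cgroup.
cgroup_version() {
  case "$1" in
    cgroup2fs) echo "v2" ;;       # unified hierarchy
    tmpfs)     echo "v1" ;;       # legacy or hybrid hierarchy
    *)         echo "unknown" ;;
  esac
}

# GNU stat: -f queries the filesystem, -c %T prints its type name.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || true)
echo "cgroup version: $(cgroup_version "$fstype")"
```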

BenTheElder commented 1 month ago

There's an example of patching the containerd config here: https://kind.sigs.k8s.io/docs/user/local-registry/

However, we do not test or support cgroupfs mode, so I'm not planning to add a guide for this to the docs. It would increase support issues for something that 99.99% of users should not do, and that their applications and Kubernetes usage should not be aware of. kind, Kubernetes, and systemd manage the cgroups, and we have to employ some workarounds to make this work properly.
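
For illustration only, a config that sets the driver on both sides might look like the sketch below. It is untested and, per the comment above, unsupported by kind. The file name `kind-cgroupfs.yaml` is hypothetical, and the containerd plugin table name is an assumption based on containerd's version 2 config layout; other runtime handlers in the node's containerd config, if any, may need the same change. Note that it uses a single lowercase `cgroupDriver` field and omits the `CgroupDriver` and `kubeletExtraArgs` keys from the original report, which do not appear to be `KubeletConfiguration` fields.

```shell
# Write a hypothetical kind config that selects cgroupfs for BOTH
# containerd and kubelet. Untested sketch, not a supported configuration.
cat > kind-cgroupfs.yaml <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
# containerd side: turn off the systemd cgroup driver for the runc runtime.
# The plugin table name below is an assumption (containerd config version 2).
containerdConfigPatches:
  - |-
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
      SystemdCgroup = false
# kubelet side: one lowercase cgroupDriver field in KubeletConfiguration.
kubeadmConfigPatches:
  - |
    kind: KubeletConfiguration
    cgroupDriver: cgroupfs
nodes:
  - role: control-plane
EOF

echo "wrote kind-cgroupfs.yaml; try: kind create cluster --config kind-cgroupfs.yaml --retain"
```

Starting from a single control-plane node keeps the experiment small; if creation fails, the retained node's logs should show which component rejected the setting.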