Open mbana opened 1 month ago
Is there something pointing to cgroupfs as the issue here?
I'm not 100% sure YAML anchors are supported, or whether you need the config patches. I would start by simplifying things and just trying to create a single-node cluster with default settings, and see if there is an issue with your docker configuration or something else in your environment, before adding multiple nodes and extra configuration.
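To sketch that simplification: a default single-node cluster needs no config file at all (`kind create cluster`). If you want to keep a file around to grow later, a minimal one with no patches and no anchors could look like this (contents are illustrative, not taken from the issue):

```yaml
# Minimal single-node kind config: default settings,
# no configPatches, no YAML anchors.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
```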
If that fails, it would be useful to try again with `kind create cluster --retain`, `kind export logs`, then `kind delete cluster`. The exported logs should have a lot of detail that would help in digging into the actual root cause of the failure.
What gives? Why can't I use cgroupfs?
The cgroup driver has to match in the CRI implementation (containerd here) and in kubelet.
Why are you using cgroupfs? KIND is pretty sensitive to cgroup configurations and we don't test with this.
Is there something pointing to cgroupfs as the issue here?
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
I thought the line above indicated this.
I'm not 100% sure YAML anchors are supported, or whether you need the config patches. I would start by simplifying things and just trying to create a single-node cluster with default settings, and see if there is an issue with your docker configuration or something else in your environment, before adding multiple nodes and extra configuration.
There is nothing wrong with my Docker environment, I believe. I simply changed:
"exec-opts": ["native.cgroupdriver=systemd"],
to
"exec-opts": ["native.cgroupdriver=cgroupfs"],
If that fails, it would be useful to try again with `kind create cluster --retain`, `kind export logs`, then `kind delete cluster`. The exported logs should have a lot of detail that would help in digging into the actual root cause of the failure.
I can do that, but I shared a log statement indicating that it thinks `cgroups` is disabled, but it is not.
The cgroup driver has to match in the CRI implementation (containerd here) and in kubelet.
mmm ... I am using the `nvidia-container-runtime`. Its configuration is below:
```console
$ cat /etc/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = true
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"
```
Are you indicating that this is perhaps the error?
Why are you using cgroupfs? KIND is pretty sensitive to cgroup configurations and we don't test with this.
I am deploying Slurm in Kubernetes, and it uses cgroups as documented at https://slurm.schedmd.com/cgroups.html.
The cgroup driver has to match in the CRI implementation (containerd here) and in kubelet. mmm ... I am using the nvidia-container-runtime. Its configuration is below:
That's on your host. The configuration inside the kind nodes has to match for both containerd and kubelet; you're only patching kubelet in kind and docker on your host.
Re: nvidia-container-runtime, check out https://github.com/klueska/nvkind
We're looking into CDI, but there are some complications with kind (https://github.com/kubernetes-sigs/kind/pull/3290), and with the nvkind guide you can use GPUs with kind as-is.
I can do that but I shared a log statement indicating that it thinks cgroups is disabled but it is not.
That log statement is useless; it's just kubeadm giving suggestions as to why kubelet might not have started, and it doesn't say anything about why it actually didn't start. It's a generic hint. We cannot debug this without the exported logs, but I can already tell you from your configuration that containerd is not being configured for cgroupfs while kubelet is, which will not work. kind uses systemd for the cgroup driver, as recommended by SIG Node.
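To make "has to match" concrete, here is a sketch of the two settings that must agree inside a node, shown with the systemd driver that kind defaults to (file paths are the stock containerd/kubelet ones, for illustration only). In containerd's `/etc/containerd/config.toml`:

```toml
# containerd's CRI runc options must select the same cgroup driver as kubelet
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
```

and in the kubelet configuration:

```yaml
# KubeletConfiguration: must agree with containerd's choice above
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
```

If you patch only one side to cgroupfs, kubelet fails to start, which is consistent with the failure described in this issue.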
I am deploying Slurm in Kubernetes and it uses cgroups as documented at https://slurm.schedmd.com/cgroups.html.
cgroups != cgroupfs; the systemd cgroup driver still uses cgroups.
I don't work with Slurm, but skimming that page I don't see where it can't work under systemd. I'd recommend enabling cgroup v2 (the unified hierarchy).
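For enabling cgroup v2 on a distro that still boots into the legacy hierarchy, the usual approach is a kernel command-line flag. A hedged sketch, assuming a GRUB-based systemd distro (adjust for your bootloader):

```
# /etc/default/grub -- force the unified (v2) cgroup hierarchy
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"

# then regenerate the grub config and reboot:
#   sudo update-grub && sudo reboot
# verify after reboot:
#   stat -fc %T /sys/fs/cgroup/   (cgroup2fs means v2 is active)
```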
There's an example of patching the containerd config here: https://kind.sigs.k8s.io/docs/user/local-registry/
But we do not test or support cgroupfs mode, so I'm not planning to add a guide for this in the docs, as it would increase support issues for something 99.99% of users should not do, and their applications / Kubernetes usage should not be aware of it. kind / Kubernetes / systemd manages the cgroups, and we have to employ some workarounds to make this work properly.
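For reference, the patching mechanism that page demonstrates is kind's `containerdConfigPatches`; the registry example from the linked docs has this shape:

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
containerdConfigPatches:
- |-
  [plugins."io.containerd.grpc.v1.cri".registry]
    config_path = "/etc/containerd/certs.d"
```

The same mechanism could in principle carry other containerd TOML overrides, but per the comment above, cgroupfs mode is untested and unsupported.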
Info
Config
Logs
These are noteworthy logs:
What gives? Why can't I use `cgroupfs`?