Closed felipecrs closed 2 years ago
Thanks for filing this one, @felipecrs.
I don't think we've seen this one before, so I'm wondering if it's triggered by a recent change in kind ...
Can you please indicate which kind release you are running so that we can try to replicate it?
I just updated the issue with this information. It's the newest, 0.11.1.
Ah ok.
I suspect that's probably related to the failure you're seeing, as kind relies on the k8s 1.21 release, which we don't fully support yet.
Can you try to reproduce with a kind release that uses the k8s 1.20 release? I think kind v0.10 is the right one.
Sure
Actually, hold on. Let me think a bit more about the problem; the k8s release may not be playing any role here, as we're talking about the "inner" k8s instance and not the outer one.
I'll try to reproduce when I have some cycles.
I also thought so, but here are the logs anyway:
$ kind create cluster --image kindest/node:v1.20.7@sha256:cbeaf907fc78ac97ce7b625e4bf0de16e3ea725daf6b04f930bd14c67c671ff9
Creating cluster "kind" ...
⢎⡱ Ensuring node image (kindest/node:v1.20.7) 🖼 WARN[2021-10-09T05:35:32.773861358Z] reference for unknown type: digest="sha256:cbeaf907fc78ac97ce7b625e4bf0de16e3ea725daf6b04f930bd14c67c671ff9" remote="docker.io/kindest/node@sha256:cbeaf907fc78ac97ce7b625e4bf0de16e3ea725daf6b04f930bd14c67c671ff9"
⢎⡱ Ensuring node image (kindest/node:v1.20.7) 🖼 ERRO[2021-10-09T05:36:09.525430805Z] Could not add route to IPv6 network fc00:f853:ccd:e793::1/64 via device br-9ef6aedd0046: network is down
✓ Ensuring node image (kindest/node:v1.20.7) 🖼
⢎⡱ Preparing nodes 📦 time="2021-10-09T05:36:21.839367999Z" level=info msg="starting signal loop" namespace=moby path=/run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/5913e55ff4c3dfcd410db742a0c685a20b1d43079cc2b4df2e5519b6b00fb2ef pid=8862
✓ Preparing nodes 📦
INFO[2021-10-09T05:36:22.683712913Z] ignoring event container=5913e55ff4c3dfcd410db742a0c685a20b1d43079cc2b4df2e5519b6b00fb2ef module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
INFO[2021-10-09T05:36:22.683837061Z] shim disconnected id=5913e55ff4c3dfcd410db742a0c685a20b1d43079cc2b4df2e5519b6b00fb2ef
ERRO[2021-10-09T05:36:22.683877169Z] copy shim log error="read /proc/self/fd/14: file already closed"
time="2021-10-09T05:36:23.075052159Z" level=info msg="starting signal loop" namespace=moby path=/run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/5913e55ff4c3dfcd410db742a0c685a20b1d43079cc2b4df2e5519b6b00fb2ef pid=9191
⢎⡱ Writing configuration 📜 INFO[2021-10-09T05:36:23.826452876Z] shim disconnected id=5913e55ff4c3dfcd410db742a0c685a20b1d43079cc2b4df2e5519b6b00fb2ef
INFO[2021-10-09T05:36:23.826512625Z] ignoring event container=5913e55ff4c3dfcd410db742a0c685a20b1d43079cc2b4df2e5519b6b00fb2ef module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
ERRO[2021-10-09T05:36:23.826598529Z] copy shim log error="read /proc/self/fd/14: file already closed"
⢎⡱ Writing configuration 📜 ERRO[2021-10-09T05:36:24.160774319Z] Error setting up exec command in container kind-control-plane: Container 5913e55ff4c3dfcd410db742a0c685a20b1d43079cc2b4df2e5519b6b00fb2ef is not running
✗ Writing configuration 📜
ERROR: failed to create cluster: failed to generate kubeadm config content: failed to get kubernetes version from node: failed to get file: command "docker exec --privileged kind-control-plane cat /kind/version" failed with error: exit status 1
Command Output: Error response from daemon: Container 5913e55ff4c3dfcd410db742a0c685a20b1d43079cc2b4df2e5519b6b00fb2ef is not running
Which is the same thing.
Ok, a better insight:
jenkins@dind:~$ docker run kindest/node:v1.20.7@sha256:cbeaf907fc78ac97ce7b625e4bf0de16e3ea725daf6b04f930bd14c67c671ff9
time="2021-10-09T05:37:41.783364868Z" level=info msg="starting signal loop" namespace=moby path=/run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/192bbf079c6ad7f6e3be91eb4de17e0ac903b9a7fa5bd1587ac8051c4e303650 pid=9647
INFO: running in a user namespace (experimental)
ERROR: UserNS: cgroup v2 needs to be enabled
INFO[2021-10-09T05:37:42.237079084Z] shim disconnected id=192bbf079c6ad7f6e3be91eb4de17e0ac903b9a7fa5bd1587ac8051c4e303650
INFO[2021-10-09T05:37:42.237111730Z] ignoring event container=192bbf079c6ad7f6e3be91eb4de17e0ac903b9a7fa5bd1587ac8051c4e303650 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
ERRO[2021-10-09T05:37:42.237135275Z] copy shim log error="read /proc/self/fd/14: file already closed"
The logs are mixed with my dockerd's, sorry. The ones that start with ERRO[ or INFO[ come from dockerd.
Ok, it works with kind v0.10.0. It must have something to do with their adoption of cgroups v2, as referenced here and here.
$ kind create cluster
Creating cluster "kind" ...
⢎⡱ Ensuring node image (kindest/node:v1.20.2) 🖼 WARN[2021-10-09T05:47:47.292449520Z] reference for unknown type: digest="sha256:8f7ea6e7642c0da54f04a7ee10431549c0257315b3a634f6ef2fecaaedb19bab" remote="docker.io/kindest/node@sha256:8f7ea6e7642c0da54f04a7ee10431549c0257315b3a634f6ef2fecaaedb19bab"
✓ Ensuring node image (kindest/node:v1.20.2) 🖼
⢎⡱ Preparing nodes 📦 time="2021-10-09T05:48:21.718270811Z" level=info msg="starting signal loop" namespace=moby path=/run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/7df7dd4c50f1fc346be29aeb4f1a8723a9613848539421d86d83e8374f53ec21 pid=17255
✓ Preparing nodes 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:
kubectl cluster-info --context kind-kind
Thanks for using kind! 🙂
It has nothing to do with the K8s version, though.
Yes, that's what I was afraid of. I'll look into it when I have a chance. I hope you can rely on v0.10 for now.
As can be seen here, the problem seems to be that kind is coupling execution within a user-ns with the presence of a cgroup-v2 system. As you know, Sysbox enforces the utilization of user-ns for security reasons, and we do that regardless of whether cgroup v1 or v2 is in the picture. That's to say that I'm not sure what we can do here given this kind constraint ...
Note that the latest kind works perfectly fine on cgroup-v2 systems:
$ lsb_release -cd
Description: Debian GNU/Linux 11 (bullseye)
Codename: bullseye
$ sudo ls -lrt /sys/fs/cgroup/cgroup.controllers
-r--r--r-- 1 root root 0 Oct 9 13:25 /sys/fs/cgroup/cgroup.controllers
$ docker run --runtime=sysbox-runc -it --rm --name test-1 --hostname test-1 ghcr.io/nestybox/ubuntu-focal-systemd-docker:latest
...
admin@test-1:~$ ./kind create cluster --retain
Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:v1.21.1) 🖼
✓ Preparing nodes 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:
kubectl cluster-info --context kind-kind
Thanks for using kind! 🙂
admin@test-1:~$
@felipecrs, while this 'rootless-identification' logic issue is fixed on the kind side, would you mind relying on our own compiled kindest/node images?
We would make very simple adjustments to the official kind images to bypass the problem. That's how we have addressed similar issues in the past, and we may need to do it again to work around this issue for now on cgroup-v1 setups.
This is not a compromise I can make. I have no control over what the other developers do in their CI pipelines, and I would prefer to push this change more transparently.
Since you said it worked on an environment with cgroup v2, I'll investigate whether I can tweak my nodes to support it instead.
Reading https://rootlesscontaine.rs/getting-started/common/cgroup2/, I think it would not be such a good idea to try to enable cgroup v2 on my Ubuntu 18.04 nodes. They recommend systemd version 244, while Ubuntu 18.04 ships 237.
It will probably come enabled by default on Ubuntu 22.04, and on 20.04 the systemd version is good enough to enable manually.
I'll try to look into kind's issue to see if I can fix it and propose a PR.
As can be seen here, the problem seems to be that kind is coupling the execution within a user-ns with the presence of a cgroup-v2 system. As you know, Sysbox enforces the utilization of user-ns for security reasons, and we do that regardless of cgroup-v1 or v2 being on the picture.
I can't fully understand the problem. Is it possible to run docker with userns on cgroups v1? AFAIK, to run kind rootless you need cgroups v2, so why is that check wrong?
I will let @rodnymolina answer the first question.
But for the second: sysbox already does the needful to encapsulate the root user in the inner container, so that kind can work as if it were root. The actual problem is that kind detects a rootless environment while it shouldn't, although, as @rodnymolina said, it would work if cgroup v2 were present.
The actual problem is that kind is detecting a rootless environment while it shouldn't, despite, as @rodnymolina said, it would work if cgroupv2 was present.
That is the part I don't understand; rootless requires cgroup v2.
By rootless, is this what you mean: https://docs.docker.com/engine/security/rootless/?
If so: I'm not running the Docker daemon on the host as rootless, and neither is the Docker daemon in the container. Both are running in normal, non-rootless mode.
This is the code that detects whether kind runs in a user namespace: https://github.com/kubernetes-sigs/kind/blob/main/images/base/files/usr/local/bin/entrypoint#L21-L25
# If /proc/self/uid_map 4294967295 mappings, we are in the initial user namespace, i.e. the host.
# Otherwise we are in a non-initial user namespace.
# https://github.com/opencontainers/runc/blob/v1.0.0-rc92/libcontainer/system/linux.go#L109-L118
userns=""
if grep -Eqv "0[[:space:]]+0[[:space:]]+4294967295" /proc/self/uid_map; then
userns="1"
echo 'INFO: running in a user namespace (experimental)'
fi
Either kind's userns detection is broken (that would be a bug in kind), or something else is causing kind to "think" it is running as rootless ...
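To make the check above concrete, here is a small sketch that exercises kind's uid_map regex against sample /proc/self/uid_map contents. The Sysbox mapping values below are illustrative, not taken from a real host:

```shell
#!/bin/sh
# Sample uid_map contents: on the host (initial user namespace) there is a
# full-range identity mapping; inside a user namespace there is not.
host_map="0 0 4294967295"
sysbox_map="0 165536 65536"   # illustrative non-initial mapping

detect() {
  # Same regex kind's entrypoint uses: if no "0 0 4294967295" identity
  # mapping is present, we are inside a non-initial user namespace.
  if echo "$1" | grep -Eqv "0[[:space:]]+0[[:space:]]+4294967295"; then
    echo "user namespace"
  else
    echo "initial user namespace (host)"
  fi
}

detect "$host_map"     # -> initial user namespace (host)
detect "$sysbox_map"   # -> user namespace
```

This is why a Sysbox container trips the check even though dockerd itself is not rootless: the container runs in a non-initial user namespace by design.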
If there is an alternative way to check whether dockerd is running as rootless on the host, the kind CLI could obtain this information and supply it as an environment variable to the nodes to use during startup.
But I haven't found a proper way to do so yet; the output of docker version and docker info does not seem to contain this information.
@aojea, thanks for joining our conversation, appreciate your feedback ...
IMHO, the problem here is that KinD is coupling the semantics of rootless with those of unprivileged containers. A runtime could require full privileges to operate (i.e., run as root), and yet be able to generate unprivileged containers through the utilization of user namespaces. This is the model in which Sysbox operates -- similar to how Docker does it when running in userns-remap mode, or how LXC has done it since day one when creating unprivileged containers.
That's to say that KinD's root-init-userns detection logic seems correct to me; the problem I see is with the enforcement of cgroup v2 when user-ns is active. Sysbox is capable of enforcing cgroup-v1 limits, so I don't see why cgroup v2 must be required when user namespaces are detected.
Hi @felipecrs:
Reading https://rootlesscontaine.rs/getting-started/common/cgroup2/, I think it would not be such a good idea to try to enable cgroup v2 on my Ubuntu 18.04 nodes. They recommend systemd version 244, while Ubuntu 18.04 ships 237.
This may be fine, let me explain.
The configuration of cgroups for a container (either cgroups v1 or v2) can be done by having the container runtime directly program the cgroup filesystem (e.g., /sys/fs/cgroup), or by having the container runtime request systemd to manage the cgroup filesystem on its behalf. These are known as the "cgroupfs" and "systemd" cgroup drivers respectively.
In general the systemd cgroup driver approach is preferred, because it creates a single entity in the host managing the cgroups (i.e. systemd). But the cgroupfs driver works fine too.
Currently, when you install Sysbox on a Kubernetes cluster with sysbox-deploy-k8s, it also installs CRI-O as the runtime and configures it with the cgroupfs driver. In other words, systemd is not managing cgroups for the containers (though it does still manage cgroups for systemd services).
The fact that systemd is not managing the cgroups for the containers, coupled with the fact that systemd v244 is only needed for cgroup delegation (e.g., to allow containers to manage a cgroup subhierarchy), means that you should be able to configure cgroups v2 on your Ubuntu 18.04 hosts without problem.
In the near future, we will likely add logic to sysbox-deploy-k8s to determine the version of systemd on the host and, based on this, select the best cgroup driver (e.g., cgroupfs or systemd). For hosts that carry systemd >= v244, we would enable the systemd driver. Otherwise we would keep the cgroupfs approach.
Hope this clarifies.
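As a quick way to see which hierarchy a given node is on, the filesystem type mounted at /sys/fs/cgroup can be inspected: cgroup2fs indicates the unified (v2) hierarchy, while a v1 host typically has a tmpfs there with per-controller mounts underneath. A sketch:

```shell
#!/bin/sh
# Map the filesystem type at /sys/fs/cgroup to a cgroup version.
classify_cgroup() {
  case "$1" in
    cgroup2fs) echo "cgroup v2 (unified hierarchy)" ;;
    tmpfs)     echo "cgroup v1 (legacy hierarchy)" ;;
    *)         echo "unknown: $1" ;;
  esac
}

# On a live host:
classify_cgroup "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"

# Against sample values:
classify_cgroup cgroup2fs   # -> cgroup v2 (unified hierarchy)
classify_cgroup tmpfs       # -> cgroup v1 (legacy hierarchy)
```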
This is not a compromise I can make. I have no control over what the other developers do in their CI pipelines, and I would prefer to push this change more transparently.
I fully understand @felipecrs. Luckily it seems that we'll have a KinD fix for this one soon.
Btw, not sure if you've noticed @AkihiroSuda's suggestion in your PR; it looks like the ideal solution to me too.
@ctalledo thanks a bunch for the elaborate answer. I went ahead and enabled cgroup v2 on one of my nodes for testing purposes. To enable it, I:
Added systemd.unified_cgroup_hierarchy=1 to GRUB_CMDLINE_LINUX in /etc/default/grub
Ran sudo update-grub
Ran sudo reboot
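The GRUB edit above can be rehearsed on a scratch copy before touching the real file (the scratch path and sed pattern below are illustrative; the real change targets /etc/default/grub and needs sudo update-grub plus a reboot):

```shell
#!/bin/sh
# Rehearse the GRUB_CMDLINE_LINUX edit on a scratch file.
demo=/tmp/grub.demo
printf 'GRUB_CMDLINE_LINUX=""\n' > "$demo"

# Insert the cgroup v2 switch inside the quoted value.
sed -i 's/^GRUB_CMDLINE_LINUX="/&systemd.unified_cgroup_hierarchy=1/' "$demo"

cat "$demo"   # -> GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"
```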
I intentionally skipped "Enabling CPU, CPUSET, and I/O delegation" because I believe Sysbox won't require it, as Sysbox itself runs as root. I would appreciate your feedback on this decision.
And now, as expected, kind create cluster is working. I'll just have to confirm with my IT department that I'm allowed to make this change (otherwise it would get reverted in the next scheduled system patch, even though I have root access on the node).
@rodnymolina I continued the discussion in the PR suggestion.
Hi @felipecrs:
Those steps to enable cgroup v2 on your host look fine.
I intentionally skipped "Enabling CPU, CPUSET, and I/O delegation" because I believe Sysbox won't require it, as Sysbox itself runs as root. I would appreciate your feedback on this decision.
Since you have systemd < v244, it's best not to enable cpu/cpuset/io delegation. In any case, Sysbox won't use it right now because it will manage cgroups v2 directly via /sys/fs/cgroup (i.e., the cgroupfs driver).
Once you have a host with systemd >= v244, then you should enable cpu/cpuset/io delegation. This way, Sysbox may use the systemd cgroup driver (the preferred approach going forward).
Hope that helps!
Got it. Thanks!
Hi @felipecrs,
FYI, we updated the nestybox/kindestnode images to relax the cgroup v2 requirement. See the Dockerfile here.
This is a temporary work-around while we work to relax the cgroup v2 check in the official kind images (i.e., kindest/node).
Given that the cgroup v2 requirement when running in a user-ns came from KinD, coupled with the fact that KinD is about to update the official image to relax this requirement, plus the work-around described above, I'll close this issue.
Please re-open if you disagree.
Awesome. Yes, agreed.
I didn't find this as a known limitation in the docs, or maybe I didn't check well enough.
On my CI pipelines, we often use kind for spinning up ephemeral clusters for testing purposes. When I try to execute kind inside of a Sysbox-based pod, it fails with the following log:
The following is what the dockerd (inside of the pod) logs say during the failure:
More information:
On the node:
Which is the version built at https://github.com/nestybox/sysbox/issues/406#issuecomment-939190249.