kubernetes-sigs / kind

Kubernetes IN Docker - local clusters for testing Kubernetes
https://kind.sigs.k8s.io/
Apache License 2.0

v0.20.0 cannot create clusters on RHEL 7 #3311

Closed ncouse closed 5 months ago

ncouse commented 1 year ago

We are using RHEL7 VMs and have been successfully using these with KinD for quite some time (thank you).

When trying to upgrade to 0.20.0, the cluster fails to install.

Docker is using cgroups v1, and the kernel is a rather antique version, 3.10.0-1160.81.1.el7.x86_64.

The root problem seems to be the use of --cgroupns=private, which will not work in this environment. I presume the issue is with kernel support for the feature - I believe 4.6 is required.
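One way to confirm this diagnosis on a given host is to look for the cgroup namespace file under /proc, which exists only when the running kernel supports the feature. The sketch below is an illustration, not something kind itself runs. The helper names `version_ge`, `kernel_has_cgroupns`, and `report_cgroupns` are made up for this example, and the version comparison is only a rough heuristic, since distribution kernels may backport or omit features.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; none of these names come from kind.

# Return 0 if kernel release $1 is at least major.minor version $2.
# Only the leading "major.minor" of the release string is compared.
version_ge() {
    have_major=${1%%.*}
    rest=${1#*.}
    have_minor=${rest%%[!0-9]*}
    want_major=${2%%.*}
    want_minor=${2#*.}
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

# Return 0 if the running kernel exposes cgroup namespaces.
# This is the more reliable check of the two.
kernel_has_cgroupns() {
    [ -e /proc/self/ns/cgroup ]
}

# Print a one-line summary for the current host.
report_cgroupns() {
    kver=$(uname -r)
    if kernel_has_cgroupns; then
        echo "kernel $kver: cgroup namespaces available"
    elif version_ge "$kver" 4.6; then
        echo "kernel $kver: new enough, but cgroup namespaces are not exposed"
    else
        echo "kernel $kver: cgroup namespaces not available (older than 4.6)"
    fi
}
```

Calling `report_cgroupns` on the affected VM should report the feature as unavailable, which matches the runc error shown further down.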

While RHEL 7 is quite old, it is still in support, even with the old Kernel 3.

What happened:

KinD 0.20.0 fails to install on a RHEL 7 VM (kernel 3.10.0).

What you expected to happen:

Cluster to be created.

How to reproduce it (as minimally and precisely as possible):

$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
 ✗ Preparing nodes 📦
Deleted nodes: ["kind-control-plane"]
ERROR: failed to create cluster: command "docker run --name kind-control-plane --hostname kind-control-plane --label io.x-k8s.kind.role=control-plane --privileged --security-opt seccomp=unconfined --security-opt apparmor=unconfined --tmpfs /tmp --tmpfs /run --volume /var --volume /lib/modules:/lib/modules:ro -e KIND_EXPERIMENTAL_CONTAINERD_SNAPSHOTTER --detach --tty --label io.x-k8s.kind.cluster=kind --net kind --restart=on-failure:1 --init=false --cgroupns=private --volume /dev/mapper:/dev/mapper --publish=127.0.0.1:33587:6443/TCP -e KUBECONFIG=/etc/kubernetes/admin.conf kindest/node:v1.27.3@sha256:3966ac761ae0136263ffdb6cfd4db23ef8a83cba8a463690e98317add2c9ba72" failed with error: exit status 125
Command Output: WARNING: Your kernel does not support cgroup namespaces.  Cgroup namespace setting discarded.
83f54548a6e5f603f7eac309719806364f9d7c226c77849a07cd363773f40d4b
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: cgroup namespaces aren't enabled in the kernel: unknown.

Anything else we need to know?:

It works in other environments. The other environments I have access to support cgroups v2, with modern kernels.

Environment:

parkjeongryul commented 1 year ago

Same here.

BenTheElder commented 1 year ago

--cgroupns=private is a Docker 20.10.0+ feature (circa 2020), and even older for podman.

This is unfortunate :/

Switching to private cgroupns all the time makes the project's cgroups hackery a lot more reasonable.

However, we've seen other broken environments (alpine) and will be revisiting this requirement in the short term. Longer term I think cgroups v2 will be a hard requirement whether we want it or not, because the ecosystem is moving on.

ncouse commented 1 year ago

So unfortunately RHEL 7 is stuck on kernel 3.10, which means that the cgroupns=private feature cannot be used, even though we have a recent enough version of docker to support it.

Even trying to create any container with that feature fails:

$ docker run -ti --rm --cgroupns=private alpine
WARNING: Your kernel does not support cgroup namespaces.  Cgroup namespace setting discarded.
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: cgroup namespaces aren't enabled in the kernel: unknown.

On RHEL8, it uses kernel 4.18, which is sufficient for this feature, and that will also use cgroups v2.
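To tell which cgroup version a host is actually running, a common check is the filesystem type mounted at /sys/fs/cgroup. The following is a minimal sketch under that assumption; the function names `cgroup_version_for_fstype` and `host_cgroup_version` are invented for this example.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup version:
# cgroup2fs means the unified (v2) hierarchy, tmpfs means the v1 layout.
cgroup_version_for_fstype() {
    case "$1" in
        cgroup2fs) echo v2 ;;
        tmpfs)     echo v1 ;;
        *)         echo unknown ;;
    esac
}

# Report the cgroup version of the current host.
host_cgroup_version() {
    cgroup_version_for_fstype "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
}
```

Running `host_cgroup_version` prints `v1`, `v2`, or `unknown`. On Docker 20.10 or newer, `docker info --format '{{.CgroupVersion}}'` should report what the daemon detected.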

While I understand the project wants to use this feature, with good reason, would it be possible to have a flag to disable its use on older environments?

mybeloved commented 1 year ago

I have the same problem, would it be possible to have a flag to forbid '--cgroupns' or just let '--cgroupns=host' to ensure it works correctly?

BenTheElder commented 1 year ago

> While I understand the project wants to use this feature, with good reason, would it be possible to have a flag to disable its use on older environments?

If we're going to do this we might as well just do it by default without adding another flag, because we're still stuck supporting non-namespaced cgroups and all the issues those bring anyhow.

I'm somewhat (not entirely ... undecided) disinclined to support RHEL given it's no longer an environment we can replicate after the recent CentOS shenanigans from Red Hat. macOS is at least available in actions (and on maintainers' local machines), and Windows is currently similarly receiving primarily community support.

It looks like RHEL7 will be out of support in less than a year and RHEL 8 will seemingly not have this issue, which is something else to consider ... 🤔

Sorry, both Antonio and I have been out recently and there's a lot to catch up on.

anthosz commented 1 year ago

Hello,

FYI, same behaviour on Amazon Linux 2 (more or less based on RHEL7); not tested on Amazon Linux 2023.

Workaround in progress -> move to Ubuntu 22

BenTheElder commented 11 months ago

Seems likely https://github.com/kubernetes-sigs/kind/issues/3442 is related, given CentOS 7.9 which I assume roughly equals RHEL 7.

Kubernetes is likely going to stop supporting RHEL7 kernels anyhow; I would strongly recommend moving to a newer OS: https://github.com/kubernetes/kubernetes/issues/116799#issuecomment-1810865981

zhangtong007 commented 8 months ago

I also encountered the same problem on the Centos7 system. Is there any way to avoid this problem? Maybe a temporary solution?

anthosz commented 8 months ago

> I also encountered the same problem on the Centos7 system. Is there any way to avoid this problem? Maybe a temporary solution?

Move to Ubuntu = issue fixed. In addition, it allows you to do major upgrades without the need to reinstall.

Romain-Geissler-1A commented 8 months ago

I doubt anyone is going to invest much time trying to fix this issue. RHEL 7 "normal" support ends at the end of June, so in 4 more months. After this, I have no doubt some companies will pay for extended support until 2028, but these companies will have to make a choice: running "recent" cloud-related development tools on a 10+ year old OS may not be the most rational situation ;)

Being affected by this in a company with thousands of developers currently moving to cloud tools, here is what we are doing internally:

BenTheElder commented 8 months ago

Right. I can't speak for everyone contributing, but I just can't see choosing to prioritize this above everything else. Even setting aside the EOL release, the reason this is broken is that the kernel is too old. Kubernetes, containerd, runc, etc. are not tested on RHEL7 to my knowledge and expect a somewhat more reasonably current kernel. I expect the ecosystem will start to require cgroups v2 at some point in the not too distant future.

zhangtong007 commented 8 months ago

> > I also encountered the same problem on the Centos7 system. Is there any way to avoid this problem? Maybe a temporary solution?
>
> Move to Ubuntu = issue fixed. In addition, it allows you to do major upgrades without the need to reinstall.

Yeah, I re-tested it on Ubuntu and it's already supported. Thanks!

ncouse commented 8 months ago

@BenTheElder Yes, I understand that RHEL7 support is a lower priority for you, and I can appreciate that. I originally raised this in hopes of a simple workaround.

It is unfortunate, given the number of replies, that many people are stuck on RHEL7 for various reasons.

Besides the kernel, there are other issues on RHEL7 as well, such as older versions of libraries like glibc that are causing problems, so it is a problematic platform.

Migration to RHEL 9 (or other platforms) is the obvious solution, but of course that won't work for everyone.

KubeKyrie commented 6 months ago

Same problem. kind v0.20.0 cannot create clusters on CentOS 7.9, kernel 3.10.0.

BenTheElder commented 6 months ago

Note that the ecosystem is moving away from cgroups v1, which will necessitate a newer kernel:

https://github.com/kubernetes-sigs/kind/issues/3558#issuecomment-2040823712

One option might be developing Kubernetes things inside of a VM with a newer kernel if you can't upgrade the host.

anthosz commented 6 months ago

I guess this issue can be closed, no?

Solution: upgrade your OS

BenTheElder commented 6 months ago

We'd be willing to consider reasonable proposed solutions if others wish to dig in and come up with something, and we still intermittently see more users with this issue.

At minimum to close it we'd have to add an entry here https://kind.sigs.k8s.io/docs/user/known-issues/ (we probably should anyhow but E_TOO_MUCH_TO_DO)

pwyp commented 5 months ago

Same problem.

Citrix VDI + RHEL 7.9 (Maipo) Kernel: Linux 3.10.0-1160.114.2.el7.x86_64

Unfortunately, migration to RHEL 9 is not an option for me. On the other hand, it works perfectly on Win10 WSL2 + Ubuntu 22.

$ kind version
kind v0.23.0 go1.21.10 linux/amd64

$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.30.0) 🖼
 ✗ Preparing nodes 📦
Deleted nodes: ["kind-control-plane"]
ERROR: failed to create cluster: command "docker run --name kind-control-plane --hostname kind-control-plane --label io.x-k8s.kind.role=control-plane --privileged --security-opt seccomp=unconfined --security-opt apparmor=unconfined --tmpfs /tmp --tmpfs /run --volume /var --volume /lib/modules:/lib/modules:ro -e KIND_EXPERIMENTAL_CONTAINERD_SNAPSHOTTER --detach --tty --label io.x-k8s.kind.cluster=kind --net kind --restart=on-failure:1 --init=false --cgroupns=private --volume /dev/mapper:/dev/mapper --publish=127.0.0.1:43665:6443/TCP -e KUBECONFIG=/etc/kubernetes/admin.conf kindest/node:v1.30.0@sha256:047357ac0cfea04663786a612ba1eaba9702bef25227a794b52890dd8bcd692e" failed with error: exit status 125
Command Output: WARNING: Your kernel does not support cgroup namespaces.  Cgroup namespace setting discarded.
c55774d6753f3d8e257fb4f1dae6c10d12db12b44a933f65649da6df0c7351df
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: cgroup namespaces aren't enabled in the kernel: unknown.

BenTheElder commented 5 months ago

Please refrain from "same problem" comments that don't add new information to the discussion.

We're aware that RHEL 7 is not supported and does not work because the kernel is too old and does not support a required kernel feature (cgroup namespaces, introduced eight years ago https://lkml.org/lkml/2016/3/26/132/) that we adopted to work around other breaking changes in the cgroup v1 ecosystem. Someone will have to spend time designing a reasonable workaround that does not make kind less reliable for currently supported hosts and then we can review it.

I don't plan to design this myself as these old kernels aren't a priority for me personally, have alternatives available, and the assorted related projects are discussing cgroups v1 EOL anyhow and we cannot exceed the support of our dependencies etc.

Please see the above discussion.

pwyp commented 5 months ago

I can work around the error on RHEL7 by replacing the --cgroupns=private parameter with --cgroupns=host when executing the failing docker command manually from the console (see my previous comment).

Overall this does not help much, because even if I create 'kind-control-plane' manually using the docker run command

$ docker run --name kind-control-plane --hostname kind-control-plane --label io.x-k8s.kind.role=control-plane (here comes the rest of args)

$ kind get nodes
kind-control-plane

$ docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED              STATUS              PORTS                       NAMES
1d6a0930376c   kindest/node:v1.30.0   "/usr/local/bin/entr…"   About a minute ago   Up About a minute   127.0.0.1:43665->6443/tcp   kind-control-plane

kind still does a lot more under the hood while creating a cluster, and simply re-running cluster creation ends up with another error:

$ kind create cluster --verbosity 5
ERROR: failed to create cluster: node(s) already exist for a cluster with the name "kind"
Stack Trace:
sigs.k8s.io/kind/pkg/errors.Errorf
        sigs.k8s.io/kind/pkg/errors/errors.go:41
sigs.k8s.io/kind/pkg/cluster/internal/create.alreadyExists
        sigs.k8s.io/kind/pkg/cluster/internal/create/create.go:182
sigs.k8s.io/kind/pkg/cluster/internal/create.Cluster
        sigs.k8s.io/kind/pkg/cluster/internal/create/create.go:80
sigs.k8s.io/kind/pkg/cluster.(*Provider).Create
        sigs.k8s.io/kind/pkg/cluster/provider.go:192
sigs.k8s.io/kind/pkg/cmd/kind/create/cluster.runE
        sigs.k8s.io/kind/pkg/cmd/kind/create/cluster/createcluster.go:110
sigs.k8s.io/kind/pkg/cmd/kind/create/cluster.NewCommand.func1
        sigs.k8s.io/kind/pkg/cmd/kind/create/cluster/createcluster.go:54
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/cobra@v1.4.0/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/cobra@v1.4.0/command.go:974
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/cobra@v1.4.0/command.go:902
sigs.k8s.io/kind/cmd/kind/app.Run
        sigs.k8s.io/kind/cmd/kind/app/main.go:53
sigs.k8s.io/kind/cmd/kind/app.Main
        sigs.k8s.io/kind/cmd/kind/app/main.go:35
main.main
        sigs.k8s.io/kind/main.go:25
runtime.main
        runtime/proc.go:267
runtime.goexit
        runtime/asm_amd64.s:1650

  1. Is there any way to force 'kind' to implicitly make use of --cgroupns=host rather than --cgroupns=private while creating a cluster? EDIT: I guess not, as already discussed above (I missed that point somehow).
  2. Or maybe 'kind' could accept an already existing 'kind-control-plane' and proceed rather than end up with the above error?

These are just questions from an end-user point of view, and I cannot say if such workarounds would have implications for reliability.

BenTheElder commented 5 months ago

> Is there any way to force 'kind' to implicitly make use of --cgroupns=host rather than --cgroupns=private while creating a cluster? EDIT: I guess not as already discussed https://github.com/kubernetes-sigs/kind/issues/3311#issuecomment-1662715880 (I missed that point somehow)

No, and the reason we require cgroupns is that otherwise there is more leaky behavior from the host cgroups that frequently outright breaks kind. cgroupns=private solves this, and it is actually the default in docker / podman on cgroup v2 hosts.

If we disable this feature then it just won't work on newer hosts (and may not work reliably on these old hosts either, even if it appears to bring up a cluster), and if we make it customizable users will start to depend on this detail even though it shouldn't even be allowed on cgroup v2 (with the nested hierarchy this makes no sense) and causes broken behavior on v1.

MAYBE we could automatically do this as a fallback after parsing the error, but this is brittle, slow, and we've already been moving to make the internals of the node setup more maintainable by dropping all the broken attempts at working around hostns issues.
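To make the brittleness concrete: such a fallback would have to recognize the failure by matching the runtime's error text. The sketch below only illustrates that matching step and is not how kind works; the function name `is_cgroupns_error` is hypothetical, and the matched string is taken from the error output quoted earlier in this thread.

```shell
#!/bin/sh
# Hypothetical illustration of error-text matching; not part of kind.

# Return 0 if the given container runtime output looks like the
# "no cgroup namespace support" failure reported in this issue.
is_cgroupns_error() {
    case "$1" in
        *"cgroup namespaces aren't enabled in the kernel"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Any change to the wording of that message in runc or Docker would silently break the detection, which is part of why this approach is unattractive.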

> Or maybe 'kind' could accept already existing 'kind-control-plane' and proceed rather than end up with the above error? These are just questions from an end-user point of view and I cannot say if such workarounds would have implications for reliability.

kind create cluster would not. It is responsible for creating the containers and the options it uses are an implementation detail that the further steps depend on.
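For anyone who has already started a node container by hand as described above, the "node(s) already exist" error can be cleared by removing that container again. A minimal sketch follows, assuming the container carries the io.x-k8s.kind.cluster label from the docker run command shown earlier; the function names are invented for this example.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# List containers that kind would treat as nodes of the given cluster
# (default cluster name: "kind"), based on the cluster label.
list_cluster_containers() {
    cluster_name=${1:-kind}
    docker ps -a --filter "label=io.x-k8s.kind.cluster=$cluster_name" --format '{{.Names}}'
}

# Remove the cluster, including manually created node containers
# that carry the same label.
remove_leftover_cluster() {
    cluster_name=${1:-kind}
    kind delete cluster --name "$cluster_name"
}
```

This only cleans up the leftover container; it does not make cluster creation work on a kernel without cgroup namespace support.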

pwyp commented 5 months ago

I see the point now. Thank you for clarification and sharing valuable insights.

lowang-bh commented 5 months ago

same problem

command Output: WARNING: Your kernel does not support cgroup namespaces.  Cgroup namespace setting discarded.
622723da818fc19f164cdfec877be110348797b33aff47a82cb183177b64ee99
docker: Error response from daemon: OCI runtime create failed: cgroup namespaces aren't enabled in the kernel

kind v0.20.0 go1.20.4 linux/amd64 docker version 20.10.11 kernel 3.10.0

stmcginnis commented 5 months ago

Any reason to keep this issue open? Not sure if there are any actions here.

anthosz commented 5 months ago

> Any reason to keep this issue open? Not sure if there are any actions here.

> We'd be willing to consider reasonable proposed solutions if others wish to dig in and come up with something, and we still intermittently see more users with this issue.
>
> At minimum to close it we'd have to add an entry here https://kind.sigs.k8s.io/docs/user/known-issues/ (we probably should anyhow but E_TOO_MUCH_TO_DO)

ncouse commented 5 months ago

I am presuming this will not be addressed and can therefore be closed. I didn't close it myself in case there were actions you wanted to take.