k3d-io / k3d

Little helper to run CNCF's k3s in Docker
https://k3d.io/

[BUG] Podman: "failed to find cpu cgroup (v2)" #1082

Open · rgilton opened this issue 2 years ago

rgilton commented 2 years ago

What did you do

I followed the instructions on using rootless podman from the k3d documentation.

What did you expect to happen

The cluster to start.

Screenshots or terminal output

[rob@f36vm1 ~]$ k3d cluster create --registry-use hive-registry hive
INFO[0000] Prep: Network                                
INFO[0000] Re-using existing network 'k3d-hive' (9093f5f999b9262e7e7cf068011acb42e4a60ea4dee5d6112c5d223dc2d0eeb8) 
INFO[0000] Created image volume k3d-hive-images         
INFO[0000] Container 'k3d-hive-registry' is already connected to 'k3d-hive' 
INFO[0000] Starting new tools node...                   
INFO[0000] Starting Node 'k3d-hive-tools'               
INFO[0001] Creating node 'k3d-hive-server-0'            
INFO[0001] Creating LoadBalancer 'k3d-hive-serverlb'    
INFO[0001] Using the k3d-tools node to gather environment information 
INFO[0001] HostIP: using network gateway 10.89.0.1 address 
INFO[0001] Starting cluster 'hive'                      
INFO[0001] Starting servers...                          
INFO[0001] Starting Node 'k3d-hive-server-0'            
WARN[0002] warning: encountered fatal log from node k3d-hive-server-0 (retrying 0/10): Mtime="2022-06-08T15:00:39Z" level=fatal msg="failed to find cpu cgroup (v2)" 
WARN[0002] warning: encountered fatal log from node k3d-hive-server-0 (retrying 1/10): Mtime="2022-06-08T15:00:39Z" level=fatal msg="failed to find cpu cgroup (v2)" 
WARN[0003] warning: encountered fatal log from node k3d-hive-server-0 (retrying 2/10): Mtime="2022-06-08T15:00:40Z" level=fatal msg="failed to find cpu cgroup (v2)" 
WARN[0005] warning: encountered fatal log from node k3d-hive-server-0 (retrying 3/10): Mtime="2022-06-08T15:00:42Z" level=fatal msg="failed to find cpu cgroup (v2)" 
WARN[0007] warning: encountered fatal log from node k3d-hive-server-0 (retrying 4/10): Mtime="2022-06-08T15:00:44Z" level=fatal msg="failed to find cpu cgroup (v2)" 
WARN[0007] warning: encountered fatal log from node k3d-hive-server-0 (retrying 5/10): Mtime="2022-06-08T15:00:44Z" level=fatal msg="failed to find cpu cgroup (v2)" 
WARN[0008] warning: encountered fatal log from node k3d-hive-server-0 (retrying 6/10): Mtime="2022-06-08T15:00:45Z" level=fatal msg="failed to find cpu cgroup (v2)" 
WARN[0009] warning: encountered fatal log from node k3d-hive-server-0 (retrying 7/10): Mtime="2022-06-08T15:00:46Z" level=fatal msg="failed to find cpu cgroup (v2)" 
WARN[0010] warning: encountered fatal log from node k3d-hive-server-0 (retrying 8/10): Mtime="2022-06-08T15:00:47Z" level=fatal msg="failed to find cpu cgroup (v2)" 
WARN[0012] warning: encountered fatal log from node k3d-hive-server-0 (retrying 9/10): Mtime="2022-06-08T15:00:49Z" level=fatal msg="failed to find cpu cgroup (v2)" 
ERRO[0013] Failed Cluster Start: Failed to start server k3d-hive-server-0: Node k3d-hive-server-0 failed to get ready: error waiting for log line `k3s is up and running` from node 'k3d-hive-server-0': stopped returning log lines 
ERRO[0013] Failed to create cluster >>> Rolling Back    
INFO[0013] Deleting cluster 'hive'                      
INFO[0013] Deleting 2 attached volumes...               
WARN[0013] Failed to delete volume 'k3d-hive-images' of cluster 'hive': failed to find volume 'k3d-hive-images': Error: No such volume: k3d-hive-images -> Try to delete it manually 
FATA[0013] Cluster creation FAILED, all changes have been rolled back! 
[rob@f36vm1 ~]$ 

Spying on the logs from one of the 'server' containers, the last few lines are:

time="2022-06-08T15:00:44Z" level=info msg="Node token is available at /var/lib/rancher/k3s/server/token"
time="2022-06-08T15:00:44Z" level=info msg="To join node to cluster: k3s agent -s https://10.89.0.14:6443 -t ${NODE_TOKEN}"
time="2022-06-08T15:00:44Z" level=info msg="Wrote kubeconfig /output/kubeconfig.yaml"
time="2022-06-08T15:00:44Z" level=info msg="Run: k3s kubectl"
time="2022-06-08T15:00:44Z" level=fatal msg="failed to find cpu cgroup (v2)"

This machine is using cgroups v2 as far as I can see (it is the Fedora 36 default):

[rob@f36vm1 ~]$ mount | grep cgr
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot)

Which OS & Architecture

All in a fresh Fedora 36 VM.

Which version of k3d

k3d version v5.4.3
k3s version v1.23.6-k3s1 (default)

Which version of docker

Using podman here:

Client:       Podman Engine
Version:      4.1.0
API Version:  4.1.0
Go Version:   go1.18
Built:        Fri May  6 12:15:54 2022
OS/Arch:      linux/amd64
radikaled commented 2 years ago

I also ran into this issue and was able to make some progress.

It looks like on Fedora 36 a non-root user does not have cpuset delegation by default:

$ cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers
cpu io memory pids

For reference: Enabling CPU, CPUSET, and I/O delegation (https://rootlesscontaine.rs/getting-started/common/cgroup2/#enabling-cpu-cpuset-and-io-delegation)
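
In case you don't want to click through, the gist of that guide is a systemd drop-in that delegates the additional controllers to user sessions. Roughly (paths and contents per the linked doc; double-check against the current version):

$ sudo mkdir -p /etc/systemd/system/user@.service.d
$ cat <<EOF | sudo tee /etc/systemd/system/user@.service.d/delegate.conf
[Service]
Delegate=cpu cpuset io memory pids
EOF
$ sudo systemctl daemon-reload

Then log out and back in so the user session picks up the new delegation.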

Once I enabled cpuset delegation (as outlined above), success!

$ cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers
cpuset cpu io memory pids
$ k3d cluster create
INFO[0000] Prep: Network                                
INFO[0000] Created network 'k3d-k3s-default'            
INFO[0000] Created image volume k3d-k3s-default-images  
INFO[0000] Starting new tools node...                   
INFO[0000] Starting Node 'k3d-k3s-default-tools'        
INFO[0001] Creating node 'k3d-k3s-default-server-0'     
INFO[0001] Creating LoadBalancer 'k3d-k3s-default-serverlb' 
INFO[0001] Using the k3d-tools node to gather environment information 
INFO[0001] HostIP: using network gateway 10.89.0.1 address 
INFO[0001] Starting cluster 'k3s-default'               
INFO[0001] Starting servers...                          
INFO[0001] Starting Node 'k3d-k3s-default-server-0'     
INFO[0005] All agents already running.                  
INFO[0005] Starting helpers...                          
INFO[0005] Starting Node 'k3d-k3s-default-serverlb'     
INFO[0012] Injecting records for hostAliases (incl. host.k3d.internal) and for 2 network members into CoreDNS configmap... 
INFO[0014] Cluster 'k3s-default' created successfully!  
INFO[0014] You can now use it like this:                
kubectl cluster-info

Although the initial cluster creation succeeded, I noticed that k3d-k3s-default-server-0 was unfortunately having trouble staying up. There are some hints in the log about what the kubelet is unhappy about:

E0615 19:04:02.504643       2 container_manager_linux.go:457] "Updating kernel flag failed (Hint: enable KubeletInUserNamespace feature flag to ignore the error)" err="open /proc/sys/kernel/panic: permission denied" flag="kernel/panic"
E0615 19:04:02.504726       2 container_manager_linux.go:457] "Updating kernel flag failed (Hint: enable KubeletInUserNamespace feature flag to ignore the error)" err="open /proc/sys/kernel/panic_on_oops: permission denied" flag="kernel/panic_on_oops"
E0615 19:04:02.504878       2 container_manager_linux.go:457] "Updating kernel flag failed (Hint: enable KubeletInUserNamespace feature flag to ignore the error)" err="open /proc/sys/vm/overcommit_memory: permission denied" flag="vm/overcommit_memory"
E0615 19:04:02.504972       2 kubelet.go:1431] "Failed to start ContainerManager" err="[open /proc/sys/kernel/panic: permission denied, open /proc/sys/kernel/panic_on_oops: permission denied, open /proc/sys/vm/overcommit_memory: permission denied]"

So without giving it too much thought I recreated the cluster like so:

$ k3d cluster create --k3s-arg '--kubelet-arg=feature-gates=KubeletInUserNamespace=true@server:*'

Seems OK but haven't dug much deeper to verify:

$ kubectl get nodes -o wide
NAME                       STATUS   ROLES                  AGE   VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION            CONTAINER-RUNTIME
k3d-k3s-default-server-0   Ready    control-plane,master   55s   v1.23.6+k3s1   10.89.0.2     <none>        K3s dev    5.17.13-300.fc36.x86_64   containerd://1.5.11-k3s2

$ kubectl get pods -A
NAMESPACE     NAME                                      READY   STATUS      RESTARTS   AGE
kube-system   local-path-provisioner-6c79684f77-wv88d   1/1     Running     0          2m24s
kube-system   coredns-d76bd69b-rbsw2                    1/1     Running     0          2m24s
kube-system   helm-install-traefik-crd-c64r4            0/1     Completed   0          2m24s
kube-system   metrics-server-7cd5fcb6b7-qfclf           1/1     Running     0          2m24s
kube-system   helm-install-traefik-w74nc                0/1     Completed   2          2m24s
kube-system   svclb-traefik-bgcz8                       2/2     Running     0          104s
kube-system   traefik-df4ff85d6-xf2nm                   1/1     Running     0          104s
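
By the way, if someone prefers a config file over CLI flags, the same kubelet argument should be expressible declaratively. A sketch assuming k3d v5's v1alpha4 Simple config schema (untested here, and the file name is made up):

# kubelet-userns.yaml
apiVersion: k3d.io/v1alpha4
kind: Simple
options:
  k3s:
    extraArgs:
      - arg: --kubelet-arg=feature-gates=KubeletInUserNamespace=true
        nodeFilters:
          - server:*

which would then be used via k3d cluster create --config kubelet-userns.yaml.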

I hope this helps!

Cheers,

hadrabap commented 1 year ago

@radikaled Thanks a lot!

I've been facing the same issue on Oracle Linux 8 with cgroups v2 and rootless Podman. The following command helped:

k3d cluster create --k3s-arg '--kubelet-arg=feature-gates=KubeletInUserNamespace=true@server:*'

It might be cool if k3d managed this automatically, or at least printed a hint when a rootless environment is detected.

almereyda commented 1 year ago

How could a rootless environment be detected?

hadrabap commented 1 year ago

Well, assuming a rootless environment is defined as one running under a non-root user at the host level, with the container's root user mapped to that (non-root) host user, the check should be that the user ID of the current process (at the host level, i.e. k3d itself) is not 0.

This is a (modified) example of how I detect root mode in my utility (C++):

#include <unistd.h>

int main() {
    // getuid() returns the real user ID of the calling process.
    const auto uid = getuid();
    if (uid > 0) {
        // root-less mode
    } else {
        // root-full mode
    }
    return 0;
}

Running k3d inside a container would be another exercise; I don't know whether that is even a supported feature.

Detecting cgroup v1 vs. cgroup v2 is trickier. I have two Oracle Linux 8 systems here; Oracle Linux 8 can run in both modes, but cgroup v1 is the default. The easiest way to check which version the system is currently running is to check the mounted filesystem type:

CGroup V1:

[opc@ipa ~]$ stat -fc %T /sys/fs/cgroup/
tmpfs

CGroup V2:

[opc@sws ~]$ stat -fc %T /sys/fs/cgroup/
cgroup2fs

If the result of the stat command is cgroup2fs, the system is running in cgroup v2 mode; otherwise it is cgroup v1.
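
Putting the two checks together, a minimal C++ sketch of what such detection could look like (purely illustrative, not k3d code; statfs(2) returns the same filesystem type that the stat command prints):

#include <linux/magic.h>  // CGROUP2_SUPER_MAGIC
#include <sys/statfs.h>   // statfs(2)
#include <unistd.h>       // getuid()
#include <cstdio>

int main() {
    // Rootless check: the real UID of the current process is not 0.
    const bool rootless = (getuid() != 0);

    // cgroup v2 check: /sys/fs/cgroup is mounted as cgroup2fs,
    // the programmatic equivalent of `stat -fc %T /sys/fs/cgroup/`.
    struct statfs st {};
    bool cgroup_v2 = false;
    if (statfs("/sys/fs/cgroup", &st) == 0) {
        cgroup_v2 = (st.f_type == CGROUP2_SUPER_MAGIC);
    }

    std::printf("rootless=%d cgroupv2=%d\n", rootless, cgroup_v2);
    return 0;
}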

P.S.: Please excuse me if I'm missing some crucial points here; I'm really new to this kind of stuff.

iosipeld commented 11 months ago

I had the same issue on Debian 11 today, on an Alibaba Cloud instance.

I added the following parameters to the GRUB_CMDLINE_LINUX variable in /etc/default/grub:

 cgroup_memory=1 cgroup_enable=memory

and rebooted the instance from the console. Now the error is gone and the systemd service starts correctly.
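
Spelled out, that change on Debian looks roughly like this (update-grub regenerates the GRUB configuration from /etc/default/grub):

$ sudoedit /etc/default/grub    # append cgroup_memory=1 cgroup_enable=memory to GRUB_CMDLINE_LINUX
$ sudo update-grub
$ sudo reboot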

Utopiah commented 8 months ago

on Fedora 36 a non-root user does not have the cpuset delegation by default

Same on bookworm/sid, but following https://rootlesscontaine.rs/getting-started/common/cgroup2/#enabling-cpu-cpuset-and-io-delegation indeed fixed it for me too.

omyhub commented 6 months ago

[root@localhost ~]# cat /etc/systemd/system/user@.service.d/delegate.conf
[Service]
Delegate=cpu cpuset io memory pids

[admin@localhost ~]$ cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers
cpuset io memory pids

[admin@localhost ~]$ stat -fc %T /sys/fs/cgroup/
cgroup2fs

[root@localhost ~]# docker logs -f k3d-k3s-default-server-0
......
time="2024-03-08T06:42:47.390547887Z" level=fatal msg="failed to find cpu cgroup (v2)"

Note that cpu is in the Delegate= line but missing from cgroup.controllers. Help, please!

omyhub commented 6 months ago

If the OS is a Red Hat-like distribution and you have the same problem, you can visit the links below:

https://access.redhat.com/solutions/6582021
https://access.redhat.com/solutions/737243
https://support.hpe.com/hpesc/public/docDisplay?docId=sf000082729en_us&docLocale=en_US&page=index.html

Solution: disable rtkit-daemon.
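
For completeness, a sketch of that fix, assuming rtkit-daemon runs as the usual rtkit-daemon.service systemd unit (a reboot, or at least a fresh login session, lets the cpu controller be delegated again):

$ sudo systemctl disable --now rtkit-daemon.service
$ sudo reboot
$ cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers

After the reboot, cpu should appear in cgroup.controllers again (compare the working output earlier in this thread).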