kubernetes-sigs / kind

Kubernetes IN Docker - local clusters for testing Kubernetes
https://kind.sigs.k8s.io/
Apache License 2.0

Cgroups changes to run `kind` on Guix System #3363

worldofgeese closed this issue 1 year ago

worldofgeese commented 1 year ago

I have created a package definition for running kind on Guix System, which does not use systemd. I have all the other requirements in place, but I'd like for kind to create a cluster without checking for Delegate=yes, which is a systemd-only step. Unfortunately, I'm unable to bypass this check:

KIND_EXPERIMENTAL_PROVIDER=podman kind create cluster
using podman due to KIND_EXPERIMENTAL_PROVIDER
enabling experimental podman provider
ERROR: failed to create cluster: running kind with rootless provider requires setting systemd property "Delegate=yes", see https://kind.sigs.k8s.io/docs/user/rootless/
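
For reference, the fix described in the linked rootless docs assumes systemd; a rough sketch of those instructions follows (not applicable here, since Shepherd has no `user@.service` to delegate through, which is exactly why I'd like to bypass the check):

```sh
# On a systemd host, the linked docs delegate cgroup control to user
# sessions via a drop-in, roughly like so (sketch of the documented steps):
sudo mkdir -p /etc/systemd/system/user@.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/user@.service.d/delegate.conf
[Service]
Delegate=yes
EOF
sudo systemctl daemon-reload
```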
BenTheElder commented 1 year ago

> [...] but I'd like for kind to create a cluster without checking for Delegate=yes, which is a systemd-only step.

So what actually happens here is that we check which container features podman/docker report as supported, and if they're missing we conclude that delegation (Delegate=yes or its equivalent) must not be set up.

The code is here: https://github.com/kubernetes-sigs/kind/blob/ac28d7fb19b4f353369d889b3900a7a9dd46f4c1/pkg/cluster/internal/create/create.go#L252-L254

While the error message is systemd-oriented, because that's what is well supported in the ecosystem, the check itself isn't systemd-aware, and the underlying problem isn't systemd-specific either; only the recommended fix is.
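
If it helps, you can see exactly what the provider reports with something like the following (field names assumed from podman 4.x Go templates):

```sh
# Show what podman itself claims to support; kind's check keys off this,
# not off the presence of systemd:
podman info --format '{{ .Host.CgroupManager }} {{ .Host.CgroupControllers }}'
```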

BenTheElder commented 1 year ago

I do not recommend running containers on other init systems; the init system and container runtime should ideally cooperate to manage cgroups. Guix Shepherd is realistically not tested against or integrated with by any of the ecosystem's container tools.

You might be able to resolve this issue, but there are probably more. xref #3277 (openrc instead)

worldofgeese commented 1 year ago

@BenTheElder

Interesting! And thank you for the detailed response. Guix developers have done a lot of work to get Podman running (and rootlessly). I'll start a conversation with them and see what plumbing needs to be done and if it's something I'm capable of implementing.

For what it's worth, we do have cgroups v2 working, and as far as I can tell I am setting limits for all the required controllers in /etc/cgconfig.conf:

group kind {
    perm {
        admin {
            uid = worldofgeese;
        }
        task {
            uid = worldofgeese;
        }
    }

    cpuset {
        cpuset.mems="0";
        cpuset.cpus="0-5";
    }
    memory {
        memory.limit_in_bytes = 5000000000;
    }
    cpu {
        cpu.shares = 1024;
    }
    pids {
        pids.max = 1000;
    }
}
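
(Note the keys above are cgroup v1 names; on cgroup v2, which podman reports below, the equivalents would be memory.max and cpu.weight.) If this file is meant to be applied with libcgroup, which is an assumption on my part about how Guix consumes it, the usual invocation would be:

```sh
# Hypothetical: load /etc/cgconfig.conf with libcgroup's parser
# (assumes libcgroup is what applies these limits on this system):
sudo cgconfigparser -l /etc/cgconfig.conf
```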

/sys/fs/cgroup/cgroup.controllers shows I should have access to these controllers:

cpuset cpu io memory hugetlb pids misc
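
For comparison, cgroup.controllers only lists what the kernel makes available at that level; what is actually delegated to child cgroups is governed by cgroup.subtree_control:

```sh
# Controllers available at the root vs. controllers enabled for child cgroups:
cat /sys/fs/cgroup/cgroup.controllers
cat /sys/fs/cgroup/cgroup.subtree_control
```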

Result of podman info:

```yaml
host:
  arch: amd64
  buildahVersion: 1.29.0
  cgroupControllers: []
  cgroupManager: cgroupfs
  cgroupVersion: v2
  conmon:
    package: Unknown
    path: /gnu/store/iw3y9shdnjaalg737kps82z8pnhzwf8j-conmon-2.0.31/bin/conmon
    version: 'conmon version 2.0.31, commit: unknown'
  cpuUtilization:
    idlePercent: 90.67
    systemPercent: 2.64
    userPercent: 6.68
  cpus: 8
  distribution:
    distribution: guix
    version: unknown
  eventLogger: file
  hostname: mahakala
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 998
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.4.15
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 11793453056
  memTotal: 16411496448
  networkBackend: cni
  ociRuntime:
    name: crun
    package: Unknown
    path: /gnu/store/ayy02ajfq9hyyp02ilg77qlmddc1kdj9-crun-1.4.5/bin/crun
    version: |-
      crun version UNKNOWN
      commit: c381048530aa750495cf502ddb7181f2ded5b400
      spec: 1.0.0
      +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID
    rootless: true
    seccompEnabled: true
    seccompProfilePath: ""
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /gnu/store/4krcv5r0p0s1i1svaqp2mk1120iizc9k-slirp4netns-1.2.0/bin/slirp4netns
    package: Unknown
    version: |-
      slirp4netns version 1.2.0
      commit: unknown
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.4
  swapFree: 0
  swapTotal: 0
  uptime: 0h 3m 55.00s
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries: {}
store:
  configFile: /home/worldofgeese/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: vfs
  graphOptions: {}
  graphRoot: /home/worldofgeese/.local/share/containers/storage
  graphRootAllocated: 502607552512
  graphRootUsed: 287098028032
  graphStatus: {}
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 235
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/worldofgeese/.local/share/containers/storage/volumes
version:
  APIVersion: 4.4.1
  Built: 1
  BuiltTime: Thu Jan 1 01:00:01 1970
  GitCommit: ""
  GoVersion: go1.19.7
  Os: linux
  OsArch: linux/amd64
  Version: 4.4.1
```
BenTheElder commented 1 year ago

As it stands, I don't really have the bandwidth to debug podman on non-systemd hosts, but the relevant code is here:

https://github.com/kubernetes-sigs/kind/blob/ac28d7fb19b4f353369d889b3900a7a9dd46f4c1/pkg/cluster/internal/providers/podman/provider.go#L423

The problem is that your podman info is not reporting the availability of these controllers; in fact, it's reporting none:

cgroupControllers: []

worldofgeese commented 1 year ago

I don't expect anyone to respond; I just wanted to wrap up my investigation in anticipation of eventually bringing this to Guix System's mailing list.

I was able to force detection of these cgroup controllers by running `echo "+cpu +cpuset +memory +pids" >> /sys/fs/cgroup/cgroup.subtree_control`, at which point podman info reports:

  cgroupControllers:
  - cpuset
  - cpu
  - memory
  - pids

Creating a cluster still eluded me, as the error below shows.

KIND_EXPERIMENTAL_PROVIDER=podman kind create cluster --retain
using podman due to KIND_EXPERIMENTAL_PROVIDER
enabling experimental podman provider
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
 ✗ Preparing nodes 📦  
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"

In plain English, the error states that neither `Reached target .*Multi-User System.*` (which would indicate the node's systemd reached the multi-user target, i.e. the node booted successfully) nor `detected cgroup v1` could be found in the control plane's logs. Here's the function in kind's code that returns the error message: https://github.com/kubernetes-sigs/kind/blob/ac28d7fb19b4f353369d889b3900a7a9dd46f4c1/pkg/cluster/internal/providers/common/cgroups.go#L44.

The function that calls it waits up to 30 seconds for either message to appear:

https://github.com/kubernetes-sigs/kind/blob/ac28d7fb19b4f353369d889b3900a7a9dd46f4c1/pkg/cluster/internal/providers/podman/provision.go#L427-436
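
Roughly the same wait can be reproduced by hand against the retained node (a sketch, not kind's actual code path):

```sh
# Approximate kind's 30-second wait for either boot marker (sketch only):
timeout 30 sh -c \
  'podman logs -f kind-control-plane 2>&1 | grep -E -m1 "Reached target .*Multi-User System.*|detected cgroup v1"'
```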

Logs from the control plane below:

→ podman logs kind-control-plane 
INFO: running in a user namespace (experimental)
INFO: ensuring we can execute mount/umount even with userns-remap
INFO: remounting /sys read-only
mount: /sys: permission denied.
INFO: UserNS: ignoring mount fail
INFO: making mounts shared
INFO: detected cgroup v2
INFO: clearing and regenerating /etc/machine-id
Initializing machine ID from random generator.
INFO: faking /sys/class/dmi/id/product_name to be "kind"
INFO: faking /sys/class/dmi/id/product_uuid to be random
INFO: faking /sys/devices/virtual/dmi/id/product_uuid as well
INFO: setting iptables to detected mode: legacy
INFO: detected IPv4 address: 10.89.0.5
INFO: detected IPv6 address: fc00:f853:ccd:e793::5
INFO: starting init
Failed to look up module alias 'autofs4': Function not implemented
systemd 247.3-7+deb11u2 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)
Detected virtualization podman.
Detected architecture x86-64.

Welcome to Debian GNU/Linux 11 (bullseye)!

Set hostname to <kind-control-plane>.
Failed to create /init.scope control group: Permission denied
Failed to allocate manager object: Permission denied
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
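
The Permission denied on /init.scope made me check whether my user actually owns the cgroup its processes (and therefore the rootless container) live in; a quick check, assuming a pure cgroup v2 mount at /sys/fs/cgroup:

```sh
# Which cgroup does my shell live in, and can my user create children there?
cg="$(awk -F: '$1 == "0" { print $3 }' /proc/self/cgroup)"
ls -ld "/sys/fs/cgroup${cg}"
cat "/sys/fs/cgroup${cg}/cgroup.controllers"
```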

My conjecture is that kind is just too tightly wound up with systemd, making it difficult to work around these issues. Thanks again, Ben, for hopping in to discuss even with limited time. I appreciate you!

worldofgeese commented 1 year ago

I was finally able to get kind working by taking the following steps (the ad-hoc shell parts are collected into a single script after this list):

  1. Delegate the required controllers via the root cgroup's `subtree_control`: `echo "+cpu +cpuset +memory +pids" | sudo tee /sys/fs/cgroup/cgroup.subtree_control`
  2. Change the group ownership of the entire cgroup tree to the `users` group: `g=users && sudo chgrp -R ${g} /sys/fs/cgroup/`
  3. Change the ownership of the entire cgroup tree to my user: `u=$USER && sudo chown -R ${u}: /sys/fs/cgroup`
  4. Set the following in my Guix `system.scm`:
    ;; Rootless Podman requires the next 5 services
    ;; we're using the iptables service purely to make its resources available to minikube and kind
    (service iptables-service-type
             (iptables-configuration
              (ipv4-rules (plain-file "iptables.rules" "*filter
    :INPUT ACCEPT
    :FORWARD ACCEPT
    :OUTPUT ACCEPT
    COMMIT
    "))
              (ipv6-rules (plain-file "ip6tables.rules" "*filter
    :INPUT ACCEPT
    :FORWARD ACCEPT
    :OUTPUT ACCEPT
    COMMIT
    "))))
    (simple-service 'etc-subuid etc-service-type
                    (list `("subuid" ,(plain-file "subuid" (string-append "root:0:65536\n" username ":100000:65536\n")))))
    (simple-service 'etc-subgid etc-service-type
                    (list `("subgid" ,(plain-file "subgid" (string-append "root:0:65536\n" username ":100000:65536\n")))))
    (service pam-limits-service-type
             (list
              (pam-limits-entry "*" 'both 'nofile 100000)))
    (simple-service 'etc-container-policy etc-service-type
                    (list `("containers/policy.json" ,(plain-file "policy.json" "{\"default\": [{\"type\": \"insecureAcceptAnything\"}]}"))))
    %my-services
  5. Run `sudo guix system reconfigure`, then restart the system.
  6. Create the cluster with `KIND_EXPERIMENTAL_PROVIDER=podman kind create cluster --retain`.
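
For convenience, here are the ad-hoc shell parts of the above (steps 1-3 and 6) collected into a single script; they do not persist across reboots and assume the cgroup2 filesystem is mounted at /sys/fs/cgroup:

```sh
# Non-persistent parts of the procedure above; re-run after each reboot.
echo "+cpu +cpuset +memory +pids" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
sudo chgrp -R users /sys/fs/cgroup/
sudo chown -R "$USER": /sys/fs/cgroup

# Sanity-check that podman now reports the delegated controllers,
# then create the cluster:
podman info --format '{{ .Host.CgroupControllers }}'
KIND_EXPERIMENTAL_PROVIDER=podman kind create cluster --retain
```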