k0sproject / k0s

k0s - The Zero Friction Kubernetes
https://docs.k0sproject.io

cgroups inheritance when using k0s in docker #4234

Open turdusmerula opened 7 months ago

turdusmerula commented 7 months ago

Platform

Linux 6.5.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2 x86_64 GNU/Linux
NAME="Linux Mint"
VERSION="21.3 (Virginia)"
ID=linuxmint
ID_LIKE="ubuntu debian"
PRETTY_NAME="Linux Mint 21.3"
VERSION_ID="21.3"
HOME_URL="https://www.linuxmint.com/"
SUPPORT_URL="https://forums.linuxmint.com/"
BUG_REPORT_URL="http://linuxmint-troubleshooting-guide.readthedocs.io/en/latest/"
PRIVACY_POLICY_URL="https://www.linuxmint.com/"
VERSION_CODENAME=virginia
UBUNTU_CODENAME=jammy

Version

v1.29.2+k0s.0

Sysinfo

`k0s sysinfo`
Total memory: 62.5 GiB (pass)
Disk space available for /var/lib/k0s: 188.3 GiB (pass)
Name resolution: localhost: [127.0.0.1 ::1] (pass)
Operating system: Linux (pass)
  Linux kernel release: 6.5.0-26-generic (pass)
  Max. file descriptors per process: current: 1048576 / max: 1048576 (pass)
  AppArmor: unavailable (pass)
  Executable in PATH: modprobe: /sbin/modprobe (pass)
  Executable in PATH: mount: /bin/mount (pass)
  Executable in PATH: umount: /bin/umount (pass)
  /proc file system: mounted (0x9fa0) (pass)
  Control Groups: version 2 (pass)
    cgroup controller "cpu": available (is a listed root controller) (pass)
    cgroup controller "cpuacct": available (via cpu in version 2) (pass)
    cgroup controller "cpuset": available (is a listed root controller) (pass)
    cgroup controller "memory": available (is a listed root controller) (pass)
    cgroup controller "devices": available (device filters attachable) (pass)
    cgroup controller "freezer": available (cgroup.freeze exists) (pass)
    cgroup controller "pids": available (is a listed root controller) (pass)
    cgroup controller "hugetlb": available (is a listed root controller) (pass)
    cgroup controller "blkio": available (via io in version 2) (pass)
  CONFIG_CGROUPS: Control Group support: no kernel config found (warning)
  CONFIG_NAMESPACES: Namespaces support: no kernel config found (warning)
  CONFIG_NET: Networking support: no kernel config found (warning)
  CONFIG_EXT4_FS: The Extended 4 (ext4) filesystem: no kernel config found (warning)
  CONFIG_PROC_FS: /proc file system support: no kernel config found (warning)

What happened?

I use the k0sproject/k0s:v1.29.2-k0s.0 docker image to run k0s with the following command:

export n=1
docker run -d --privileged --name="test$n-k0s" --memory=4G --cgroupns="host" --cgroup-parent="test$n-k0s.slice" -v=/var/lib/k0s k0sproject/k0s:v1.29.2-k0s.0 k0s controller --enable-worker --no-taints

The goal is to launch several instances in parallel, which works fine.

The problem I'm facing is with the cgroups. k0s runs correctly inside the container's cgroup scope, so the 4 GB memory limit applies to it. But if I look at the processes spawned by the containerd-shim, they are launched in /kubepods, so they are not constrained.

[Screenshot at 2024-04-03 19-58-53]

Is there a way to have the cgroup `/kubepods` created inside my container's cgroup? I don't know whether this is a bug, a lack of configuration on my side, or a feature request. Any help would be appreciated :)
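One way to check where a process actually landed, without relying on a screenshot, is to read its /proc/<pid>/cgroup file. The following is a minimal sketch, assuming a pure cgroup v2 host where that file contains a single line of the form `0::<path>`. The helper names `cgroup_v2_path` and `in_cgroup` are made up for illustration and are not part of k0s or Docker.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of k0s or Docker.

# Print the cgroup v2 path stored in a /proc/<pid>/cgroup file.
# Assumes cgroup v2, where the relevant line looks like "0::<path>".
cgroup_v2_path() {
  # $1: path to a cgroup file, e.g. /proc/1234/cgroup
  sed -n 's/^0:://p' "$1"
}

# Succeed if the cgroup path in file $1 equals $2 or lies below it.
in_cgroup() {
  local path
  path="$(cgroup_v2_path "$1")"
  case "$path" in
    "$2"|"$2"/*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For a containerd-shim child with PID `$pid` on the host, `in_cgroup "/proc/$pid/cgroup" /kubepods` succeeding would confirm the behavior described above. With --cgroupns=host the printed paths are relative to the host's cgroup root.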

twz123 commented 7 months ago

Your observations are indeed correct. The way the "k0s in Docker" docs are currently written is not optimized for running multiple workers on the same Docker host. In particular, the steps for cgroups v2 weaken the isolation between the host and the k0s container quite a bit.

The culprit here is that certain things related to cgroups need to be in place for kubelet and the container runtime to be happy, such as a writable cgroup root filesystem with all the necessary controllers enabled. While this can be achieved with some shenanigans like a clever Docker container entrypoint script, k0s doesn't have that support right now.

You can try to work around this by giving each k0s worker Docker container different values for the various cgroup-related kubelet configuration options. Try adding these args to each of your k0s workers' kubelet extraArgs and experiment with the outcome:

--cgroup-root=/test.slice/test$n-k0s.slice
--kubelet-cgroups=/test.slice/test$n-k0s.slice/kubelet.slice
--runtime-cgroups=/test.slice/test$n-k0s.slice/containerd.slice
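Since the three values differ only in the instance number, they can be generated rather than typed by hand. This is a small sketch that just prints the flags listed above for a given `n`. The function name `kubelet_cgroup_args` is hypothetical, and whether k0s actually forwards these flags to kubelet in a given version is something to verify.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the three kubelet flags for instance number $1,
# one per line, using the slice layout suggested above.
kubelet_cgroup_args() {
  local root="/test.slice/test${1}-k0s.slice"
  printf -- '--cgroup-root=%s\n' "$root"
  printf -- '--kubelet-cgroups=%s/kubelet.slice\n' "$root"
  printf -- '--runtime-cgroups=%s/containerd.slice\n' "$root"
}
```

Calling `kubelet_cgroup_args 1` prints the flags for the first instance. Keeping the slice root in one variable avoids the three paths drifting apart when the naming scheme changes.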

I took a stab at the Docker entrypoint script a few months ago, but haven't polished it up for a PR yet. That might provide some additional insight.

turdusmerula commented 7 months ago

Thank you for this answer @twz123, this is pretty much what I managed to implement. However, it feels kind of hacky.

I start the container so that it waits for its configuration file. While doing so, Docker creates the test$n.slice cgroup:

docker run -it --name test$n -d --cgroupns=host --cgroup-parent=test$n.slice --hostname k0s --privileged -v /var/lib/k0s -v "/sys/fs/cgroup:/sys/fs/cgroup:rw" k0sproject/k0s:v1.29.2-k0s.0 bash -c 'while [[ ! -f /var/lib/k0s/config.yaml ]]; do sleep 1; done; k0s controller --enable-worker --no-taints --config /var/lib/k0s/config.yaml --profile=cgroup --enable-metrics-scraper'

While the container is waiting I then set the limits inside the cgroup (doing it with docker allows me to do it without sudo):

docker exec -it test$n bash -c "echo 6000M > /sys/fs/cgroup/test$n.slice/memory.max"
docker exec -it test$n bash -c "echo 5500M > /sys/fs/cgroup/test$n.slice/memory.high"
docker exec -it test$n bash -c "echo 0 > /sys/fs/cgroup/test$n.slice/memory.swap.max"
docker exec -it test$n bash -c "echo 0 > /sys/fs/cgroup/test$n.slice/memory.swap.high"
docker exec -it test$n bash -c "echo '0-4' > /sys/fs/cgroup/test$n.slice/cpuset.cpus"
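To check that the limits above were applied, the values can be read back from the same files. The kernel reports memory limits in bytes, so a written value such as 6000M has to be converted before comparing. The sketch below assumes the K, M and G suffixes are interpreted as powers of 1024, which is my understanding of the cgroup v2 interface. `to_bytes` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a value as written to memory.max / memory.high
# (e.g. "6000M") into bytes. Assumes K/M/G are powers of 1024.
# Values without a suffix, including the literal "max", are echoed unchanged.
to_bytes() {
  local value="$1" number suffix
  number="${value%[KkMmGg]}"      # strip one trailing unit letter, if any
  suffix="${value#"$number"}"     # whatever was stripped
  case "$suffix" in
    [Kk]) echo $(( number * 1024 )) ;;
    [Mm]) echo $(( number * 1024 * 1024 )) ;;
    [Gg]) echo $(( number * 1024 * 1024 * 1024 )) ;;
    '')   echo "$number" ;;
  esac
}
```

The result of `to_bytes 6000M` can then be compared with the output of `docker exec test$n cat /sys/fs/cgroup/test$n.slice/memory.max`. The kernel may round limits down to a multiple of the page size, so a small difference is not necessarily an error.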

I then construct the config.yaml and push it to the container:

apiVersion: k0s.k0sproject.io/v1beta1
kind: ClusterConfig
metadata:
  name: k0s
spec:
  api:
    extraArgs:
      # allow the cluster to expose on localhost
      service-node-port-range: 80-32767
  telemetry:
    enabled: false

  workerProfiles:
  # https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/
  # https://github.com/k0sproject/k0s/blob/main/docs/configuration.md
  - name: cgroup
    values:
      cgroupRoot: test$n.slice
      systemCgroups: test$n.slice
      kubeletCgroups: test$n.slice

docker cp config.yaml test$n:/var/lib/k0s/config.yaml
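Note that `test$n` inside a YAML file is not expanded by anything, so the instance number has to be substituted when the file is constructed. One way to do that is a shell here-document, sketched below with the same values as the configuration above. The function name `render_config` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the k0s ClusterConfig shown above with the
# instance number $1 substituted into the slice names.
render_config() {
  local n="$1"
  # Unquoted EOF: the shell expands ${n} inside the here-document.
  cat <<EOF
apiVersion: k0s.k0sproject.io/v1beta1
kind: ClusterConfig
metadata:
  name: k0s
spec:
  api:
    extraArgs:
      service-node-port-range: 80-32767
  telemetry:
    enabled: false
  workerProfiles:
  - name: cgroup
    values:
      cgroupRoot: test${n}.slice
      systemCgroups: test${n}.slice
      kubeletCgroups: test${n}.slice
EOF
}
```

Usage would be `render_config "$n" > config.yaml`, followed by the `docker cp` command above.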

Processes are now in the correct cgroup (instead of kubelet):

[Screenshot at 2024-04-09 09-56-13]

The memory and CPU limits work: if I constrain the cluster too tightly, it won't start, and it doesn't swap, as I was expecting. The limitation of this approach, still unsolved for now, is that the OOM killer does not work. I suspect the reason is that the kubelet runs outside of the cgroup, so it is not aware of the memory limits. The stranger part is that the kernel OOM killer does not work either. When my cluster saturates its memory, it enters a strange state where it loads its CPUs without killing any process. I still have to investigate this point.
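A possible starting point for that investigation is the cgroup's memory.events file, which in cgroup v2 holds counters such as `high`, `max`, `oom` and `oom_kill`. The sketch below only extracts one counter from such a file. `memory_event` is a made-up helper name, and the interpretation that follows is a hypothesis, not a diagnosis.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one counter from a cgroup v2
# memory.events file, which consists of "<key> <count>" lines.
memory_event() {
  # $1: path to memory.events, $2: counter name (e.g. oom_kill)
  awk -v key="$2" '$1 == key { print $2 }' "$1"
}
```

Inside the container this could be called as `memory_event /sys/fs/cgroup/test$n.slice/memory.events oom_kill`. If `high` keeps rising while `oom_kill` stays at zero, the processes may be getting throttled and pushed into reclaim at memory.high rather than killed at memory.max, which might explain busy CPUs without any process being killed.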

twz123 commented 7 months ago

Thanks for experimenting and sharing the results @turdusmerula! For historical reasons, k0s disregards the kubeletCgroups field in the worker profile. This is probably something that should be fixed. However, it should work if you pass it as a kubelet argument via k0s worker --kubelet-extra-args=--kubelet-cgroups=/test.slice/test$n-k0s.slice/kubelet.slice.

turdusmerula commented 7 months ago

Kubelet tells me that --kubelet-cgroups is deprecated when I dig into its help:

      --kubelet-cgroups string                                   Optional absolute name of cgroups to create and run the Kubelet in. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)

That's why I used the kubeletCgroups field, do you confirm I should not set it through the profile?

twz123 commented 7 months ago

do you confirm I should not set it through the profile?

Yes, the flags are deprecated, but k0s will currently ignore the kubeletCgroups field in the worker profile, of all things. Until this is fixed, the deprecated kubelet flag is a backdoor.

turdusmerula commented 7 months ago

Have you already managed to use the --kubelet-extra-args parameter? No matter what I try to pass, even junk parameters, it does not seem to reach the kubelet in the end. I find no trace of kubelet evaluating any extra arg in the container logs.

turdusmerula commented 7 months ago

Everything is working now, I had just been quite unlucky. I chose to override the kubelet configuration by passing it in a file called /var/lib/k0s/kubelet-config.yaml, and it took me a while to figure out that k0s already uses this path for the config it generates and passes to kubelet, so my file was replaced and had no effect.

However, I confirm that passing parameters through --kubelet-extra-args has no effect, as they are overridden by the kubelet-config.yaml generated by k0s. The only way I could overcome this was by setting --kubelet-extra-args=--config=/var/lib/k0s/kubelet-ext-config.yaml and putting my config in that file:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

# default values coming from /var/lib/k0s/kubelet-config.yaml and created by k0s
authentication:
  anonymous: {}
  webhook:
    cacheTTL: 0s
  x509:
    clientCAFile: /var/lib/k0s/pki/ca.crt
authorization:
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
cgroupsPerQOS: true
clusterDNS:
- 10.96.0.10
clusterDomain: cluster.local
containerRuntimeEndpoint: unix:///run/k0s/containerd.sock
cpuManagerReconcilePeriod: 0s
eventRecordQPS: 0
evictionPressureTransitionPeriod: 0s
failSwapOn: false
fileCheckFrequency: 0s
httpCheckFrequency: 0s
imageMaximumGCAge: 0s
imageMinimumGCAge: 0s
kubeReservedCgroup: system.slice
logging:
  flushFrequency: 0
  options:
    json:
      infoBufferSize: "0"
  verbosity: 0
memorySwap: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
resolvConf: /etc/resolv.conf
rotateCertificates: true
runtimeRequestTimeout: 0s
serverTLSBootstrap: true
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
tlsCipherSuites:
- TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
- TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
- TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
- TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
tlsMinVersion: VersionTLS12
volumePluginDir: /usr/libexec/k0s/kubelet-plugins/volume/exec
volumeStatsAggPeriod: 0s

# cgroups configuration
kubeletCgroups: "/test.slice/kubelet"
systemCgroups: "/test.slice/system"
cgroupRoot: "/test.slice"
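Since the file above is the k0s-generated configuration plus three extra keys, it could be built mechanically instead of being copied by hand. The following is only a sketch of that idea, not necessarily how the file above was produced. It assumes a copy of the generated /var/lib/k0s/kubelet-config.yaml is available from an earlier start and that it does not already set the three cgroup keys. `write_ext_config` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a base kubelet configuration and append the
# cgroup overrides shown above.
# $1: base config (e.g. a copy of the k0s-generated kubelet-config.yaml)
# $2: output file (e.g. kubelet-ext-config.yaml)
# $3: cgroup root (e.g. /test.slice)
# Assumes the base file does not already contain these three keys.
write_ext_config() {
  local src="$1" dst="$2" slice="$3"
  {
    cat "$src"
    printf '\n# cgroups configuration\n'
    printf 'kubeletCgroups: "%s/kubelet"\n' "$slice"
    printf 'systemCgroups: "%s/system"\n' "$slice"
    printf 'cgroupRoot: "%s"\n' "$slice"
  } > "$dst"
}
```

Because the output path differs from /var/lib/k0s/kubelet-config.yaml, k0s regenerating its own file should not overwrite the result.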

I think there is room for improvement here. The way I have to do this feels way too hacky for now.

github-actions[bot] commented 6 months ago

The issue is marked as stale since no activity has been recorded in 30 days

twz123 commented 11 hours ago

The issues with cgroups in the docker docs and entrypoint have been addressed in #5263.