k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

containerd to use the systemd driver #5454

Closed gfrankliu closed 2 years ago

gfrankliu commented 2 years ago

Quoting the containerd cgroup driver docs:

While containerd and Kubernetes use the legacy cgroupfs driver for managing cgroups by default, it is recommended to use the systemd driver on systemd-based hosts

On Debian 11, does the k3s installation default to using the systemd driver for containerd? If not, how do I configure that?

brandond commented 2 years ago

Check out the related discussion at https://github.com/rancher/rke2/discussions/2710

gfrankliu commented 2 years ago

Moving the discussion from https://github.com/rancher/rke2/discussions/2710 back here.

Here is how to reproduce the issue:

On a clean Debian 11 system, install k3s with the systemd cgroup driver using the command below:

curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644 --disable traefik --kubelet-arg cgroup-driver=systemd

The cluster comes up fine, but the default pods in the kube-system namespace keep restarting and end up in the CrashLoopBackOff state within a few minutes:

$ kubectl get pod -A
NAMESPACE     NAME                                      READY   STATUS    RESTARTS      AGE
kube-system   coredns-96cc4f57d-rgk92                   1/1     Running   0             2m53s
kube-system   local-path-provisioner-84bb864455-jql9b   1/1     Running   2 (28s ago)   2m53s
kube-system   metrics-server-ff9dbcb6c-zfhq9            1/1     Running   3 (23s ago)   2m53s
$ kubectl get pod -A
NAMESPACE     NAME                                      READY   STATUS             RESTARTS      AGE
kube-system   coredns-96cc4f57d-rgk92                   1/1     Running            0             7m5s
kube-system   local-path-provisioner-84bb864455-jql9b   0/1     CrashLoopBackOff   3 (38s ago)   7m5s
kube-system   metrics-server-ff9dbcb6c-zfhq9            0/1     CrashLoopBackOff   5 (35s ago)   7m5s
brandond commented 2 years ago

As I mentioned in the other discussion, any logs or describe output showing why these pods are crashing would be helpful. Just showing that they are crashing isn't really enough to work with.

gfrankliu commented 2 years ago

Sorry, I thought the reproduction steps were enough, but here are the logs: k3s.txt

gfrankliu commented 2 years ago

I think I found the issue. I need to create a config.toml.tmpl and add:

  [plugins.cri.containerd.runtimes.runc.options]
    SystemdCgroup = true

Back to the original feature request: it would be great if the k3s installer had a flag for enabling the systemd cgroup driver that takes care of these manual changes to config.toml and the kubelet-arg.
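
For anyone reproducing this by hand: k3s reads the template from /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl (next to the generated config.toml), and the rest of the stock template still needs to be copied in above the override. A sketch of the relevant block (the runtime_type line is assumed from the stock k3s containerd config, not quoted in this thread):

[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins.cri.containerd.runtimes.runc.options]
  SystemdCgroup = true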

gfrankliu commented 2 years ago

I do run into one issue when using config.toml.tmpl. In order to keep using /etc/rancher/k3s/registries.yaml, I added the following to config.toml.tmpl:

{{ if .PrivateRegistryConfig }}
{{ if .PrivateRegistryConfig.Mirrors }}
[plugins.cri.registry.mirrors]{{end}}
{{range $k, $v := .PrivateRegistryConfig.Mirrors }}
[plugins.cri.registry.mirrors."{{$k}}"]
  endpoint = [{{range $i, $j := $v.Endpoints}}{{if $i}}, {{end}}{{printf "%q" .}}{{end}}]
{{end}}

{{range $k, $v := .PrivateRegistryConfig.Configs }}
{{ if $v.Auth }}
[plugins.cri.registry.configs."{{$k}}".auth]
  {{ if $v.Auth.Username }}username = "{{ $v.Auth.Username }}"{{end}}
  {{ if $v.Auth.Password }}password = "{{ $v.Auth.Password }}"{{end}}
  {{ if $v.Auth.Auth }}auth = "{{ $v.Auth.Auth }}"{{end}}
  {{ if $v.Auth.IdentityToken }}identitytoken = "{{ $v.Auth.IdentityToken }}"{{end}}
{{end}}
{{ if $v.TLS }}
[plugins.cri.registry.configs."{{$k}}".tls]
  {{ if $v.TLS.CAFile }}ca_file = "{{ $v.TLS.CAFile }}"{{end}}
  {{ if $v.TLS.CertFile }}cert_file = "{{ $v.TLS.CertFile }}"{{end}}
  {{ if $v.TLS.KeyFile }}key_file = "{{ $v.TLS.KeyFile }}"{{end}}
{{end}}
{{end}}
{{end}}

The quotes and backslashes in the Password from registries.yaml aren't being escaped. Without my own tmpl, k3s escapes the passwords from registries.yaml properly when creating config.toml. How do I fix the tmpl so that the Password gets escaped?

brandond commented 2 years ago

Are you using the template from the source code as a starting point?
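
Presumably the difference is that the template in the source lets the template engine quote the values (for example with printf "%q", the same way the endpoint list above is rendered) rather than wrapping them in hand-written quotes. A sketch of the change for the password field, using the field names from the template above:

# hand-placed quotes: breaks when the password contains " or \
  password = "{{ $v.Auth.Password }}"
# letting the template engine quote and escape the value
  password = {{ printf "%q" $v.Auth.Password }}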

gfrankliu commented 2 years ago

That worked, thanks!

The link from https://rancher.com/docs/k3s/latest/en/advanced/ goes to a different template.

brandond commented 2 years ago

yeah, the template got split out into platform-specific files a while back; the docs just haven't been updated.

brandond commented 2 years ago

This turned out to be pretty easy to handle; with any luck the May releases will use the systemd cgroup driver automatically when possible.
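
Once running a build with that change, one quick way to confirm is to check the generated containerd config, much like the validation later in this thread:

sudo grep SystemdCgroup /var/lib/rancher/k3s/agent/etc/containerd/config.toml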

dereknola commented 2 years ago

Updating the docs https://github.com/rancher/docs/pull/4042

galal-hussein commented 2 years ago

Assigning myself to work on the backporting PRs

ShylajaDevadiga commented 2 years ago

I was able to reproduce the issue with k3s v1.23.6+k3s1 on Debian 11 (Linode) using the steps above.

# kubectl get pods -A -w
NAMESPACE     NAME                                      READY   STATUS    RESTARTS      AGE
kube-system   metrics-server-7cd5fcb6b7-9452v           1/1     Running   0             2m14s
kube-system   local-path-provisioner-6c79684f77-hfqnj   1/1     Running   1 (87s ago)   2m14s
kube-system   coredns-d76bd69b-f76mc                    1/1     Running   1 (26s ago)   2m14s
kube-system   local-path-provisioner-6c79684f77-hfqnj   0/1     Error     1 (101s ago)   2m28s
kube-system   local-path-provisioner-6c79684f77-hfqnj   0/1     CrashLoopBackOff   1 (2s ago)     2m29s
kube-system   local-path-provisioner-6c79684f77-hfqnj   1/1     Running            2 (19s ago)    2m46s
kube-system   coredns-d76bd69b-f76mc                    0/1     Completed          1              2m55s
kube-system   coredns-d76bd69b-f76mc                    0/1     CrashLoopBackOff   1 (2s ago)     2m56s

Validated the fix on k3s v1.23.7-rc1+k3s1:

# sudo cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml |tail -10
    SystemdCgroup = true

# kubectl get pods -A 
NAMESPACE     NAME                                      READY   STATUS    RESTARTS   AGE
kube-system   local-path-provisioner-6c79684f77-pvcgn   1/1     Running   0          7m44s
kube-system   coredns-d76bd69b-5z8np                    1/1     Running   0          7m44s
kube-system   metrics-server-7cd5fcb6b7-lxfbx           1/1     Running   0          7m44s
gfrankliu commented 1 year ago

I finally got a chance to do a fresh install of the latest stable release to confirm the fix on Debian 11, which has systemd 247:

$ curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644                                                                        
[INFO]  Finding release for channel stable
[INFO]  Using v1.23.8+k3s1 as release

/var/lib/rancher/k3s/agent/etc/containerd/config.toml still shows

[plugins.cri.containerd.runtimes.runc.options]
    SystemdCgroup = false

I thought the installer would auto-detect this and set SystemdCgroup to true?

brandond commented 1 year ago

It will if systemd is compatible. The check is gated on k3s running under systemd, and the cpuset cgroup being available. Are both of these true?
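
A quick way to check both conditions on a node (assuming the default unit name of k3s; the second command also appears later in this thread):

systemctl show k3s --property=Type                                # expected: Type=notify
cat /sys/fs/cgroup/system.slice/k3s.service/cgroup.controllers    # expected to include cpuset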

gfrankliu commented 1 year ago

This is on a clean installation of Debian 11:

koi@test-debian11:~$ apt list --installed | grep systemd

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libpam-systemd/stable,now 247.3-7 amd64 [installed,automatic]
libsystemd0/stable,now 247.3-7 amd64 [installed]
systemd-sysv/stable,now 247.3-7 amd64 [installed]
systemd/stable,now 247.3-7 amd64 [installed]
koi@test-debian11:~$ cat /proc/$$/cpuset
/user.slice
brandond commented 1 year ago

That doesn't tell me whether the cpuset cgroup controller is delegated to the k3s service. When k3s is running, what's the output of cat /sys/fs/cgroup/system.slice/k3s.service/cgroup.controllers?

gfrankliu commented 1 year ago

$ cat /sys/fs/cgroup/system.slice/k3s.service/cgroup.controllers 
cpuset cpu io memory hugetlb pids rdma
gfrankliu commented 1 year ago

Another observation on Debian 11: there is no /sys/fs/cgroup/cpuset directory, so I can't see any files in it, but if I manually create that directory, all the files are immediately auto-populated:

koi@test-debian11:~$ ls -l /sys/fs/cgroup/cpuset
ls: cannot access '/sys/fs/cgroup/cpuset': No such file or directory
koi@test-debian11:~$ 
koi@test-debian11:~$ sudo mkdir /sys/fs/cgroup/cpuset
koi@test-debian11:~$ ls -l /sys/fs/cgroup/cpuset
total 0
-r--r--r-- 1 root root 0 Jul 12 01:11 cgroup.controllers
-r--r--r-- 1 root root 0 Jul 12 01:11 cgroup.events
-rw-r--r-- 1 root root 0 Jul 12 01:11 cgroup.freeze
-rw-r--r-- 1 root root 0 Jul 12 01:11 cgroup.max.depth
-rw-r--r-- 1 root root 0 Jul 12 01:11 cgroup.max.descendants
-rw-r--r-- 1 root root 0 Jul 12 01:11 cgroup.procs
-r--r--r-- 1 root root 0 Jul 12 01:11 cgroup.stat
-rw-r--r-- 1 root root 0 Jul 12 01:11 cgroup.subtree_control
-rw-r--r-- 1 root root 0 Jul 12 01:11 cgroup.threads
-rw-r--r-- 1 root root 0 Jul 12 01:11 cgroup.type
-rw-r--r-- 1 root root 0 Jul 12 01:11 cpu.max
-rw-r--r-- 1 root root 0 Jul 12 01:11 cpu.pressure
-rw-r--r-- 1 root root 0 Jul 12 01:11 cpuset.cpus
-r--r--r-- 1 root root 0 Jul 12 01:11 cpuset.cpus.effective
-rw-r--r-- 1 root root 0 Jul 12 01:11 cpuset.cpus.partition
-rw-r--r-- 1 root root 0 Jul 12 01:11 cpuset.mems
-r--r--r-- 1 root root 0 Jul 12 01:11 cpuset.mems.effective
-r--r--r-- 1 root root 0 Jul 12 01:11 cpu.stat
-rw-r--r-- 1 root root 0 Jul 12 01:11 cpu.weight
-rw-r--r-- 1 root root 0 Jul 12 01:11 cpu.weight.nice
-r--r--r-- 1 root root 0 Jul 12 01:11 hugetlb.1GB.current
-r--r--r-- 1 root root 0 Jul 12 01:11 hugetlb.1GB.events
-r--r--r-- 1 root root 0 Jul 12 01:11 hugetlb.1GB.events.local
-rw-r--r-- 1 root root 0 Jul 12 01:11 hugetlb.1GB.max
-r--r--r-- 1 root root 0 Jul 12 01:11 hugetlb.1GB.rsvd.current
-rw-r--r-- 1 root root 0 Jul 12 01:11 hugetlb.1GB.rsvd.max
-r--r--r-- 1 root root 0 Jul 12 01:11 hugetlb.2MB.current
-r--r--r-- 1 root root 0 Jul 12 01:11 hugetlb.2MB.events
-r--r--r-- 1 root root 0 Jul 12 01:11 hugetlb.2MB.events.local
-rw-r--r-- 1 root root 0 Jul 12 01:11 hugetlb.2MB.max
-r--r--r-- 1 root root 0 Jul 12 01:11 hugetlb.2MB.rsvd.current
-rw-r--r-- 1 root root 0 Jul 12 01:11 hugetlb.2MB.rsvd.max
-rw-r--r-- 1 root root 0 Jul 12 01:11 io.max
-rw-r--r-- 1 root root 0 Jul 12 01:11 io.pressure
-r--r--r-- 1 root root 0 Jul 12 01:11 io.stat
-rw-r--r-- 1 root root 0 Jul 12 01:11 io.weight
-r--r--r-- 1 root root 0 Jul 12 01:11 memory.current
-r--r--r-- 1 root root 0 Jul 12 01:11 memory.events
-r--r--r-- 1 root root 0 Jul 12 01:11 memory.events.local
-rw-r--r-- 1 root root 0 Jul 12 01:11 memory.high
-rw-r--r-- 1 root root 0 Jul 12 01:11 memory.low
-rw-r--r-- 1 root root 0 Jul 12 01:11 memory.max
-rw-r--r-- 1 root root 0 Jul 12 01:11 memory.min
-r--r--r-- 1 root root 0 Jul 12 01:11 memory.numa_stat
-rw-r--r-- 1 root root 0 Jul 12 01:11 memory.oom.group
-rw-r--r-- 1 root root 0 Jul 12 01:11 memory.pressure
-r--r--r-- 1 root root 0 Jul 12 01:11 memory.stat
-r--r--r-- 1 root root 0 Jul 12 01:11 memory.swap.current
-r--r--r-- 1 root root 0 Jul 12 01:11 memory.swap.events
-rw-r--r-- 1 root root 0 Jul 12 01:11 memory.swap.high
-rw-r--r-- 1 root root 0 Jul 12 01:11 memory.swap.max
-r--r--r-- 1 root root 0 Jul 12 01:11 pids.current
-r--r--r-- 1 root root 0 Jul 12 01:11 pids.events
-rw-r--r-- 1 root root 0 Jul 12 01:11 pids.max
-r--r--r-- 1 root root 0 Jul 12 01:11 rdma.current
-rw-r--r-- 1 root root 0 Jul 12 01:11 rdma.max
koi@test-debian11:~$ 
brandond commented 1 year ago

Hmm, that's odd. Something doesn't sound quite right. Do you have cgroup v2 enabled, or is this v1 or hybrid? grep cgroup /proc/mounts should indicate.

gfrankliu commented 1 year ago

I have a fresh installation of Debian 11, which defaults to cgroup v2 according to the release notes. This could be disabled via the kernel command line, but I didn't do that.

koi@test-debian11:~$ cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-5.10.0-15-amd64 root=UUID=c30de53c-9fe1-4889-928a-48db7891cac4 ro quiet
koi@test-debian11:~$ grep cgroup /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
koi@test-debian11:~$ 

Maybe the k3s check code doesn't work on Debian 11?

gfrankliu commented 1 year ago

Here is how I did the k3s installation. Do I need to pass any special options to help the installer use cgroup v2?

$ curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644
brandond commented 1 year ago

The only requirements for autodetecting this should be that the cpuset cgroup controller is available, and that the Type in the systemd unit is set to notify (which should be the default), which enables the systemd notification socket so that we know we're running under systemd. See if this makes any difference:

curl -sfL https://get.k3s.io | INSTALL_K3S_TYPE=notify sh -s - --write-kubeconfig-mode 644

gfrankliu commented 1 year ago

Tried curl -sfL https://get.k3s.io | INSTALL_K3S_TYPE=notify sh -s - --write-kubeconfig-mode 644 and it doesn't make a difference; /var/lib/rancher/k3s/agent/etc/containerd/config.toml still shows SystemdCgroup = false. Can you try it on a fresh Debian 11 and see if you can reproduce it?

gfrankliu commented 1 year ago

I do see the cpuset cgroup controller:

koi@test-debian11:~$ cat /sys/fs/cgroup/cgroup.controllers 
cpuset cpu io memory hugetlb pids rdma

BTW, I see the message below in syslog; is that normal?

Jul 12 22:20:03 test-debian11 k3s[15318]: W0712 22:20:03.658470   15318 manager.go:159] Cannot detect current cgroup on cgroup v2
brandond commented 1 year ago

Hmm. Apologies, it looks like this was regressed by https://github.com/k3s-io/k3s/commit/a9b5a1933fb. On servers, the NOTIFY_SOCKET environment variable gets unset, which prevents the cgroup detection code from detecting that it is running under systemd.

You can test the fix on your node with: curl -sfL https://get.k3s.io | INSTALL_K3S_TYPE=notify INSTALL_K3S_COMMIT=a2a5e79335c4a8c4d3f0038818ac0ef8b8403464 sh -s - --write-kubeconfig-mode 644

gfrankliu commented 1 year ago

Tried a fresh installation using curl -sfL https://get.k3s.io | INSTALL_K3S_TYPE=notify INSTALL_K3S_COMMIT=a2a5e79335c4a8c4d3f0038818ac0ef8b8403464 sh -s - --write-kubeconfig-mode 644, but /var/lib/rancher/k3s/agent/etc/containerd/config.toml still shows SystemdCgroup = false.

brandond commented 1 year ago

Did the install actually work? I don't believe CI is done yet for you to be able to install that commit. Try again shortly, and restart k3s after the install is successful.

I tested it from a local build and it does work.

gfrankliu commented 1 year ago

Installation did actually work:

$ curl -sfL https://get.k3s.io | INSTALL_K3S_TYPE=notify INSTALL_K3S_COMMIT=a2a5e79335c4a8c4d3f0038818ac0ef8b8403464 sh -s - --write-kubeconfig-mode 644
[INFO]  Using commit a2a5e79335c4a8c4d3f0038818ac0ef8b8403464 as release
[INFO]  Downloading hash https://storage.googleapis.com/k3s-ci-builds/k3s-a2a5e79335c4a8c4d3f0038818ac0ef8b8403464.sha256sum
[INFO]  Downloading binary https://storage.googleapis.com/k3s-ci-builds/k3s-a2a5e79335c4a8c4d3f0038818ac0ef8b8403464
[INFO]  Verifying binary download
[INFO]  Installing k3s to /usr/local/bin/k3s
[INFO]  Skipping installation of SELinux RPM
[INFO]  Creating /usr/local/bin/kubectl symlink to k3s
[INFO]  Creating /usr/local/bin/crictl symlink to k3s
[INFO]  Creating /usr/local/bin/ctr symlink to k3s
[INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[INFO]  systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service -> /etc/systemd/system/k3s.service.
[INFO]  systemd: Starting k3s

Restarting didn't help; it still shows SystemdCgroup = false.

Just tried a fresh install again; no difference.

brandond commented 1 year ago

What do you get from the following while k3s is running?

cat /sys/fs/cgroup/system.slice/k3s.service/cgroup.controllers
sudo cat /proc/$(pgrep k3s)/environ | tr \\0 \\n | grep .
ls -l /run/systemd/notify

gfrankliu commented 1 year ago
koi@test-debian11:~$ cat /sys/fs/cgroup/system.slice/k3s.service/cgroup.controllers
cpuset cpu io memory pids
koi@test-debian11:~$ sudo cat /proc/$(pgrep k3s)/environ | tr \\0 \\n | grep .
PATH=/var/lib/rancher/k3s/data/b1e4965bdf8b3b6087405f65958941d8de1e3cf92d70313b9f44b0dbe07c3001/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/var/lib/rancher/k3s/data/b1e4965bdf8b3b6087405f65958941d8de1e3cf92d70313b9f44b0dbe07c3001/bin/aux
NOTIFY_SOCKET=/run/systemd/notify
INVOCATION_ID=9b421358b85246919cd915f7e2dcc3b7
JOURNAL_STREAM=8:1294321
RES_OPTIONS= 
K3S_DATA_DIR=/var/lib/rancher/k3s/data/b1e4965bdf8b3b6087405f65958941d8de1e3cf92d70313b9f44b0dbe07c3001
koi@test-debian11:~$ ls -l /run/systemd/notify
srwxrwxrwx 1 root root 0 Jul 12 04:57 /run/systemd/notify
koi@test-debian11:~$ 
brandond commented 1 year ago

Hmm, so on your node systemd is not setting SYSTEMD_EXEC_PID... I guess that wasn't added until v248-2 in March of 2021, but you're still on 247. Maybe INVOCATION_ID is more reliable; that's been around since 232. It'd be nice if the docs said when those variables were added.

Try with c40c3620b77fd65aceea5188b547c987a5f7840f
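
For reference, a quick way to see which of these systemd-provided variables the k3s process actually received (reusing the environ command from the comment above; on this node INVOCATION_ID shows up but SYSTEMD_EXEC_PID does not):

sudo cat /proc/$(pgrep k3s)/environ | tr \\0 \\n | grep -E 'NOTIFY_SOCKET|INVOCATION_ID|SYSTEMD_EXEC_PID'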

gfrankliu commented 1 year ago

v248 is quite new. Debian 11 uses 247, Ubuntu 20.04 uses 245, and Red Hat 8 uses 239.

gfrankliu commented 1 year ago

c40c3620b77fd65aceea5188b547c987a5f7840f works. Thanks for taking the time to fix this!

brandond commented 1 year ago

Yeah, I'm on Ubuntu 22.04 which has 249. Glad that commit works for you! Today is upstream release day, so that commit won't make it into K3s until next month's releases.