Closed: tallaxes closed this issue 2 years ago.
Can you try this on a more recent release of RKE2? The most recent release on each branch should now lease-lock all images imported from tarballs on startup. Ref:
NOTE: The lease-lock only persists if the image was imported during the most recent startup. If you import the images, then delete the tarball and restart RKE2, the lease will be cleared and the image may get garbage collected.
Also, note that `--pod-infra-container-image` is a docker-specific flag and cannot be used with containerd, which is why we don't use it:
> `--pod-infra-container-image string` Default: `k8s.gcr.io/pause:3.5` — Specified image will not be pruned by the image garbage collector. When container-runtime is set to docker, all containers in each pod will use the network/IPC namespaces from this image. Other CRI implementations have their own configuration to set this image.
Instead, we pass it into the containerd config here: https://github.com/k3s-io/k3s/blob/master/pkg/agent/templates/templates_linux.go#L28
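For context, the rendered containerd config ends up pinning the sandbox image roughly like this (a sketch only; the exact plugin section name varies across containerd/K3s versions):

```toml
# Sketch of the fragment K3s/RKE2 renders into containerd's config.toml;
# section naming differs between containerd 1.x releases.
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "index.docker.io/rancher/pause:3.5"
```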
I did test on latest version of RKE2 as well, and observed the same behavior. (Will double check ...)
I don't expect containerd's lease-locking is going to help here. It prevents containerd from doing GC, but not the kubelet. Lease-locking would not prevent an explicit image deletion via `crictl rmi`, would it? That's what the kubelet is doing; here, translating into this CRI call.
`--pod-infra-container-image` appears to be multiple things. The documentation is correct, I think, if a little confusing. Here is the way I parse it:
> Specified image will not be pruned by the image garbage collector.
Check. That's what this issue is about ...
> When container-runtime is set to docker, all containers in each pod will use the network/IPC namespaces from this image. Other CRI implementations have their own configuration to set this image.
Check. That's why K3s needs to configure containerd explicitly elsewhere ...
Retested on latest version, same problem (that one is not air-gapped, can re-pull):
```
# rke2 -v
rke2 version v1.21.4+rke2r1 (edc8a09018f21b08e305243c1622a8032acdf3c5)
go version go1.16.6b7
# crictl info --output go-template --template "{{.config.sandboxImage}}"
index.docker.io/rancher/pause:3.5
# crictl images | grep pause
docker.io/rancher/pause   3.5   69bfc1b271447   299kB
# df -h /var/lib
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2   30G   12G   19G  39% /
# fallocate -l 17G /var/lib/bigfile
# df -h /var/lib
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2   30G   29G  1.5G  96% /
# crictl images | grep pause
docker.io/rancher/pause   3.5   69bfc1b271447   299kB
# sleep 5m
# crictl images | grep pause
# # gone
# crictl pull index.docker.io/rancher/pause:3.5
Image is up to date for sha256:69bfc1b271447c9b5c1c5dc26e3163a524b7dab051626bed18b9f366c0fa764a
# crictl images | grep pause
docker.io/rancher/pause   3.5   69bfc1b271447   299kB
# sleep 5m
# crictl images | grep pause
# # gone
```
Hmm, if lease locking doesn't prevent GC then I'm not sure what it's good for.
I am not that familiar with containerd GC, but from reading the docs it seems to be working on a different level, so there is likely more than one "GC" going on. The one of interest in this case is kubelet's container image GC. Here is evidence it is kubelet that is explicitly removing sandbox/pause image as part of GC:
```
# cat >> /etc/rancher/rke2/config.yaml <<EOF
kubelet-arg:
- "v=5"
EOF
# systemctl restart rke2-server
# df -h /var
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2   30G  8.9G   22G  30% /
# fallocate -l 19G /var/lib/bigfile
# crictl images | grep pause
docker.io/rancher/pause   3.5   69bfc1b271447   299kB
# tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log | grep 69bfc1b271447
I0917 03:00:41.154419 1847818 image_gc_manager.go:241] "Adding image ID to currentImages" imageID="sha256:69bfc1b271447c9b5c1c5dc26e3163a524b7dab051626bed18b9f366c0fa764a"
I0917 03:00:41.154424 1847818 image_gc_manager.go:258] "Image ID has size" imageID="sha256:69bfc1b271447c9b5c1c5dc26e3163a524b7dab051626bed18b9f366c0fa764a" size=299461
I0917 03:00:41.154478 1847818 image_gc_manager.go:359] "Evaluating image ID for possible garbage collection" imageID="sha256:69bfc1b271447c9b5c1c5dc26e3163a524b7dab051626bed18b9f366c0fa764a"
I0917 03:00:41.154485 1847818 image_gc_manager.go:375] "Removing image to free bytes" imageID="sha256:69bfc1b271447c9b5c1c5dc26e3163a524b7dab051626bed18b9f366c0fa764a" size=299461
^C
# crictl images | grep pause
# # gone
```
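For what it's worth, the same evidence can be pulled out of a saved log non-interactively; a sketch, with the sample lines canned from the session above rather than taken from a live node:

```shell
# Extract the image IDs that kubelet's image GC decided to remove from a
# captured kubelet.log. Sample input is canned from the session above.
cat > /tmp/kubelet-sample.log <<'EOF'
I0917 03:00:41.154478 1847818 image_gc_manager.go:359] "Evaluating image ID for possible garbage collection" imageID="sha256:69bfc1b271447c9b5c1c5dc26e3163a524b7dab051626bed18b9f366c0fa764a"
I0917 03:00:41.154485 1847818 image_gc_manager.go:375] "Removing image to free bytes" imageID="sha256:69bfc1b271447c9b5c1c5dc26e3163a524b7dab051626bed18b9f366c0fa764a" size=299461
EOF
# Print only the imageID of each "Removing image to free bytes" event.
sed -n 's/.*"Removing image to free bytes" imageID="\([^"]*\)".*/\1/p' /tmp/kubelet-sample.log
```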
Yeah, but that just marks it as unused by the kubelet; containerd does the garbage collection separately, so it should not be possible to actually remove it while it's locked by a lease. It's possible that the tags are not locked, though, just the image layers. Unfortunately the kubelet itself does not have a way to protect critical images from GC, so we were hoping to do it at a lower level.
Ok, maybe I need to do some more testing, now that (I think) I understand better what lease-locking is intended to achieve.
It seems lease-locking is only used when images are loaded from tar, so some of the above tests, which were not made in airgap (with images not loaded from tar), do not apply. Though I am still wondering about the implications of the pause image being GC'ed (and re-pulled) in that case; it seems it would still be good to have kubelet ignore that image altogether ...
And for airgap deployment, we do currently remove image tars after first startup (leftover from this issue), so, based on your note, any restart of RKE2 would clear the lease. So it seems some more careful testing is in order - maybe augmented with boltbrowser to verify the leases.
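Instead of boltbrowser, the bundled ctr client should also be able to show the leases. Something like this on a live RKE2 node (binary and socket paths assumed from a default install, not verified here):

```
# On a live RKE2 node only: list containerd leases in the k8s.io namespace,
# which is where the image-import leases would show up.
/var/lib/rancher/rke2/bin/ctr \
  --address /run/k3s/containerd/containerd.sock \
  --namespace k8s.io leases ls
```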
Yeah, if you can test with the tars in place that would be good.
Also, according to the comment at https://github.com/kubernetes/kubernetes/issues/81756#issuecomment-523710417 we can use that flag to protect the image from kubelet GC even with containerd, so maybe there's still an improvement to be made there. It doesn't solve the GC problem for other images, though.
There's also this, which I hope will land on 1.23: https://github.com/kubernetes/kubernetes/pull/103299
This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
Please reopen, this bug still happens
Second that.
Please confirm you're using a recent version of K3s; we have been passing `--pod-infra-container-image` to the kubelet to prevent GC for ages, and the upstream issues about the image getting pruned anyway should be resolved.
Environmental Info: RKE2 Version:
rke2 version v1.20.8+rke2r1 (53564f7b22271ced120480dd5f4c9d76d14ed2d3) go version go1.15.8b5
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
single node
Describe the bug:
Sandbox (pause) image - if different from built-in kubelet default (and in RKE2 it is always different, due to rancher repo) - is always garbage collected by kubelet under disk pressure conditions; and in case there is nowhere to pull it from (unusual, but possible in airgap) no new pods can be started.
It looks like kubelet decides which sandbox image to exempt from GC based on name/tag, with the defaults hard-coded. And, being a special image, I suspect it evades kubelet's "is it used by any pod?" check. The default name for the sandbox image can be overridden via the `--pod-infra-container-image` arg, but RKE2 currently does not do it. The relevant behavior comes from K3s, so K3s should also be affected, but I only tested on RKE2.
K3s (and presumably RKE2) does appear to propagate the `pause-image` setting (if any) into the kubelet's `--pod-infra-container-image`, but only if the runtime is not remote (here), which is never the case (at least not under RKE2). This also means that setting `pause-image` explicitly does not help. Setting the kubelet's `--pod-infra-container-image` explicitly in the RKE2 config does work. Of course, the image has to match the one used by RKE2 (later versions use pause:3.5), which is why it is probably best to have RKE2 (K3s?) set it automatically ...
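For reference, the workaround described above would look something like this in /etc/rancher/rke2/config.yaml (a sketch; the tag must match the pause image your RKE2 version actually uses):

```yaml
# Workaround sketch: pass --pod-infra-container-image to kubelet directly so
# its image GC exempts the pause image. Match the tag to your RKE2 version.
kubelet-arg:
  - "pod-infra-container-image=index.docker.io/rancher/pause:3.5"
```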
Steps To Reproduce:
```
systemctl restart rke2-server
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log | grep image_gc_manager
```
Expected behavior:
[Custom] Sandbox (pause) image is never garbage collected
Actual behavior:
[Custom] Sandbox (pause) image is always garbage collected under disk pressure conditions; and in case there is nowhere to pull it from (unusual, but possible in airgap) no new pods can be started.