rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Sandbox (pause) image always gets garbage collected (under disk pressure) #1830

Closed tallaxes closed 2 years ago

tallaxes commented 3 years ago

Environmental Info: RKE2 Version:

rke2 version v1.20.8+rke2r1 (53564f7b22271ced120480dd5f4c9d76d14ed2d3) go version go1.15.8b5

Node(s) CPU architecture, OS, and Version:

Linux <snip> 4.18.0-305.12.1.el8_4.x86_64 #1 SMP Mon Jul 26 08:06:24 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

single node

Describe the bug:

The sandbox (pause) image - if it differs from the kubelet's built-in default (and in RKE2 it always does, since it comes from the rancher repo) - is always garbage collected by the kubelet under disk pressure; and if there is nowhere to re-pull it from (unusual, but possible in airgap), no new pods can be started.

It looks like the kubelet decides which sandbox image to exempt from GC based on its name/tag, with the default hard-coded. And, being a special image, I suspect it also evades the kubelet's "is it used by any pod?" check. The default name for the sandbox image can be overridden via the --pod-infra-container-image arg, but RKE2 currently does not set it.

The relevant behavior comes from K3S, so K3S should also be affected - but I only tested on RKE2.

K3S (and presumably RKE2) does appear to propagate the pause-image setting (if any) into the kubelet's --pod-infra-container-image, but only if the runtime is not remote (here) - which is never the case (at least not under RKE2). This also means that setting pause-image explicitly does not help.

Setting kubelet --pod-infra-container-image explicitly in RKE2 config does work:

kubelet-arg:
- "pod-infra-container-image=docker.io/rancher/pause:3.2"

Of course the image has to match the one used by RKE2 (later versions use pause:3.5) - which is why it is probably best to have RKE2 (K3S?) set it automatically ...
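(For reference, one way to find the value to pass - so the kubelet flag matches whatever containerd is actually configured with - is to read it straight from the CRI config. The commands below are just a sketch using RKE2's default paths:)

```sh
# sandbox image containerd is configured with (this is the value to mirror in kubelet-arg)
crictl info --output go-template --template "{{.config.sandboxImage}}"
# after restarting rke2-server, confirm the kubelet actually received the flag
ps -ef | grep -o 'pod-infra-container-image=[^ ]*' | head -n1
```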

Steps To Reproduce:

- Restart RKE2 and watch the kubelet image GC log:

systemctl restart rke2-server
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log | grep image_gc_manager

- Force kubelet GC and check for pause image:

```log
# crictl info | grep sandbox
    "sandboxImage": "index.docker.io/rancher/pause:3.2",
# crictl images | grep pause
docker.io/rancher/pause                                       3.2                              e004ddc1b078f       686kB
# df -h /var/lib
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2   30G   11G   20G  35% /
# fallocate -l 18G /var/lib/bigfile   # target >85% use, the default kubelet GC threshold
# df -h /var/lib
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2   30G   29G  1.7G  95% /
# sleep 5m
# crictl images | grep pause  # gone
#
```

Expected behavior:

[Custom] Sandbox (pause) image is never garbage collected

Actual behavior:

[Custom] Sandbox (pause) image is always garbage collected under disk pressure conditions; and in case there is nowhere to pull it from (unusual, but possible in airgap) no new pods can be started.

brandond commented 3 years ago

Can you try this on a more recent release of RKE2? The most recent release on each branch should now lease-lock all images imported from tarballs on startup. Ref:

NOTE: The lease-locking only remains present if the image is imported during the most recent startup. If you import the images, then delete the tarball and restart RKE2, the lease will be cleared and the image may get garbage collected.

brandond commented 3 years ago

Also, note that --pod-infra-container-image is a docker-specific flag and cannot be used with containerd, which is why we don't use it.

--pod-infra-container-image string     Default: k8s.gcr.io/pause:3.5 Specified image will not be pruned by the image garbage collector. When container-runtime is set to docker, all containers in each pod will use the network/IPC namespaces from this image. Other CRI implementations have their own configuration to set this image.

Instead, we pass it into the containerd config here: https://github.com/k3s-io/k3s/blob/master/pkg/agent/templates/templates_linux.go#L28
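For illustration, the value that template renders ends up as the CRI plugin's sandbox_image in the generated containerd config; assuming the default RKE2 data dir, a quick way to check it on a node is:

```sh
# show the sandbox image containerd was configured with (RKE2 default config path)
grep sandbox_image /var/lib/rancher/rke2/agent/etc/containerd/config.toml
```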

tallaxes commented 3 years ago

I did test on latest version of RKE2 as well, and observed the same behavior. (Will double check ...)

I don't expect containerd's lease-locking is going to help here. It prevents containerd from doing GC, but not kubelet. Lease-locking would not prevent an explicit image deletion via crictl rmi, would it? That's what kubelet is doing; here translating into this CRI call.

--pod-infra-container-image appears to do multiple things. The documentation is correct, I think, if a little confusing. Here is how I parse it:

Specified image will not be pruned by the image garbage collector.

Check. That's what this issue is about ...

When container-runtime is set to docker, all containers in each pod will use the network/IPC namespaces from this image. Other CRI implementations have their own configuration to set this image.

Check. That's why K3s needs to configure containerd explicitly elsewhere ...
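A direct way to test that question, assuming the rancher/pause:3.5 tag used in the later comments, is to mimic the kubelet's GC removal by hand and see whether the image survives:

```sh
# remove the sandbox image over CRI - the same operation kubelet's image GC ends up requesting
crictl rmi docker.io/rancher/pause:3.5
# if a containerd lease really protected it, something should still be listed here
crictl images | grep pause
```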

tallaxes commented 3 years ago

Retested on the latest version, same problem (that node is not air-gapped, so it can re-pull):

# rke2 -v
rke2 version v1.21.4+rke2r1 (edc8a09018f21b08e305243c1622a8032acdf3c5)
go version go1.16.6b7
# crictl info --output go-template --template "{{.config.sandboxImage}}"
index.docker.io/rancher/pause:3.5
# crictl images | grep pause
docker.io/rancher/pause                         3.5                              69bfc1b271447       299kB
# df -h /var/lib
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2   30G   12G   19G  39% /
# fallocate -l 17G /var/lib/bigfile
# df -h /var/lib
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2   30G   29G  1.5G  96% /
# crictl images | grep pause
docker.io/rancher/pause                         3.5                              69bfc1b271447       299kB
# sleep 5m
# crictl images | grep pause
# # gone

# crictl pull index.docker.io/rancher/pause:3.5
Image is up to date for sha256:69bfc1b271447c9b5c1c5dc26e3163a524b7dab051626bed18b9f366c0fa764a
# crictl images | grep pause
docker.io/rancher/pause                         3.5                              69bfc1b271447       299kB
# sleep 5m
# crictl images | grep pause
# # gone
brandond commented 3 years ago

Hmm, if lease locking doesn't prevent GC then I'm not sure what it's good for.

tallaxes commented 3 years ago

I am not that familiar with containerd GC, but from reading the docs it seems to operate at a different level, so there is likely more than one "GC" going on. The one of interest here is the kubelet's container image GC. Here is evidence that it is the kubelet explicitly removing the sandbox/pause image as part of GC:

# cat >> /etc/rancher/rke2/config.yaml <<EOF
kubelet-arg:
- "v=5"
EOF
# systemctl restart rke2-server
# df -h /var
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2   30G  8.9G   22G  30% /
# fallocate -l 19G /var/lib/bigfile
# crictl images | grep pause
docker.io/rancher/pause                         3.5                              69bfc1b271447       299kB
# tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log | grep 69bfc1b271447
I0917 03:00:41.154419 1847818 image_gc_manager.go:241] "Adding image ID to currentImages" imageID="sha256:69bfc1b271447c9b5c1c5dc26e3163a524b7dab051626bed18b9f366c0fa764a"
I0917 03:00:41.154424 1847818 image_gc_manager.go:258] "Image ID has size" imageID="sha256:69bfc1b271447c9b5c1c5dc26e3163a524b7dab051626bed18b9f366c0fa764a" size=299461
I0917 03:00:41.154478 1847818 image_gc_manager.go:359] "Evaluating image ID for possible garbage collection" imageID="sha256:69bfc1b271447c9b5c1c5dc26e3163a524b7dab051626bed18b9f366c0fa764a"
I0917 03:00:41.154485 1847818 image_gc_manager.go:375] "Removing image to free bytes" imageID="sha256:69bfc1b271447c9b5c1c5dc26e3163a524b7dab051626bed18b9f366c0fa764a" size=299461
^C
# crictl images | grep pause
# # gone
brandond commented 3 years ago

Yeah, but that just marks it as unused by the kubelet; containerd does the garbage collection separately, so it should not be possible to actually remove it while it's locked by a lease. It's possible that the tags are not locked though, just the image layers. Unfortunately the kubelet itself does not have a way to protect critical images from GC, so we were hoping to do it at a lower level.

tallaxes commented 3 years ago

Ok, maybe I need to do some more testing, now that (I think) I understand better what lease-locking is intended to achieve.

It seems lease-locking is only used when images are loaded from tar, so some of the above tests, which were not run in airgap (images not loaded from tar), do not apply. Though I am still wondering about the implications of the pause image being GC'ed (& re-pulled) in that case; it seems it would still be good to have the kubelet ignore that image altogether ...

And for airgap deployment, we do currently remove image tars after first startup (leftover from this issue), so, based on your note, any restart of RKE2 would clear the lease. So it seems some more careful testing is in order - maybe augmented with boltbrowser to verify the leases.
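A lighter-weight alternative to boltbrowser for checking the leases, assuming RKE2's bundled ctr binary and its default containerd socket path, would be something like:

```sh
# list containerd leases in the k8s.io namespace (paths are RKE2 defaults; adjust if relocated)
/var/lib/rancher/rke2/bin/ctr \
  --address /run/k3s/containerd/containerd.sock \
  --namespace k8s.io \
  leases ls
```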

brandond commented 3 years ago

Yeah, if you can test with the tars in place that would be good.

Also, according to the comment at https://github.com/kubernetes/kubernetes/issues/81756#issuecomment-523710417 we can use that flag to protect the image from kubelet GC even with containerd, so maybe there's still an improvement to be made there. It doesn't solve the GC problem for other images, though.

brandond commented 3 years ago

There's also this, which I hope will land on 1.23: https://github.com/kubernetes/kubernetes/pull/103299

stale[bot] commented 2 years ago

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

yardenshoham commented 1 year ago

Please reopen, this bug still happens

sdemura commented 1 year ago

Second that.

brandond commented 1 year ago

Please confirm you're using a recent version of k3s; we have been passing --pod-infra-container-image to the kubelet to prevent GC for ages, and upstream issues about the image getting pruned anyway should be resolved.