E0922 15:41:33.987497 10642 kubelet.go:1480] "Failed to start ContainerManager" err="failed to get rootfs info: unable to find data in memory cache"
The crash is coming from the kubelet, not any of the code here in this project. Specifically, it seems to be caused by cadvisor not being able to retrieve information on the kubelet's root dir at /var/lib/kubelet. Is there anything odd about the filesystem or mount points on this node?
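(For anyone hitting this: a few commands that can answer that question. This is a diagnostic sketch, assuming util-linux is installed and that /var/lib/kubelet is the default, unmoved kubelet root dir.)

# Check what filesystem and mount point back the kubelet root dir.
findmnt --target /var/lib/kubelet   # mount point, source device, fs type
df -h /var/lib/kubelet              # capacity and usage of that filesystem
stat -f /var/lib/kubelet            # filesystem type and block-level details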
@brandond no, this VM, like our other nodes, is simply an Alpine 3.18 install with about a 100 GB virtual disk, ext4-formatted. The setup of the VM is identical to at least 5 other VMs that run on KVM; this one runs on Hyper-V, and although it might be a wild goose chase, that's the only significant difference. We set these machines up with Ansible, with no manual work, so I'm fairly sure there are no differences in the Alpine, Docker, and K3S configuration.
I don't think there's anything we can do on our side to fix this; the error is deep in the kubelet cadvisor stuff which we do not modify in any way.
Do you run into this same error if you use containerd instead of docker? Why are you tied to using docker?
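(Side note for anyone reproducing this: k3s uses its embedded containerd whenever the --docker flag is absent, so the test is just the install command from this issue minus that flag:)

# Same install as reported below, but using k3s's embedded containerd:
curl -sfL https://get.k3s.io | sh -s - \
  --kube-apiserver-arg enable-aggregator-routing=true \
  --kubelet-arg container-log-max-files=2 \
  --kubelet-arg container-log-max-size=-1 \
  --disable traefik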
The upstream issue that seems to track this is https://github.com/kubernetes/kubernetes/issues/113066
We're using Docker on these hosts to interface with docker command-line clients. The hosts are essentially docker/k3s development nodes.
I've seen similar errors from containerd when it has just started up and hasn't collected filesystem stats yet, although those aren't usually fatal.
I don't know if docker works similarly, although docker is at this point just a wrapper around containerd anyway... but you might see if waiting 10-15 seconds after docker starts up allows the kubelet to start without running into this error?
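(A minimal sketch of that experiment, assuming the OpenRC service names Alpine uses for docker and k3s:)

# Start docker, wait until the daemon actually answers, then give it a
# grace period for stats collection before starting k3s.
rc-service docker start
until docker info >/dev/null 2>&1; do
  sleep 1
done
sleep 15
rc-service k3s start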
I've tried various timing tweaks; unfortunately, it never recovers, not even if I wait a few minutes for Docker to idle before starting the install. I just did a fully clean install (formatted disk -> Alpine -> Docker -> K3S) and I'm still getting exactly the same error.
I initially thought this was isolated to, or related to, Hyper-V. To be sure, I did a full re-installation on a KVM host as well and got exactly the same error, so Hyper-V is not a factor.
This is really weird, as I did exactly these steps on September 19th and everything worked.
I'm a bit at a loss what to try next.
This is probably the root cause of this: https://github.com/kubernetes/kubernetes/issues/120813
Update: yep, verified. All working installs were on kernel 6.1.53. Kernel 6.1.54 came out the same day I was doing those setups, but had not been synced to my mirror yet. All failing systems, regardless of the underlying hypervisor, are on kernel 6.1.54.
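(If anyone wants to check their own fleet, the comparison is one command per host; the host names here are placeholders:)

# Print the running kernel version on each node (host names are placeholders).
for host in node1 node2 node3; do
  printf '%s: ' "$host"
  ssh "$host" uname -r
done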
Huh. I'm honestly surprised to find a modern distro, on a new kernel, still using cgroup v1. Can you switch to v2?
@brandond how would I do that? My kernel seems to support both:
# grep cgroup /proc/filesystems
nodev cgroup
nodev cgroup2
That's called hybrid, and it doesn't work great either. Try https://liet.me/2022/07/08/enable-cgroups-v2-in-alpine-linux/ ?
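(For reference, what that article boils down to, as far as I know, is switching OpenRC's cgroup mode; a sketch, assuming the stock /etc/rc.conf layout:)

# Switch OpenRC from hybrid to unified (v2-only) cgroups, then reboot.
sed -i 's/^#\?rc_cgroup_mode=.*/rc_cgroup_mode="unified"/' /etc/rc.conf
reboot
# After the reboot, only cgroup2 should be mounted:
stat -fc %T /sys/fs/cgroup   # prints "cgroup2fs" on a unified setup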
@brandond I didn't have a moment to check that cgroup v2 setting, but I did upgrade to Linux kernel 6.1.55, which just came out, and it fixes the problem.
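(For completeness, the upgrade itself on Alpine, assuming the default linux-lts kernel package:)

# Pull in the fixed kernel and reboot into it.
apk update
apk upgrade linux-lts
reboot
uname -r   # should now report a 6.1.55 kernel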
Environmental Info:
K3s Version:
Node(s) CPU architecture, OS, and Version:
Cluster Configuration: Single node.
Describe the bug: Install k3s as usual; we use this configuration on many, many nodes. For some unknown reason, this single node with an identical Alpine 3.18 + Docker setup fails with an error in k3s.log stating: "Failed to start ContainerManager" err="failed to get rootfs info: unable to find data in memory cache"

Steps To Reproduce:
curl -sfL https://get.k3s.io | sh -s - --docker --kube-apiserver-arg enable-aggregator-routing=true --kubelet-arg container-log-max-files=2 --kubelet-arg container-log-max-size=-1 --disable traefik
Expected behavior: Like our other Alpine + Docker machines, I expect K3S to come up without the above error and be available.
Actual behavior: It will keep crashing in a loop. I've collected one iteration of the crash loop logs:
Additional context / logs: N/A.