E0922 15:41:33.987497 10642 kubelet.go:1480] "Failed to start ContainerManager" err="failed to get rootfs info: unable to find data in memory cache"
The crash is coming from the kubelet, not any of the code here in this project. Specifically, it seems to be caused by cadvisor not being able to retrieve information on the kubelet's root dir at /var/lib/kubelet. Is there anything odd about the filesystem or mount points on this node?
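(For anyone hitting this: a few commands that can answer that question. This is a diagnostic sketch, assuming util-linux is installed and that /var/lib/kubelet is the default, unmoved kubelet root dir.)

# Check what filesystem and mount point back the kubelet root dir.
findmnt --target /var/lib/kubelet   # mount point, source device, fs type
df -h /var/lib/kubelet              # capacity and usage of that filesystem
stat -f /var/lib/kubelet            # filesystem type and block-level details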
@brandond no, this VM, like our other nodes, is simply an Alpine 3.18 install with about a 100 GB virtual disk, ext4-formatted. The setup of the VM is identical to at least 5 other VMs that run on KVM; this one runs on Hyper-V, and although it might be a wild goose chase, that's the only significant difference. We set these machines up with Ansible, with no manual work, so I'm fairly sure there are no differences in the Alpine, Docker, and K3S configuration.
I don't think there's anything we can do on our side to fix this; the error is deep in the kubelet cadvisor stuff which we do not modify in any way.
Do you run into this same error if you use containerd instead of docker? Why are you tied to using docker?
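(Side note for anyone reproducing this: k3s uses its embedded containerd whenever the --docker flag is absent, so the test is just the install command from this issue minus that flag:)

# Same install as reported below, but using k3s's embedded containerd:
curl -sfL https://get.k3s.io | sh -s - \
  --kube-apiserver-arg enable-aggregator-routing=true \
  --kubelet-arg container-log-max-files=2 \
  --kubelet-arg container-log-max-size=-1 \
  --disable traefik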
The upstream issue that seems to track this is https://github.com/kubernetes/kubernetes/issues/113066
We're using Docker on these hosts to interface with docker command-line clients. The hosts are essentially docker/k3s development nodes.
I've seen similar errors from containerd when it has just started up and hasn't collected filesystem stats yet, although those aren't usually fatal.
I don't know if docker works similarly, although docker is at this point just a wrapper around containerd anyway... but you might see if waiting 10-15 seconds after docker starts up allows the kubelet to start without running into this error?
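(A minimal sketch of that experiment, assuming the OpenRC service names Alpine uses for docker and k3s:)

# Start docker, wait until the daemon actually answers, then give it a
# grace period for stats collection before starting k3s.
rc-service docker start
until docker info >/dev/null 2>&1; do
  sleep 1
done
sleep 15
rc-service k3s start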
I've tried various timing tweaks; unfortunately, it never recovers, not even if I wait a few minutes for Docker to idle before starting the install. I just did a fully clean install (formatted disk -> Alpine -> Docker -> K3S) and I'm still getting exactly the same error.
I initially thought this was isolated to, or related to, Hyper-V. To be sure, I did a full re-installation on a KVM host as well and got exactly the same error, so Hyper-V is not a factor.
This is really weird, as I did exactly these steps on September 19th and everything worked.
I'm a bit at a loss what to try next.
This is probably the root cause of this: https://github.com/kubernetes/kubernetes/issues/120813
Update: yep, verified. All working installs were on kernel 6.1.53. Kernel 6.1.54 came out the same day I was doing those setups, but had not been synced to my mirror yet. All failing systems, regardless of the underlying hypervisor, are on kernel 6.1.54.
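(If anyone wants to check their own fleet, the comparison is one command per host; the host names here are placeholders:)

# Print the running kernel version on each node (host names are placeholders).
for host in node1 node2 node3; do
  printf '%s: ' "$host"
  ssh "$host" uname -r
done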
Huh. I'm honestly surprised to find a modern distro, on a new kernel, still using cgroup v1. Can you switch to v2?
@brandond how would I do that? My kernel seems to support both:
# grep cgroup /proc/filesystems
nodev cgroup
nodev cgroup2
That's called hybrid, and it doesn't work great either. Try https://liet.me/2022/07/08/enable-cgroups-v2-in-alpine-linux/ ?
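(For reference, what that article boils down to, as far as I know, is switching OpenRC's cgroup mode; a sketch, assuming the stock /etc/rc.conf layout:)

# Switch OpenRC from hybrid to unified (v2-only) cgroups, then reboot.
sed -i 's/^#\?rc_cgroup_mode=.*/rc_cgroup_mode="unified"/' /etc/rc.conf
reboot
# After the reboot, only cgroup2 should be mounted:
stat -fc %T /sys/fs/cgroup   # prints "cgroup2fs" on a unified setup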
@brandond I didn't have a moment to check that cgroup v2 setting, but I did upgrade to Linux kernel 6.1.55, which just came out, and it fixes the problem.
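(For completeness, the upgrade itself on Alpine, assuming the default linux-lts kernel package:)

# Pull in the fixed kernel and reboot into it.
apk update
apk upgrade linux-lts
reboot
uname -r   # should now report a 6.1.55 kernel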
Environmental Info:
K3s Version:
Node(s) CPU architecture, OS, and Version:
Cluster Configuration: Single node.
Describe the bug: Install k3s as usual; we use this configuration on many, many nodes. For some unknown reason, this single node with an identical Alpine 3.18 + Docker setup fails with an error in k3s.log stating: "Failed to start ContainerManager" err="failed to get rootfs info: unable to find data in memory cache"

Steps To Reproduce:
curl -sfL https://get.k3s.io | sh -s - --docker --kube-apiserver-arg enable-aggregator-routing=true --kubelet-arg container-log-max-files=2 --kubelet-arg container-log-max-size=-1 --disable traefik
Expected behavior: Like our other Alpine + Docker machines, I expect K3S to come up without the above error and be available.
Actual behavior: It will keep crashing in a loop. I've collected one iteration of the crash loop logs:
Additional context / logs: N/A.