agaxprex closed this issue 6 months ago
Please attach the complete rke2-server log from journald, as well as the kubelet log. The information you've provided is not sufficient to determine why pods aren't being started by the kubelet.
kubelet.log
Flag --volume-plugin-dir has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --file-check-frequency has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --sync-frequency has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --address has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --anonymous-auth has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --authentication-token-webhook has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --authorization-mode has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --client-ca-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --cluster-dns has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --cluster-domain has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --container-runtime-endpoint has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.
Flag --eviction-hard has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --eviction-minimum-reclaim has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --fail-swap-on has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --feature-gates has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --healthz-bind-address has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --kubelet-cgroups has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --pod-infra-container-image has been deprecated, will be removed in a future release. Image garbage collector will get sandbox image information from CRI.
Flag --pod-manifest-path has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --read-only-port has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --serialize-image-pulls has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --tls-cert-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --tls-private-key-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
I0517 21:05:16.356981 28157 server.go:204] "--pod-infra-container-image will not be pruned by the image garbage collector in kubelet and should also be set in the remote runtime"
I0517 21:05:16.358658 28157 server.go:487] "Kubelet version" kubeletVersion="v1.29.0+rke2r1"
I0517 21:05:16.358670 28157 server.go:489] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I0517 21:05:16.360876 28157 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/var/lib/rancher/rke2/agent/client-ca.crt"
I0517 21:05:16.388790 28157 server.go:745] "--cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /"
I0517 21:05:16.389246 28157 container_manager_linux.go:265] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
I0517 21:05:16.389354 28157 container_manager_linux.go:270] "Creating Container Manager object based on Node Config" nodeConfig={"RuntimeCgroupsName":"","SystemCgroupsName":"","KubeletCgroupsName":"/rke2","KubeletOOMScoreAdj":-999,"ContainerRuntime":"","CgroupsPerQOS":true,"CgroupRoot":"/","CgroupDriver":"cgroupfs","KubeletRootDir":"/var/lib/kubelet","ProtectKernelDefaults":false,"KubeReservedCgroupName":"","SystemReservedCgroupName":"","ReservedSystemCPUs":{},"EnforceNodeAllocatable":{"pods":{}},"KubeReserved":null,"SystemReserved":null,"HardEvictionThresholds":[{"Signal":"imagefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null},{"Signal":"nodefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null}],"QOSReserved":{},"CPUManagerPolicy":"none","CPUManagerPolicyOptions":null,"TopologyManagerScope":"container","CPUManagerReconcilePeriod":10000000000,"ExperimentalMemoryManagerPolicy":"None","ExperimentalMemoryManagerReservedMemory":null,"PodPidsLimit":-1,"EnforceCPULimits":true,"CPUCFSQuotaPeriod":100000000,"TopologyManagerPolicy":"none","TopologyManagerPolicyOptions":null}
I0517 21:05:16.389368 28157 topology_manager.go:138] "Creating topology manager with none policy"
I0517 21:05:16.389375 28157 container_manager_linux.go:301] "Creating device plugin manager"
I0517 21:05:16.389394 28157 state_mem.go:36] "Initialized new in-memory state store"
I0517 21:05:16.589790 28157 server.go:863] "Failed to ApplyOOMScoreAdj" err="write /proc/self/oom_score_adj: permission denied"
I0517 21:05:16.589837 28157 kubelet.go:396] "Attempting to sync node with API server"
I0517 21:05:16.589848 28157 kubelet.go:301] "Adding static pod path" path="/var/lib/rancher/rke2/agent/pod-manifests"
I0517 21:05:16.589863 28157 kubelet.go:312] "Adding apiserver pod source"
I0517 21:05:16.589873 28157 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
I0517 21:05:16.590147 28157 kuberuntime_manager.go:258] "Container runtime initialized" containerRuntime="containerd" version="v1.7.11-k3s2" apiVersion="v1"
W0517 21:05:16.590165 28157 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.Node: Get "https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3Dcontrol-node-0&limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0517 21:05:16.590198 28157 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3Dcontrol-node-0&limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
W0517 21:05:16.590235 28157 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0517 21:05:16.590260 28157 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
I0517 21:05:16.590260 28157 kubelet.go:809] "Not starting ClusterTrustBundle informer because we are in static kubelet mode"
E0517 21:05:16.590500 28157 server.go:1245] "Failed to set rlimit on max file handles" err="operation not permitted"
I0517 21:05:16.590509 28157 server.go:1256] "Started kubelet"
I0517 21:05:16.590591 28157 server.go:162] "Starting to listen" address="0.0.0.0" port=10250
I0517 21:05:16.590739 28157 ratelimit.go:55] "Setting rate limiting for endpoint" service="podresources" qps=100 burstTokens=10
E0517 21:05:16.590781 28157 event.go:355] "Unable to write event (may retry after sleeping)" err="Post \"https://127.0.0.1:6443/api/v1/namespaces/default/events\": dial tcp 127.0.0.1:6443: connect: connection refused" event="&Event{ObjectMeta:{control-node-0.17d062a47a149f57 default 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:control-node-0,UID:control-node-0,APIVersion:,ResourceVersion:,FieldPath:,},Reason:Starting,Message:Starting kubelet.,Source:EventSource{Component:kubelet,Host:control-node-0,},FirstTimestamp:2024-05-17 21:05:16.590489431 +0000 UTC m=+0.259499249,LastTimestamp:2024-05-17 21:05:16.590489431 +0000 UTC m=+0.259499249,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:kubelet,ReportingInstance:control-node-0,}"
I0517 21:05:16.590824 28157 server.go:233] "Starting to serve the podresources API" endpoint="unix:/var/lib/kubelet/pod-resources/kubelet.sock"
E0517 21:05:16.591404 28157 kubelet.go:1462] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
I0517 21:05:16.591659 28157 server.go:461] "Adding debug handlers to kubelet server"
I0517 21:05:16.592274 28157 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
I0517 21:05:16.592380 28157 volume_manager.go:291] "Starting Kubelet Volume Manager"
I0517 21:05:16.592431 28157 desired_state_of_world_populator.go:151] "Desired state populator starts to run"
I0517 21:05:16.593286 28157 reconciler_new.go:29] "Reconciler: start to sync state"
I0517 21:05:16.593317 28157 factory.go:221] Registration of the systemd container factory successfully
I0517 21:05:16.593388 28157 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file or directory
E0517 21:05:16.593796 28157 controller.go:145] "Failed to ensure lease exists, will retry" err="Get \"https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/control-node-0?timeout=10s\": dial tcp 127.0.0.1:6443: connect: connection refused" interval="200ms"
W0517 21:05:16.593567 28157 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.CSIDriver: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0517 21:05:16.593819 28157 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.CSIDriver: failed to list *v1.CSIDriver: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
I0517 21:05:16.594315 28157 factory.go:221] Registration of the containerd container factory successfully
I0517 21:05:16.600648 28157 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv4"
I0517 21:05:16.601721 28157 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv6"
I0517 21:05:16.601738 28157 status_manager.go:217] "Starting to sync pod status with apiserver"
I0517 21:05:16.601749 28157 kubelet.go:2329] "Starting kubelet main sync loop"
E0517 21:05:16.601772 28157 kubelet.go:2353] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
W0517 21:05:16.601991 28157 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.RuntimeClass: Get "https://127.0.0.1:6443/apis/node.k8s.io/v1/runtimeclasses?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0517 21:05:16.602015 28157 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.RuntimeClass: failed to list *v1.RuntimeClass: Get "https://127.0.0.1:6443/apis/node.k8s.io/v1/runtimeclasses?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
I0517 21:05:16.640298 28157 cpu_manager.go:214] "Starting CPU manager" policy="none"
I0517 21:05:16.640309 28157 cpu_manager.go:215] "Reconciling" reconcilePeriod="10s"
I0517 21:05:16.640322 28157 state_mem.go:36] "Initialized new in-memory state store"
I0517 21:05:16.640421 28157 state_mem.go:88] "Updated default CPUSet" cpuSet=""
I0517 21:05:16.640440 28157 state_mem.go:96] "Updated CPUSet assignments" assignments={}
I0517 21:05:16.640447 28157 policy_none.go:49] "None policy: Start"
I0517 21:05:16.640698 28157 memory_manager.go:170] "Starting memorymanager" policy="None"
I0517 21:05:16.640710 28157 state_mem.go:35] "Initializing new in-memory state store"
I0517 21:05:16.640816 28157 state_mem.go:75] "Updated machine memory state"
E0517 21:05:16.642305 28157 node_container_manager_linux.go:61] "Failed to create cgroup" err="mkdir /sys/fs/cgroup/cpuset/kubepods: permission denied" cgroupName=["kubepods"]
E0517 21:05:16.642316 28157 kubelet.go:1542] "Failed to start ContainerManager" err="mkdir /sys/fs/cgroup/cpuset/kubepods: permission denied"
journalctl -u rke2-server output: journalctl.txt
Note: I'm running rke2 as the root user, so I'm not sure why permission-denied errors occur. SELinux is disabled for this test.
E0517 21:05:16.642305 28157 node_container_manager_linux.go:61] "Failed to create cgroup" err="mkdir /sys/fs/cgroup/cpuset/kubepods: permission denied" cgroupName=["kubepods"]
E0517 21:05:16.642316 28157 kubelet.go:1542] "Failed to start ContainerManager" err="mkdir /sys/fs/cgroup/cpuset/kubepods: permission denied"
Is selinux enabled on this node? If you're running rke2 on an selinux-enabled host, you must install the rke2-selinux package and ensure that the correct contexts are set. Normally this is handled by the rke2 RPMs, but if you're using airgap and a tarball install you will need to do this manually.
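The check described above can be sketched as follows; `mode` stands in for the output of `getenforce` on the node, and the value shown is hypothetical:

```shell
# Hedged sketch: decide whether the rke2-selinux policy package is needed.
# "mode" stands in for the output of `getenforce`; the value is hypothetical.
mode='Enforcing'
case "$mode" in
  Enforcing|Permissive)
    # On selinux-enabled hosts the policy package must be present
    # (normally pulled in by the rke2 RPMs; manual for airgap/tarball installs).
    echo "SELinux active: verify the rke2-selinux package is installed (rpm -q rke2-selinux)"
    ;;
  *)
    echo "SELinux disabled: rke2-selinux policy package not required"
    ;;
esac
```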
This sounds like a duplicate of
I disabled SELinux, but it seems kubelet is still crashing; it doesn't appear to be a permissions problem, but rather, I think, the cgroup driver. The system is running cgroups v2; would that prevent kubelet from starting the ContainerManager?
E0518 19:48:52.148026 733385 cgroup_manager_linux.go:476] cgroup manager.Set failed: openat2 /sys/fs/cgroup/kubepods.slice/cpu.weight: no such file or directory
E0518 19:48:52.148149 733385 kubelet.go:1542] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"
Full log posted: kubelet.log
Are you missing critical cgroups? What is the output of grep cgroup /proc/mounts and cat /proc/cgroups?
I think I'm missing some associated with cpu; I'm not familiar with the ones kubelet requires or with how to enable them. Currently using cgroups v2. The system did previously have K3s running on it; I'm just testing a migration to RKE2. Not sure if K3s altered what was enabled.
$ cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 0 1041 1
cpu 0 1041 1
cpuacct 0 1041 1
blkio 0 1041 1
memory 0 1041 1
devices 0 1041 1
freezer 0 1041 1
net_cls 0 1041 1
perf_event 0 1041 1
net_prio 0 1041 1
hugetlb 0 1041 1
pids 0 1041 1
rdma 0 1041 1
$ grep cgroup /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
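For reference, the output above is what a unified (v2) hierarchy looks like: every controller in /proc/cgroups shows hierarchy 0, and a single cgroup2 mount replaces the per-controller v1 mounts. A small sketch of the detection logic, using the mount line pasted above as sample input:

```shell
# Hedged sketch: detect cgroup v2 from a /proc/mounts line.
# "mounts" holds the line pasted above; on a live node you would read
# /proc/mounts directly (grep cgroup /proc/mounts).
mounts='cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0'
if printf '%s\n' "$mounts" | grep -q '^cgroup2 /sys/fs/cgroup cgroup2 '; then
  echo "cgroup v2 (unified hierarchy)"
  # On v2 the authoritative controller list lives in the filesystem,
  # not in /proc/cgroups:
  #   cat /sys/fs/cgroup/cgroup.controllers
else
  echo "cgroup v1 (or hybrid) hierarchy"
fi
```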
The system I'm in did previously have K3s running on it, just testing a migration to RKE2.
Have you run the k3s uninstall script and rebooted the node since it was running k3s?
Yep, it was uninstalled via the script, and the node was rebooted before installing RKE2.
mkdir /sys/fs/cgroup/cpuset/kubepods: permission denied
The failure to create the root cgroup seems to be the critical error. Do you have any other security modules present on this system that would prevent the kubelet from creating new cgroups?
I fixed the previous error but hit a new one. That error no longer appears in the second kubelet.log I uploaded; there is a new one instead. Here is the updated log:
[Deprecated-flag warnings matching those in the first kubelet.log omitted.]
I0518 19:48:51.965331 733385 server.go:204] "--pod-infra-container-image will not be pruned by the image garbage collector in kubelet and should also be set in the remote runtime"
I0518 19:48:51.967025 733385 server.go:487] "Kubelet version" kubeletVersion="v1.29.0+rke2r1"
I0518 19:48:51.967037 733385 server.go:489] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I0518 19:48:51.968569 733385 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/var/lib/rancher/rke2/agent/client-ca.crt"
I0518 19:48:52.047736 733385 server.go:745] "--cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /"
I0518 19:48:52.047849 733385 container_manager_linux.go:265] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
I0518 19:48:52.047972 733385 container_manager_linux.go:270] "Creating Container Manager object based on Node Config" nodeConfig={"RuntimeCgroupsName":"","SystemCgroupsName":"","KubeletCgroupsName":"","KubeletOOMScoreAdj":-999,"ContainerRuntime":"","CgroupsPerQOS":true,"CgroupRoot":"/","CgroupDriver":"systemd","KubeletRootDir":"/var/lib/kubelet","ProtectKernelDefaults":false,"KubeReservedCgroupName":"","SystemReservedCgroupName":"","ReservedSystemCPUs":{},"EnforceNodeAllocatable":{"pods":{}},"KubeReserved":null,"SystemReserved":null,"HardEvictionThresholds":[{"Signal":"nodefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null},{"Signal":"imagefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null}],"QOSReserved":{},"CPUManagerPolicy":"none","CPUManagerPolicyOptions":null,"TopologyManagerScope":"container","CPUManagerReconcilePeriod":10000000000,"ExperimentalMemoryManagerPolicy":"None","ExperimentalMemoryManagerReservedMemory":null,"PodPidsLimit":-1,"EnforceCPULimits":true,"CPUCFSQuotaPeriod":100000000,"TopologyManagerPolicy":"none","TopologyManagerPolicyOptions":null}
I0518 19:48:52.047988 733385 topology_manager.go:138] "Creating topology manager with none policy"
I0518 19:48:52.047994 733385 container_manager_linux.go:301] "Creating device plugin manager"
I0518 19:48:52.048013 733385 state_mem.go:36] "Initialized new in-memory state store"
I0518 19:48:52.048087 733385 kubelet.go:396] "Attempting to sync node with API server"
I0518 19:48:52.048097 733385 kubelet.go:301] "Adding static pod path" path="/var/lib/rancher/rke2/agent/pod-manifests"
I0518 19:48:52.048115 733385 kubelet.go:312] "Adding apiserver pod source"
I0518 19:48:52.048125 733385 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
I0518 19:48:52.048474 733385 kuberuntime_manager.go:258] "Container runtime initialized" containerRuntime="containerd" version="v1.7.11-k3s2" apiVersion="v1"
W0518 19:48:52.048473 733385 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0518 19:48:52.048509 733385 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
W0518 19:48:52.048498 733385 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.Node: Get "https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3De6le100&limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0518 19:48:52.048525 733385 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3De6le100&limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
I0518 19:48:52.048582 733385 kubelet.go:809] "Not starting ClusterTrustBundle informer because we are in static kubelet mode"
I0518 19:48:52.048814 733385 server.go:1256] "Started kubelet"
I0518 19:48:52.048850 733385 server.go:162] "Starting to listen" address="0.0.0.0" port=10250
I0518 19:48:52.048880 733385 ratelimit.go:55] "Setting rate limiting for endpoint" service="podresources" qps=100 burstTokens=10
I0518 19:48:52.049000 733385 server.go:233] "Starting to serve the podresources API" endpoint="unix:/var/lib/kubelet/pod-resources/kubelet.sock"
E0518 19:48:52.049121 733385 event.go:355] "Unable to write event (may retry after sleeping)" err="Post \"https://127.0.0.1:6443/api/v1/namespaces/default/events\": dial tcp 127.0.0.1:6443: connect: connection refused" event="&Event{ObjectMeta:{e6le100.17d0ba26624b8df1 default 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:e6le100,UID:e6le100,APIVersion:,ResourceVersion:,FieldPath:,},Reason:Starting,Message:Starting kubelet.,Source:EventSource{Component:kubelet,Host:e6le100,},FirstTimestamp:2024-05-18 19:48:52.048801265 -0400 EDT m=+0.110198176,LastTimestamp:2024-05-18 19:48:52.048801265 -0400 EDT m=+0.110198176,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:kubelet,ReportingInstance:e6le100,}"
I0518 19:48:52.049765 733385 server.go:461] "Adding debug handlers to kubelet server"
E0518 19:48:52.049883 733385 kubelet.go:1462] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
I0518 19:48:52.049890 733385 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
I0518 19:48:52.049929 733385 volume_manager.go:291] "Starting Kubelet Volume Manager"
I0518 19:48:52.049986 733385 desired_state_of_world_populator.go:151] "Desired state populator starts to run"
I0518 19:48:52.050005 733385 reconciler_new.go:29] "Reconciler: start to sync state"
E0518 19:48:52.050212 733385 controller.go:145] "Failed to ensure lease exists, will retry" err="Get \"https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/e6le100?timeout=10s\": dial tcp 127.0.0.1:6443: connect: connection refused" interval="200ms"
I0518 19:48:52.050302 733385 factory.go:221] Registration of the systemd container factory successfully
I0518 19:48:52.050362 733385 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file or directory
W0518 19:48:52.050526 733385 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.CSIDriver: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0518 19:48:52.050567 733385 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.CSIDriver: failed to list *v1.CSIDriver: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
I0518 19:48:52.052015 733385 factory.go:221] Registration of the containerd container factory successfully
I0518 19:48:52.057899 733385 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv4"
I0518 19:48:52.058875 733385 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv6"
I0518 19:48:52.058889 733385 status_manager.go:217] "Starting to sync pod status with apiserver"
I0518 19:48:52.058898 733385 kubelet.go:2329] "Starting kubelet main sync loop"
E0518 19:48:52.058928 733385 kubelet.go:2353] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
W0518 19:48:52.059112 733385 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.RuntimeClass: Get "https://127.0.0.1:6443/apis/node.k8s.io/v1/runtimeclasses?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0518 19:48:52.059148 733385 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.RuntimeClass: failed to list *v1.RuntimeClass: Get "https://127.0.0.1:6443/apis/node.k8s.io/v1/runtimeclasses?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
I0518 19:48:52.109545 733385 cpu_manager.go:214] "Starting CPU manager" policy="none"
I0518 19:48:52.109556 733385 cpu_manager.go:215] "Reconciling" reconcilePeriod="10s"
I0518 19:48:52.109566 733385 state_mem.go:36] "Initialized new in-memory state store"
I0518 19:48:52.110560 733385 state_mem.go:88] "Updated default CPUSet" cpuSet=""
I0518 19:48:52.110578 733385 state_mem.go:96] "Updated CPUSet assignments" assignments={}
I0518 19:48:52.110584 733385 policy_none.go:49] "None policy: Start"
I0518 19:48:52.110930 733385 memory_manager.go:170] "Starting memorymanager" policy="None"
I0518 19:48:52.110946 733385 state_mem.go:35] "Initializing new in-memory state store"
I0518 19:48:52.111060 733385 state_mem.go:75] "Updated machine memory state"
E0518 19:48:52.148026 733385 cgroup_manager_linux.go:476] cgroup manager.Set failed: openat2 /sys/fs/cgroup/kubepods.slice/cpu.weight: no such file or directory
E0518 19:48:52.148149 733385 kubelet.go:1542] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"
I do notice the following:
I0518 19:48:52.109545 733385 cpu_manager.go:214] "Starting CPU manager" policy="none"
I0518 19:48:52.109556 733385 cpu_manager.go:215] "Reconciling" reconcilePeriod="10s"
I0518 19:48:52.109566 733385 state_mem.go:36] "Initialized new in-memory state store"
I0518 19:48:52.110560 733385 state_mem.go:88] "Updated default CPUSet" cpuSet=""
cpuSet seems to be empty.
That would probably be expected, since you're not using a cpu manager (policy="none").
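For reference, the CPU manager policy comes from the kubelet's KubeletConfiguration. A hedged sketch of the relevant fragment (the file path here is hypothetical; the "none" default is what the log above shows):

```shell
# Sketch: with cpuManagerPolicy "none" the shared cpuSet stays empty, as in the
# log. "static" would instead pin Guaranteed-QoS pods to exclusive CPUs.
cat <<'EOF' > /tmp/kubelet-config-fragment.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: none
EOF
grep cpuManagerPolicy /tmp/kubelet-config-fragment.yaml
```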
I see that you're providing an rke2 config file with cni: none
and presumably planning to deploy your own CNI, are you by any chance also providing a custom containerd config template or anything else along those lines?
cni: none
was set just to test whether the kubelet launches. Eventually planning on launching Multus with Canal or Cilium, but holding off for now. Just running with the default containerd configuration, and it is able to load images; kubelet fails to run them. The problem is that ContainerManager fails to start, which causes kubelet to exit, so containers don't launch. Not sure if it is related to missing cgroups, or if there is another way to debug what the following error refers to:
E0518 19:48:52.148149 733385 kubelet.go:1542] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"
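One way to see what that error is pointing at is to inspect the cgroup hierarchy directly. A read-only diagnostic sketch, assuming a systemd host with cgroups mounted at /sys/fs/cgroup:

```shell
# Which cgroup version is mounted? "cgroup2fs" = v2 (unified), "tmpfs" = v1/hybrid.
stat -fc %T /sys/fs/cgroup/

# On cgroup v2, the kubelet can only create kubepods.slice/cpu.weight if the
# "cpu" controller is delegated in subtree_control; a list without "cpu" would
# explain the openat2 .../cpu.weight failure above.
cat /sys/fs/cgroup/cgroup.subtree_control 2>/dev/null || echo "no subtree_control (cgroup v1)"

# Does the kubelet's root cgroup exist at all?
ls -d /sys/fs/cgroup/kubepods.slice 2>/dev/null || echo "kubepods.slice missing"
```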
It seems like something is going on with the cgroups but I'm not sure what. First the permissions errors and now just complaining that expected cgroup containers don't exist. Are you sure everything's been cleaned off this node? There is nothing else trying to manage the kubelet's expected cgroups? Are you able to reboot again to perhaps get things reset to a known-good state?
I do have podman running on the node, but that's the only other component still on the system that uses cgroups. Would it be safe to clear out /sys/fs/cgroup and reboot? Not sure if the directory would be recreated.
cgroups don't persist across boots, so you shouldn't need to manually clear anything out.
Wondering if I'm missing any kernel parameters that need to be enabled in grub. Will take a look then reboot.
Resolved: Apparently I was missing some cgroups. Notably, with cgroup v2 enabled on this kernel, the cpu controller was missing from the subtree control, meaning kubepods.slice was unable to use it. Swapping the system back to cgroup v1 resolved all issues. Disabled hybrid support as well, as that also caused problems.
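For anyone hitting the same wall, a hedged sketch of both routes (assumptions: an EL8 host with grubby available; the mutating commands are shown behind echo/comments so nothing changes until you remove the guard):

```shell
# Option A: stay on cgroup v2 and delegate the cpu controller so that
# kubepods.slice can create cpu.weight. To apply (as root):
#   echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control
grep -q cpu /sys/fs/cgroup/cgroup.subtree_control 2>/dev/null \
  && echo "cpu controller already delegated" \
  || echo "cpu controller not delegated (or not cgroup v2)"

# Option B: what resolved this issue -- boot back into cgroup v1.
# On EL8, grubby edits the kernel command line for every installed kernel:
ARGS="systemd.unified_cgroup_hierarchy=0"
echo grubby --update-kernel=ALL --args="$ARGS"   # drop the leading echo to apply, then reboot
```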
Environmental Info: RKE2 Version: rke2 version v1.29.0+rke2r1 (4fd30c26c91dd3f2f623c5af00d1ebcfec8c2709) go version go1.21.5 X:boringcrypto
Node(s) CPU architecture, OS, and Version: Linux control-node-0 4.18.0-513.24.1.el8_9.x86_64 #1 SMP Thu Mar 14 14:20:09 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration: 1 server node
Describe the bug: On the start of the RKE2 service from an airgap install script, containerd fails to launch any containers:
Steps To Reproduce: run systemctl start rke2-server and wait until the service starts. The config.yaml file doesn't enable a CNI, and containerd doesn't launch any containers.
Expected behavior: Containerd downloads images from tar files and runs them, then kubelet is able to start the cluster.
Actual behavior: Containerd and kubelet are running with open ports. Images are downloaded, as shown by crictl image ls, but the containers are not run.
Additional context / logs:
containerd.log