rancher / rke2

https://docs.rke2.io/
Apache License 2.0

RKE2 Airgap Install: containerd fails to create and run containers #5923

Closed · agaxprex closed this issue 4 months ago

agaxprex commented 4 months ago

Environmental Info: RKE2 Version: rke2 version v1.29.0+rke2r1 (4fd30c26c91dd3f2f623c5af00d1ebcfec8c2709) go version go1.21.5 X:boringcrypto

Node(s) CPU architecture, OS, and Version: Linux control-node-0 4.18.0-513.24.1.el8_9.x86_64 #1 SMP Thu Mar 14 14:20:09 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 1 server node

Describe the bug: When the RKE2 service is started after an airgap install (via the script below), containerd fails to launch any containers.

Steps To Reproduce:

# Stage the airgap image tarball and the config file where RKE2 expects them
IMAGE_DIR=/var/lib/rancher/rke2/agent/images
MANIFEST_DIR=/var/lib/rancher/rke2/agent/images
CONFIG_DIR=/etc/rancher/rke2

mkdir -vp $IMAGE_DIR
mkdir -vp $MANIFEST_DIR
mkdir -vp $CONFIG_DIR

cp -v $(pwd)/rke2-images-core.linux-amd64.tar.gz $IMAGE_DIR/
cp -v $(pwd)/config.yaml $CONFIG_DIR

# Run the install script in tarball mode against the locally staged artifacts
export PATH=/var/lib/rancher/rke2/bin:$PATH
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
INSTALL_RKE2_ARTIFACT_PATH=$(pwd) INSTALL_RKE2_METHOD=tar sh install.sh
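
Not shown above: the tar install lays down an rke2-server systemd unit, and the failure described below presumably occurs once that service is started, e.g.:

# start the server unit created by install.sh (assumed step, not part of the original script)
systemctl enable --now rke2-server.service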

The config.yaml file does not enable a CNI (cni is set to none), and containerd doesn't launch any containers:

cni:
 - none
tls-san:
 - "localhost"
write-kubeconfig-mode: "0644"

Expected behavior: Containerd loads the images from the tar files and runs them, and kubelet is then able to start the cluster.

Actual behavior: Containerd and kubelet are running, with ports open:

[root@control-node-0 rke2-artifacts]# ss -tunlp
Netid       State        Recv-Q       Send-Q             Local Address:Port              Peer Address:Port      Process
tcp         LISTEN       0            128                    127.0.0.1:10010                  0.0.0.0:*          users:(("containerd",pid=9761,fd=14))
tcp         LISTEN       0            128                            *:9345                         *:*          users:(("rke2",pid=9729,fd=7))

Images are imported, as crictl image ls shows:

[root@control-node-0 rke2-artifacts]# crictl image ls
IMAGE                                                                TAG                                        IMAGE ID            SIZE
docker.io/rancher/hardened-cluster-autoscaler                        v1.8.6-build20231009                       1b5aedc400f2a       159MB
docker.io/rancher/hardened-coredns                                   v1.10.1-build20231009                      99513a1cabd6f       180MB
docker.io/rancher/hardened-dns-node-cache                            1.22.20-build20231010                      760c20ed3d883       187MB
docker.io/rancher/hardened-etcd                                      v3.5.9-k3s1-build20230802                  c6b7a4f2f79b2       168MB
docker.io/rancher/hardened-k8s-metrics-server                        v0.6.3-build20231009                       9051c01da7c50       173MB
docker.io/rancher/hardened-kubernetes                                v1.29.0-rke2r1-build20231213               ab7fea56055ba       724MB
docker.io/rancher/klipper-helm                                       v0.8.2-build20230815                       5f89cb8137ccb       256MB
docker.io/rancher/klipper-lb                                         v0.4.4                                     af74bd845c4a8       12.5MB
docker.io/rancher/mirrored-ingress-nginx-kube-webhook-certgen        v20230312-helm-chart-4.5.2-28-g66a760794   5a86b03a88d23       48.5MB
docker.io/rancher/mirrored-pause                                     3.6                                        6270bb605e12e       686kB
docker.io/rancher/mirrored-sig-storage-snapshot-controller           v6.2.1                                     1ef6c138bd5f2       58.4MB
docker.io/rancher/mirrored-sig-storage-snapshot-validation-webhook   v6.2.2                                     ff52c2bcf9f88       49MB
docker.io/rancher/nginx-ingress-controller                           nginx-1.9.3-hardened1                      bfdece8fa3f14       800MB
docker.io/rancher/rke2-cloud-provider                                v1.28.2-build20231016                      00bad7e3c2263       188MB
docker.io/rancher/rke2-runtime                                       v1.29.0-rke2r1                             4f24500e13075       338MB

But no containers are running:

[root@control-node-0 rke2-artifacts]# crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID              POD

Additional context / logs:

containerd.log

[root@control-node-0 containerd]# cat containerd.log
time="2024-05-17T17:25:08.979102910Z" level=info msg="starting containerd" revision=64c566767b99d0bacf26fb6e2355775c413331c9 version=v1.7.11-k3s2
time="2024-05-17T17:25:08.992331620Z" level=info msg="loading plugin \"io.containerd.warning.v1.deprecations\"..." type=io.containerd.warning.v1
time="2024-05-17T17:25:08.992353418Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.aufs\"..." type=io.containerd.snapshotter.v1
time="2024-05-17T17:25:08.993537408Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.aufs\"..." error="aufs is not supported (modprobe aufs failed: exit status 1 \"modprobe: FATAL: Module aufs not found in directory /lib/modules/4.18.0-513.24.1.el8_9.x86_64\\n\"): skip plugin" type=io.containerd.snapshotter.v1
time="2024-05-17T17:25:08.993552099Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.btrfs\"..." type=io.containerd.snapshotter.v1
time="2024-05-17T17:25:08.993947263Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.btrfs\"..." error="path /var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.btrfs (tmpfs) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
time="2024-05-17T17:25:08.993957186Z" level=info msg="loading plugin \"io.containerd.content.v1.content\"..." type=io.containerd.content.v1
time="2024-05-17T17:25:08.993977320Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.blockfile\"..." type=io.containerd.snapshotter.v1
time="2024-05-17T17:25:08.993992225Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.blockfile\"..." error="no scratch file generator: skip plugin" type=io.containerd.snapshotter.v1
time="2024-05-17T17:25:08.993998364Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.native\"..." type=io.containerd.snapshotter.v1
time="2024-05-17T17:25:08.994014149Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.overlayfs\"..." type=io.containerd.snapshotter.v1
time="2024-05-17T17:25:08.994083804Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.fuse-overlayfs\"..." type=io.containerd.snapshotter.v1
time="2024-05-17T17:25:08.994103767Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.stargz\"..." type=io.containerd.snapshotter.v1
time="2024-05-17T17:25:08.994114519Z" level=warning msg="Could not retrieve statfs for /var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.stargz/snapshotter" error="no such file or directory"
time="2024-05-17T17:25:09.010246964Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.devmapper\"..." type=io.containerd.snapshotter.v1
time="2024-05-17T17:25:09.010260400Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
time="2024-05-17T17:25:09.010266861Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.zfs\"..." type=io.containerd.snapshotter.v1
time="2024-05-17T17:25:09.010622750Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="path /var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
time="2024-05-17T17:25:09.010630854Z" level=info msg="loading plugin \"io.containerd.metadata.v1.bolt\"..." type=io.containerd.metadata.v1
time="2024-05-17T17:25:09.010644404Z" level=warning msg="could not use snapshotter devmapper in metadata plugin" error="devmapper not configured"
time="2024-05-17T17:25:09.010651221Z" level=info msg="metadata content store policy set" policy=shared
time="2024-05-17T17:25:09.010721653Z" level=info msg="loading plugin \"io.containerd.differ.v1.walking\"..." type=io.containerd.differ.v1
time="2024-05-17T17:25:09.010733232Z" level=info msg="loading plugin \"io.containerd.event.v1.exchange\"..." type=io.containerd.event.v1
time="2024-05-17T17:25:09.010740647Z" level=info msg="loading plugin \"io.containerd.gc.v1.scheduler\"..." type=io.containerd.gc.v1
time="2024-05-17T17:25:09.010755957Z" level=info msg="loading plugin \"io.containerd.lease.v1.manager\"..." type=io.containerd.lease.v1
time="2024-05-17T17:25:09.010764472Z" level=info msg="loading plugin \"io.containerd.nri.v1.nri\"..." type=io.containerd.nri.v1
time="2024-05-17T17:25:09.010772024Z" level=info msg="NRI interface is disabled by configuration."
time="2024-05-17T17:25:09.010779661Z" level=info msg="loading plugin \"io.containerd.runtime.v2.task\"..." type=io.containerd.runtime.v2
time="2024-05-17T17:25:09.010808173Z" level=info msg="loading plugin \"io.containerd.runtime.v2.shim\"..." type=io.containerd.runtime.v2
time="2024-05-17T17:25:09.010821749Z" level=info msg="loading plugin \"io.containerd.sandbox.store.v1.local\"..." type=io.containerd.sandbox.store.v1
time="2024-05-17T17:25:09.010830572Z" level=info msg="loading plugin \"io.containerd.sandbox.controller.v1.local\"..." type=io.containerd.sandbox.controller.v1
time="2024-05-17T17:25:09.010839196Z" level=info msg="loading plugin \"io.containerd.streaming.v1.manager\"..." type=io.containerd.streaming.v1
time="2024-05-17T17:25:09.010849132Z" level=info msg="loading plugin \"io.containerd.service.v1.introspection-service\"..." type=io.containerd.service.v1
time="2024-05-17T17:25:09.010860246Z" level=info msg="loading plugin \"io.containerd.service.v1.containers-service\"..." type=io.containerd.service.v1
time="2024-05-17T17:25:09.010869336Z" level=info msg="loading plugin \"io.containerd.service.v1.content-service\"..." type=io.containerd.service.v1
time="2024-05-17T17:25:09.010877158Z" level=info msg="loading plugin \"io.containerd.service.v1.diff-service\"..." type=io.containerd.service.v1
time="2024-05-17T17:25:09.010886121Z" level=info msg="loading plugin \"io.containerd.service.v1.images-service\"..." type=io.containerd.service.v1
time="2024-05-17T17:25:09.010894842Z" level=info msg="loading plugin \"io.containerd.service.v1.namespaces-service\"..." type=io.containerd.service.v1
time="2024-05-17T17:25:09.010902479Z" level=info msg="loading plugin \"io.containerd.service.v1.snapshots-service\"..." type=io.containerd.service.v1
time="2024-05-17T17:25:09.010909703Z" level=info msg="loading plugin \"io.containerd.runtime.v1.linux\"..." type=io.containerd.runtime.v1
time="2024-05-17T17:25:09.010930136Z" level=info msg="loading plugin \"io.containerd.monitor.v1.cgroups\"..." type=io.containerd.monitor.v1
time="2024-05-17T17:25:09.011115609Z" level=info msg="loading plugin \"io.containerd.service.v1.tasks-service\"..." type=io.containerd.service.v1
time="2024-05-17T17:25:09.011130514Z" level=info msg="loading plugin \"io.containerd.grpc.v1.introspection\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011139884Z" level=info msg="loading plugin \"io.containerd.transfer.v1.local\"..." type=io.containerd.transfer.v1
time="2024-05-17T17:25:09.011153616Z" level=info msg="loading plugin \"io.containerd.internal.v1.restart\"..." type=io.containerd.internal.v1
time="2024-05-17T17:25:09.011187868Z" level=info msg="loading plugin \"io.containerd.grpc.v1.containers\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011197439Z" level=info msg="loading plugin \"io.containerd.grpc.v1.content\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011205617Z" level=info msg="loading plugin \"io.containerd.grpc.v1.diff\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011214455Z" level=info msg="loading plugin \"io.containerd.grpc.v1.events\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011222374Z" level=info msg="loading plugin \"io.containerd.grpc.v1.healthcheck\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011229668Z" level=info msg="loading plugin \"io.containerd.grpc.v1.images\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011237420Z" level=info msg="loading plugin \"io.containerd.grpc.v1.leases\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011244316Z" level=info msg="loading plugin \"io.containerd.grpc.v1.namespaces\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011253516Z" level=info msg="loading plugin \"io.containerd.internal.v1.opt\"..." type=io.containerd.internal.v1
time="2024-05-17T17:25:09.011288181Z" level=info msg="loading plugin \"io.containerd.grpc.v1.sandbox-controllers\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011296829Z" level=info msg="loading plugin \"io.containerd.grpc.v1.sandboxes\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011304315Z" level=info msg="loading plugin \"io.containerd.grpc.v1.snapshots\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011312018Z" level=info msg="loading plugin \"io.containerd.grpc.v1.streaming\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011320285Z" level=info msg="loading plugin \"io.containerd.grpc.v1.tasks\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011328474Z" level=info msg="loading plugin \"io.containerd.grpc.v1.transfer\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011336703Z" level=info msg="loading plugin \"io.containerd.grpc.v1.version\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011343482Z" level=info msg="loading plugin \"io.containerd.grpc.v1.cri\"..." type=io.containerd.grpc.v1
time="2024-05-17T17:25:09.011508423Z" level=info msg="Start cri plugin with config {PluginConfig:{ContainerdConfig:{Snapshotter:overlayfs DefaultRuntimeName:runc DefaultRuntime:{Type: Path: Engine: PodAnnotations:[] ContainerAnnotations:[] Root: Options:map[] PrivilegedWithoutHostDevices:false PrivilegedWithoutHostDevicesAllDevicesAllowed:false BaseRuntimeSpec: NetworkPluginConfDir: NetworkPluginMaxConfNum:0 Snapshotter: SandboxMode:} UntrustedWorkloadRuntime:{Type: Path: Engine: PodAnnotations:[] ContainerAnnotations:[] Root: Options:map[] PrivilegedWithoutHostDevices:false PrivilegedWithoutHostDevicesAllDevicesAllowed:false BaseRuntimeSpec: NetworkPluginConfDir: NetworkPluginMaxConfNum:0 Snapshotter: SandboxMode:} Runtimes:map[crun:{Type:io.containerd.runc.v2 Path: Engine: PodAnnotations:[] ContainerAnnotations:[] Root: Options:map[BinaryName:/usr/bin/crun SystemdCgroup:false] PrivilegedWithoutHostDevices:false PrivilegedWithoutHostDevicesAllDevicesAllowed:false BaseRuntimeSpec: NetworkPluginConfDir: NetworkPluginMaxConfNum:0 Snapshotter: SandboxMode:podsandbox} runc:{Type:io.containerd.runc.v2 Path: Engine: PodAnnotations:[] ContainerAnnotations:[] Root: Options:map[SystemdCgroup:false] PrivilegedWithoutHostDevices:false PrivilegedWithoutHostDevicesAllDevicesAllowed:false BaseRuntimeSpec: NetworkPluginConfDir: NetworkPluginMaxConfNum:0 Snapshotter: SandboxMode:podsandbox}] NoPivot:false DisableSnapshotAnnotations:true DiscardUnpackedLayers:false IgnoreBlockIONotEnabledErrors:false IgnoreRdtNotEnabledErrors:false} CniConfig:{NetworkPluginBinDir:/opt/cni/bin NetworkPluginConfDir:/etc/cni/net.d NetworkPluginMaxConfNum:1 NetworkPluginSetupSerially:false NetworkPluginConfTemplate: IPPreference:} Registry:{ConfigPath: Mirrors:map[] Configs:map[] Auths:map[] Headers:map[]} ImageDecryption:{KeyModel:node} DisableTCPService:true StreamServerAddress:127.0.0.1 StreamServerPort:10010 StreamIdleTimeout:4h0m0s EnableSelinux:false SelinuxCategoryRange:1024 SandboxImage:index.docker.io/rancher/mirrored-pause:3.6 StatsCollectPeriod:10 SystemdCgroup:false EnableTLSStreaming:false X509KeyPairStreaming:{TLSCertFile: TLSKeyFile:} MaxContainerLogLineSize:16384 DisableCgroup:true DisableApparmor:true RestrictOOMScoreAdj:true MaxConcurrentDownloads:3 DisableProcMount:false UnsetSeccompProfile: TolerateMissingHugetlbController:true DisableHugetlbController:true DeviceOwnershipFromSecurityContext:false IgnoreImageDefinedVolumes:false NetNSMountsUnderStateDir:false EnableUnprivilegedPorts:true EnableUnprivilegedICMP:true EnableCDI:false CDISpecDirs:[/etc/cdi /var/run/cdi] ImagePullProgressTimeout:5m0s DrainExecSyncIOTimeout:0s} ContainerdRootDir:/var/lib/rancher/rke2/agent/containerd ContainerdEndpoint:/run/k3s/containerd/containerd.sock RootDir:/var/lib/rancher/rke2/agent/containerd/io.containerd.grpc.v1.cri StateDir:/run/k3s/containerd/io.containerd.grpc.v1.cri}"
time="2024-05-17T17:25:09.011545612Z" level=info msg="Connect containerd service"
time="2024-05-17T17:25:09.011564963Z" level=info msg="using legacy CRI server"
time="2024-05-17T17:25:09.011569354Z" level=info msg="using experimental NRI integration - disable nri plugin to prevent this"
time="2024-05-17T17:25:09.011585312Z" level=info msg="Get image filesystem path \"/var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.overlayfs\""
time="2024-05-17T17:25:09.011898668Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
time="2024-05-17T17:25:09.011921213Z" level=info msg="loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." type=io.containerd.tracing.processor.v1
time="2024-05-17T17:25:09.011931496Z" level=info msg="skip loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." error="no OpenTelemetry endpoint: skip plugin" type=io.containerd.tracing.processor.v1
time="2024-05-17T17:25:09.011939008Z" level=info msg="loading plugin \"io.containerd.internal.v1.tracing\"..." type=io.containerd.internal.v1
time="2024-05-17T17:25:09.011945596Z" level=info msg="skipping tracing processor initialization (no tracing plugin)" error="no OpenTelemetry endpoint: skip plugin"
time="2024-05-17T17:25:09.012019586Z" level=info msg="Start subscribing containerd event"
time="2024-05-17T17:25:09.012073412Z" level=info msg="Start recovering state"
time="2024-05-17T17:25:09.012113650Z" level=info msg=serving... address=/run/k3s/containerd/containerd.sock.ttrpc
time="2024-05-17T17:25:09.012124904Z" level=info msg="Start event monitor"
time="2024-05-17T17:25:09.012137613Z" level=info msg=serving... address=/run/k3s/containerd/containerd.sock
time="2024-05-17T17:25:09.012157611Z" level=info msg="Start snapshots syncer"
time="2024-05-17T17:25:09.012168517Z" level=info msg="Start cni network conf syncer for default"
time="2024-05-17T17:25:09.012175658Z" level=info msg="Start streaming server"
time="2024-05-17T17:25:09.012198908Z" level=info msg="containerd successfully booted in 0.034005s"
time="2024-05-17T17:25:39.885678825Z" level=info msg="ImageCreate event name:\"docker.io/rancher/mirrored-pause:3.6\""
time="2024-05-17T17:25:39.887937512Z" level=info msg="ImageCreate event name:\"docker.io/rancher/nginx-ingress-controller:nginx-1.9.3-hardened1\""
time="2024-05-17T17:25:39.889708677Z" level=info msg="ImageCreate event name:\"sha256:6270bb605e12e581514ada5fd5b3216f727db55dc87d5889c790e4c760683fee\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.889909715Z" level=info msg="ImageCreate event name:\"docker.io/rancher/mirrored-sig-storage-snapshot-controller:v6.2.1\""
time="2024-05-17T17:25:39.890648876Z" level=info msg="ImageUpdate event name:\"docker.io/rancher/mirrored-pause:3.6\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.890858220Z" level=info msg="ImageCreate event name:\"docker.io/rancher/hardened-dns-node-cache:1.22.20-build20231010\""
time="2024-05-17T17:25:39.891746006Z" level=info msg="ImageCreate event name:\"sha256:bfdece8fa3f1449a2b25c12f3e375c57258a6cd4d925f7983177f5f652afc885\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.892528556Z" level=info msg="ImageCreate event name:\"docker.io/rancher/klipper-helm:v0.8.2-build20230815\""
time="2024-05-17T17:25:39.893377005Z" level=info msg="ImageCreate event name:\"docker.io/rancher/klipper-lb:v0.4.4\""
time="2024-05-17T17:25:39.894236733Z" level=info msg="ImageUpdate event name:\"docker.io/rancher/nginx-ingress-controller:nginx-1.9.3-hardened1\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.894961867Z" level=info msg="ImageCreate event name:\"docker.io/rancher/hardened-kubernetes:v1.29.0-rke2r1-build20231213\""
time="2024-05-17T17:25:39.895785614Z" level=info msg="ImageCreate event name:\"docker.io/rancher/hardened-etcd:v3.5.9-k3s1-build20230802\""
time="2024-05-17T17:25:39.896500868Z" level=info msg="ImageCreate event name:\"docker.io/rancher/hardened-k8s-metrics-server:v0.6.3-build20231009\""
time="2024-05-17T17:25:39.897213489Z" level=info msg="ImageCreate event name:\"sha256:1ef6c138bd5f2ac45f7b4ee54db0e513efad8576909ae9829ba649fb4b067388\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.897430827Z" level=info msg="ImageCreate event name:\"docker.io/rancher/rke2-runtime:v1.29.0-rke2r1\""
time="2024-05-17T17:25:39.897881148Z" level=info msg="ImageUpdate event name:\"docker.io/rancher/mirrored-sig-storage-snapshot-controller:v6.2.1\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.898105338Z" level=info msg="ImageCreate event name:\"docker.io/rancher/hardened-coredns:v1.10.1-build20231009\""
time="2024-05-17T17:25:39.898776753Z" level=info msg="ImageCreate event name:\"docker.io/rancher/rke2-cloud-provider:v1.28.2-build20231016\""
time="2024-05-17T17:25:39.899440922Z" level=info msg="ImageCreate event name:\"sha256:760c20ed3d8830c73e49f33e855b2c5c4a9ac37cac1a4965334ae875839f206c\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.899770265Z" level=info msg="ImageCreate event name:\"docker.io/rancher/hardened-cluster-autoscaler:v1.8.6-build20231009\""
time="2024-05-17T17:25:39.900473331Z" level=info msg="ImageUpdate event name:\"docker.io/rancher/hardened-dns-node-cache:1.22.20-build20231010\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.900801503Z" level=info msg="ImageCreate event name:\"docker.io/rancher/mirrored-ingress-nginx-kube-webhook-certgen:v20230312-helm-chart-4.5.2-28-g66a760794\""
time="2024-05-17T17:25:39.901350607Z" level=info msg="ImageCreate event name:\"docker.io/rancher/mirrored-sig-storage-snapshot-validation-webhook:v6.2.2\""
time="2024-05-17T17:25:39.902161691Z" level=info msg="ImageCreate event name:\"sha256:5f89cb8137ccbd39377d91b9d75faf4ec4ee0a2d2a3a63635535b10c69c935fa\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.902453902Z" level=info msg="ImageUpdate event name:\"docker.io/rancher/klipper-helm:v0.8.2-build20230815\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.902736405Z" level=info msg="ImageCreate event name:\"sha256:af74bd845c4a83b7e4fa48e0c5a91dcda8843f586794fbb8b7f4bb7ed9e8cc56\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.903010526Z" level=info msg="ImageUpdate event name:\"docker.io/rancher/klipper-lb:v0.4.4\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.903302675Z" level=info msg="ImageCreate event name:\"sha256:ab7fea56055ba0fa4a1840780b4d52ae7143945acf79dde1069edeeffa9461c6\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.903657387Z" level=info msg="ImageUpdate event name:\"docker.io/rancher/hardened-kubernetes:v1.29.0-rke2r1-build20231213\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.903977464Z" level=info msg="ImageCreate event name:\"sha256:c6b7a4f2f79b24f9310e769ce7c1e0caba47fbf2d03a2025b19bee2090dae94d\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.904344077Z" level=info msg="ImageUpdate event name:\"docker.io/rancher/hardened-etcd:v3.5.9-k3s1-build20230802\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.904704178Z" level=info msg="ImageCreate event name:\"sha256:9051c01da7c50cf471beced1c737e13984760ddaf7288ab52f852a7009e95246\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.904968240Z" level=info msg="ImageUpdate event name:\"docker.io/rancher/hardened-k8s-metrics-server:v0.6.3-build20231009\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.905227716Z" level=info msg="ImageCreate event name:\"sha256:4f24500e130755055472a1d5d0cf3d76c782d3ac86dce1bcb45d037a8b31a3f5\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.905384556Z" level=info msg="ImageUpdate event name:\"docker.io/rancher/rke2-runtime:v1.29.0-rke2r1\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.905550419Z" level=info msg="ImageCreate event name:\"sha256:99513a1cabd6fe947a3df0eb03f3d22ef8fbfba91e555d78959eefd20201de65\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.905815524Z" level=info msg="ImageUpdate event name:\"docker.io/rancher/hardened-coredns:v1.10.1-build20231009\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.906097146Z" level=info msg="ImageCreate event name:\"sha256:00bad7e3c2263e3ef9e6231031a05929d9e2a56c6b87c89ad8cae6cc8ef8befe\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.906350885Z" level=info msg="ImageUpdate event name:\"docker.io/rancher/rke2-cloud-provider:v1.28.2-build20231016\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.906597685Z" level=info msg="ImageCreate event name:\"sha256:1b5aedc400f2a025e0bb0eae8ff29226ea759831d02c1afce4dababc607a9c74\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.906856905Z" level=info msg="ImageUpdate event name:\"docker.io/rancher/hardened-cluster-autoscaler:v1.8.6-build20231009\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.907111737Z" level=info msg="ImageCreate event name:\"sha256:5a86b03a88d2316e2317c2576449a95ddbd105d69b2fe7b01d667b0ebab37422\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.907308590Z" level=info msg="ImageUpdate event name:\"docker.io/rancher/mirrored-ingress-nginx-kube-webhook-certgen:v20230312-helm-chart-4.5.2-28-g66a760794\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.907513955Z" level=info msg="ImageCreate event name:\"sha256:ff52c2bcf9f8893ac479bade578b25e9f4315173bcba6f605ca94a4c7ab84235\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
time="2024-05-17T17:25:39.907869479Z" level=info msg="ImageUpdate event name:\"docker.io/rancher/mirrored-sig-storage-snapshot-validation-webhook:v6.2.2\"  labels:{key:\"io.cri-containerd.image\"  value:\"managed\"}"
brandond commented 4 months ago

Please attach the complete rke2-server log from journald, as well as the kubelet log. The information you've provided is not sufficient to determine why pods aren't being started by the kubelet.
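
As a sketch of how those might be collected on the node (the kubelet log path shown is the RKE2 default location and an assumption here):

# rke2-server service log from journald
journalctl -u rke2-server --no-pager > rke2-server.log
# kubelet log in its default RKE2 location
cat /var/lib/rancher/rke2/agent/logs/kubelet.log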

agaxprex commented 4 months ago

kubelet.log

Flag --volume-plugin-dir has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --file-check-frequency has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --sync-frequency has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --address has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --anonymous-auth has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --authentication-token-webhook has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --authorization-mode has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --client-ca-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --cluster-dns has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --cluster-domain has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --container-runtime-endpoint has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.
Flag --eviction-hard has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --eviction-minimum-reclaim has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --fail-swap-on has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --feature-gates has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --healthz-bind-address has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --kubelet-cgroups has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --pod-infra-container-image has been deprecated, will be removed in a future release. Image garbage collector will get sandbox image information from CRI.
Flag --pod-manifest-path has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --read-only-port has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --serialize-image-pulls has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --tls-cert-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --tls-private-key-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
I0517 21:05:16.356981   28157 server.go:204] "--pod-infra-container-image will not be pruned by the image garbage collector in kubelet and should also be set in the remote runtime"
I0517 21:05:16.358658   28157 server.go:487] "Kubelet version" kubeletVersion="v1.29.0+rke2r1"
I0517 21:05:16.358670   28157 server.go:489] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I0517 21:05:16.360876   28157 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/var/lib/rancher/rke2/agent/client-ca.crt"
I0517 21:05:16.388790   28157 server.go:745] "--cgroups-per-qos enabled, but --cgroup-root was not specified.  defaulting to /"
I0517 21:05:16.389246   28157 container_manager_linux.go:265] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
I0517 21:05:16.389354   28157 container_manager_linux.go:270] "Creating Container Manager object based on Node Config" nodeConfig={"RuntimeCgroupsName":"","SystemCgroupsName":"","KubeletCgroupsName":"/rke2","KubeletOOMScoreAdj":-999,"ContainerRuntime":"","CgroupsPerQOS":true,"CgroupRoot":"/","CgroupDriver":"cgroupfs","KubeletRootDir":"/var/lib/kubelet","ProtectKernelDefaults":false,"KubeReservedCgroupName":"","SystemReservedCgroupName":"","ReservedSystemCPUs":{},"EnforceNodeAllocatable":{"pods":{}},"KubeReserved":null,"SystemReserved":null,"HardEvictionThresholds":[{"Signal":"imagefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null},{"Signal":"nodefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null}],"QOSReserved":{},"CPUManagerPolicy":"none","CPUManagerPolicyOptions":null,"TopologyManagerScope":"container","CPUManagerReconcilePeriod":10000000000,"ExperimentalMemoryManagerPolicy":"None","ExperimentalMemoryManagerReservedMemory":null,"PodPidsLimit":-1,"EnforceCPULimits":true,"CPUCFSQuotaPeriod":100000000,"TopologyManagerPolicy":"none","TopologyManagerPolicyOptions":null}
I0517 21:05:16.389368   28157 topology_manager.go:138] "Creating topology manager with none policy"
I0517 21:05:16.389375   28157 container_manager_linux.go:301] "Creating device plugin manager"
I0517 21:05:16.389394   28157 state_mem.go:36] "Initialized new in-memory state store"
I0517 21:05:16.589790   28157 server.go:863] "Failed to ApplyOOMScoreAdj" err="write /proc/self/oom_score_adj: permission denied"
I0517 21:05:16.589837   28157 kubelet.go:396] "Attempting to sync node with API server"
I0517 21:05:16.589848   28157 kubelet.go:301] "Adding static pod path" path="/var/lib/rancher/rke2/agent/pod-manifests"
I0517 21:05:16.589863   28157 kubelet.go:312] "Adding apiserver pod source"
I0517 21:05:16.589873   28157 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
I0517 21:05:16.590147   28157 kuberuntime_manager.go:258] "Container runtime initialized" containerRuntime="containerd" version="v1.7.11-k3s2" apiVersion="v1"
W0517 21:05:16.590165   28157 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.Node: Get "https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3Dcontrol-node-0&limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0517 21:05:16.590198   28157 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3Dcontrol-node-0&limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
W0517 21:05:16.590235   28157 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0517 21:05:16.590260   28157 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
I0517 21:05:16.590260   28157 kubelet.go:809] "Not starting ClusterTrustBundle informer because we are in static kubelet mode"
E0517 21:05:16.590500   28157 server.go:1245] "Failed to set rlimit on max file handles" err="operation not permitted"
I0517 21:05:16.590509   28157 server.go:1256] "Started kubelet"
I0517 21:05:16.590591   28157 server.go:162] "Starting to listen" address="0.0.0.0" port=10250
I0517 21:05:16.590739   28157 ratelimit.go:55] "Setting rate limiting for endpoint" service="podresources" qps=100 burstTokens=10
E0517 21:05:16.590781   28157 event.go:355] "Unable to write event (may retry after sleeping)" err="Post \"https://127.0.0.1:6443/api/v1/namespaces/default/events\": dial tcp 127.0.0.1:6443: connect: connection refused" event="&Event{ObjectMeta:{control-node-0.17d062a47a149f57  default    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:control-node-0,UID:control-node-0,APIVersion:,ResourceVersion:,FieldPath:,},Reason:Starting,Message:Starting kubelet.,Source:EventSource{Component:kubelet,Host:control-node-0,},FirstTimestamp:2024-05-17 21:05:16.590489431 +0000 UTC m=+0.259499249,LastTimestamp:2024-05-17 21:05:16.590489431 +0000 UTC m=+0.259499249,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:kubelet,ReportingInstance:control-node-0,}"
I0517 21:05:16.590824   28157 server.go:233] "Starting to serve the podresources API" endpoint="unix:/var/lib/kubelet/pod-resources/kubelet.sock"
E0517 21:05:16.591404   28157 kubelet.go:1462] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
I0517 21:05:16.591659   28157 server.go:461] "Adding debug handlers to kubelet server"
I0517 21:05:16.592274   28157 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
I0517 21:05:16.592380   28157 volume_manager.go:291] "Starting Kubelet Volume Manager"
I0517 21:05:16.592431   28157 desired_state_of_world_populator.go:151] "Desired state populator starts to run"
I0517 21:05:16.593286   28157 reconciler_new.go:29] "Reconciler: start to sync state"
I0517 21:05:16.593317   28157 factory.go:221] Registration of the systemd container factory successfully
I0517 21:05:16.593388   28157 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file or directory
E0517 21:05:16.593796   28157 controller.go:145] "Failed to ensure lease exists, will retry" err="Get \"https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/control-node-0?timeout=10s\": dial tcp 127.0.0.1:6443: connect: connection refused" interval="200ms"
W0517 21:05:16.593567   28157 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.CSIDriver: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0517 21:05:16.593819   28157 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.CSIDriver: failed to list *v1.CSIDriver: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
I0517 21:05:16.594315   28157 factory.go:221] Registration of the containerd container factory successfully
I0517 21:05:16.600648   28157 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv4"
I0517 21:05:16.601721   28157 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv6"
I0517 21:05:16.601738   28157 status_manager.go:217] "Starting to sync pod status with apiserver"
I0517 21:05:16.601749   28157 kubelet.go:2329] "Starting kubelet main sync loop"
E0517 21:05:16.601772   28157 kubelet.go:2353] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
W0517 21:05:16.601991   28157 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.RuntimeClass: Get "https://127.0.0.1:6443/apis/node.k8s.io/v1/runtimeclasses?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0517 21:05:16.602015   28157 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.RuntimeClass: failed to list *v1.RuntimeClass: Get "https://127.0.0.1:6443/apis/node.k8s.io/v1/runtimeclasses?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
I0517 21:05:16.640298   28157 cpu_manager.go:214] "Starting CPU manager" policy="none"
I0517 21:05:16.640309   28157 cpu_manager.go:215] "Reconciling" reconcilePeriod="10s"
I0517 21:05:16.640322   28157 state_mem.go:36] "Initialized new in-memory state store"
I0517 21:05:16.640421   28157 state_mem.go:88] "Updated default CPUSet" cpuSet=""
I0517 21:05:16.640440   28157 state_mem.go:96] "Updated CPUSet assignments" assignments={}
I0517 21:05:16.640447   28157 policy_none.go:49] "None policy: Start"
I0517 21:05:16.640698   28157 memory_manager.go:170] "Starting memorymanager" policy="None"
I0517 21:05:16.640710   28157 state_mem.go:35] "Initializing new in-memory state store"
I0517 21:05:16.640816   28157 state_mem.go:75] "Updated machine memory state"
E0517 21:05:16.642305   28157 node_container_manager_linux.go:61] "Failed to create cgroup" err="mkdir /sys/fs/cgroup/cpuset/kubepods: permission denied" cgroupName=["kubepods"]
E0517 21:05:16.642316   28157 kubelet.go:1542] "Failed to start ContainerManager" err="mkdir /sys/fs/cgroup/cpuset/kubepods: permission denied"

journalctl -u rke2-server output attached: journalctl.txt

Note: I'm running rke2 as the root user, so I'm not sure why the permission-denied errors occur. SELinux is disabled for this testing.

brandond commented 4 months ago
E0517 21:05:16.642305   28157 node_container_manager_linux.go:61] "Failed to create cgroup" err="mkdir /sys/fs/cgroup/cpuset/kubepods: permission denied" cgroupName=["kubepods"]
E0517 21:05:16.642316   28157 kubelet.go:1542] "Failed to start ContainerManager" err="mkdir /sys/fs/cgroup/cpuset/kubepods: permission denied"

Is selinux enabled on this node? If you're running rke2 on an selinux-enabled host, you must install the rke2-selinux package and ensure that the correct contexts are set. Normally this is handled by the rke2 RPMs, but if you're using airgap and a tarball install you will need to do this manually.
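
A minimal sketch of that manual step for an airgapped tarball install, assuming the rke2-selinux RPM has already been copied onto the node (package file name is illustrative):

# confirm the SELinux state first
getenforce
# install the policy package from the locally staged RPM
dnf install -y ./rke2-selinux*.rpm
# re-apply file contexts to the rke2 install and data directories
restorecon -R -v /usr/local/bin/rke2 /var/lib/rancher /etc/rancher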

This sounds like a duplicate of

agaxprex commented 4 months ago

I disabled SELinux, but it seems kubelet is still crashing; it no longer looks like a permissions issue, but rather something with the cgroup driver. The system uses cgroups v2; would that prevent kubelet from starting the ContainerManager?

E0518 19:48:52.148026  733385 cgroup_manager_linux.go:476] cgroup manager.Set failed: openat2 /sys/fs/cgroup/kubepods.slice/cpu.weight: no such file or directory
E0518 19:48:52.148149  733385 kubelet.go:1542] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"

Full log posted: kubelet.log

brandond commented 4 months ago

Are you missing critical cgroups? What is the output of grep cgroup /proc/mounts and cat /proc/cgroups?

agaxprex commented 4 months ago

I think I'm missing some associated with cpu; I'm not familiar with which ones kubelet requires or how to enable them. Currently using cgroups v2. The system did previously have K3s running on it; I'm just testing a migration to RKE2. Not sure if K3s altered what was enabled.

$ cat /proc/cgroups
#subsys_name    hierarchy       num_cgroups     enabled
cpuset  0       1041    1
cpu     0       1041    1
cpuacct 0       1041    1
blkio   0       1041    1
memory  0       1041    1
devices 0       1041    1
freezer 0       1041    1
net_cls 0       1041    1
perf_event      0       1041    1
net_prio        0       1041    1
hugetlb 0       1041    1
pids    0       1041    1
rdma    0       1041    1
$ grep cgroup /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
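
As an aside not taken from the thread: on a cgroup v2 host, /proc/cgroups does not show whether controllers are delegated to child cgroups. A quick check that cpu and cpuset (which the kubepods.slice/cpu.weight error above depends on) are available and enabled:

# controllers the root cgroup exposes
cat /sys/fs/cgroup/cgroup.controllers
# controllers delegated to children such as kubepods.slice; cpu and cpuset must be listed here
cat /sys/fs/cgroup/cgroup.subtree_control
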
brandond commented 4 months ago

The system I'm in did previously have K3s running on it, just testing a migration to RKE2.

Have you run the k3s uninstall script and rebooted the node since it was running k3s?
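
For reference, the script-based k3s install places its uninstaller at a fixed path; a typical cleanup (assumed here, not quoted from the thread) looks like:

# remove k3s, its data, and its systemd units, then reboot before installing rke2
/usr/local/bin/k3s-uninstall.sh
reboot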

agaxprex commented 4 months ago

Yep, it was uninstalled via the script and the node was rebooted before installing RKE2.

brandond commented 4 months ago

mkdir /sys/fs/cgroup/cpuset/kubepods: permission denied

The failure to create the root cgroup seems to be the critical error. Do you have any other security modules present on this system that would prevent the kubelet from creating new cgroups?
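
A few generic checks for that (not from the thread; standard EL8 tooling):

# SELinux mode and overall status
getenforce
sestatus
# confinement label of the current process; a restricted domain here could explain the denials
cat /proc/self/attr/current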

agaxprex commented 4 months ago

I fixed the previous error, but got a new one. The earlier error no longer appears in the second kubelet.log I uploaded; there is a new one instead. Here is the updated log:

Flag --volume-plugin-dir has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --file-check-frequency has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --sync-frequency has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --address has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --anonymous-auth has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --authentication-token-webhook has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --authorization-mode has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --client-ca-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --cluster-dns has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --cluster-domain has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --container-runtime-endpoint has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.
Flag --eviction-hard has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --eviction-minimum-reclaim has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --fail-swap-on has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --feature-gates has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --healthz-bind-address has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --pod-infra-container-image has been deprecated, will be removed in a future release. Image garbage collector will get sandbox image information from CRI.
Flag --pod-manifest-path has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --read-only-port has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --serialize-image-pulls has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --tls-cert-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --tls-private-key-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
I0518 19:48:51.965331  733385 server.go:204] "--pod-infra-container-image will not be pruned by the image garbage collector in kubelet and should also be set in the remote runtime"
I0518 19:48:51.967025  733385 server.go:487] "Kubelet version" kubeletVersion="v1.29.0+rke2r1"
I0518 19:48:51.967037  733385 server.go:489] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I0518 19:48:51.968569  733385 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/var/lib/rancher/rke2/agent/client-ca.crt"
I0518 19:48:52.047736  733385 server.go:745] "--cgroups-per-qos enabled, but --cgroup-root was not specified.  defaulting to /"
I0518 19:48:52.047849  733385 container_manager_linux.go:265] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
I0518 19:48:52.047972  733385 container_manager_linux.go:270] "Creating Container Manager object based on Node Config" nodeConfig={"RuntimeCgroupsName":"","SystemCgroupsName":"","KubeletCgroupsName":"","KubeletOOMScoreAdj":-999,"ContainerRuntime":"","CgroupsPerQOS":true,"CgroupRoot":"/","CgroupDriver":"systemd","KubeletRootDir":"/var/lib/kubelet","ProtectKernelDefaults":false,"KubeReservedCgroupName":"","SystemReservedCgroupName":"","ReservedSystemCPUs":{},"EnforceNodeAllocatable":{"pods":{}},"KubeReserved":null,"SystemReserved":null,"HardEvictionThresholds":[{"Signal":"nodefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null},{"Signal":"imagefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null}],"QOSReserved":{},"CPUManagerPolicy":"none","CPUManagerPolicyOptions":null,"TopologyManagerScope":"container","CPUManagerReconcilePeriod":10000000000,"ExperimentalMemoryManagerPolicy":"None","ExperimentalMemoryManagerReservedMemory":null,"PodPidsLimit":-1,"EnforceCPULimits":true,"CPUCFSQuotaPeriod":100000000,"TopologyManagerPolicy":"none","TopologyManagerPolicyOptions":null}
I0518 19:48:52.047988  733385 topology_manager.go:138] "Creating topology manager with none policy"
I0518 19:48:52.047994  733385 container_manager_linux.go:301] "Creating device plugin manager"
I0518 19:48:52.048013  733385 state_mem.go:36] "Initialized new in-memory state store"
I0518 19:48:52.048087  733385 kubelet.go:396] "Attempting to sync node with API server"
I0518 19:48:52.048097  733385 kubelet.go:301] "Adding static pod path" path="/var/lib/rancher/rke2/agent/pod-manifests"
I0518 19:48:52.048115  733385 kubelet.go:312] "Adding apiserver pod source"
I0518 19:48:52.048125  733385 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
I0518 19:48:52.048474  733385 kuberuntime_manager.go:258] "Container runtime initialized" containerRuntime="containerd" version="v1.7.11-k3s2" apiVersion="v1"
W0518 19:48:52.048473  733385 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0518 19:48:52.048509  733385 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
W0518 19:48:52.048498  733385 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.Node: Get "https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3De6le100&limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0518 19:48:52.048525  733385 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3De6le100&limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
I0518 19:48:52.048582  733385 kubelet.go:809] "Not starting ClusterTrustBundle informer because we are in static kubelet mode"
I0518 19:48:52.048814  733385 server.go:1256] "Started kubelet"
I0518 19:48:52.048850  733385 server.go:162] "Starting to listen" address="0.0.0.0" port=10250
I0518 19:48:52.048880  733385 ratelimit.go:55] "Setting rate limiting for endpoint" service="podresources" qps=100 burstTokens=10
I0518 19:48:52.049000  733385 server.go:233] "Starting to serve the podresources API" endpoint="unix:/var/lib/kubelet/pod-resources/kubelet.sock"
E0518 19:48:52.049121  733385 event.go:355] "Unable to write event (may retry after sleeping)" err="Post \"https://127.0.0.1:6443/api/v1/namespaces/default/events\": dial tcp 127.0.0.1:6443: connect: connection refused" event="&Event{ObjectMeta:{e6le100.17d0ba26624b8df1  default    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:e6le100,UID:e6le100,APIVersion:,ResourceVersion:,FieldPath:,},Reason:Starting,Message:Starting kubelet.,Source:EventSource{Component:kubelet,Host:e6le100,},FirstTimestamp:2024-05-18 19:48:52.048801265 -0400 EDT m=+0.110198176,LastTimestamp:2024-05-18 19:48:52.048801265 -0400 EDT m=+0.110198176,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:kubelet,ReportingInstance:e6le100,}"
I0518 19:48:52.049765  733385 server.go:461] "Adding debug handlers to kubelet server"
E0518 19:48:52.049883  733385 kubelet.go:1462] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
I0518 19:48:52.049890  733385 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
I0518 19:48:52.049929  733385 volume_manager.go:291] "Starting Kubelet Volume Manager"
I0518 19:48:52.049986  733385 desired_state_of_world_populator.go:151] "Desired state populator starts to run"
I0518 19:48:52.050005  733385 reconciler_new.go:29] "Reconciler: start to sync state"
E0518 19:48:52.050212  733385 controller.go:145] "Failed to ensure lease exists, will retry" err="Get \"https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/e6le100?timeout=10s\": dial tcp 127.0.0.1:6443: connect: connection refused" interval="200ms"
I0518 19:48:52.050302  733385 factory.go:221] Registration of the systemd container factory successfully
I0518 19:48:52.050362  733385 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file or directory
W0518 19:48:52.050526  733385 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.CSIDriver: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0518 19:48:52.050567  733385 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.CSIDriver: failed to list *v1.CSIDriver: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
I0518 19:48:52.052015  733385 factory.go:221] Registration of the containerd container factory successfully
I0518 19:48:52.057899  733385 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv4"
I0518 19:48:52.058875  733385 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv6"
I0518 19:48:52.058889  733385 status_manager.go:217] "Starting to sync pod status with apiserver"
I0518 19:48:52.058898  733385 kubelet.go:2329] "Starting kubelet main sync loop"
E0518 19:48:52.058928  733385 kubelet.go:2353] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
W0518 19:48:52.059112  733385 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.RuntimeClass: Get "https://127.0.0.1:6443/apis/node.k8s.io/v1/runtimeclasses?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
E0518 19:48:52.059148  733385 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.RuntimeClass: failed to list *v1.RuntimeClass: Get "https://127.0.0.1:6443/apis/node.k8s.io/v1/runtimeclasses?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
I0518 19:48:52.109545  733385 cpu_manager.go:214] "Starting CPU manager" policy="none"
I0518 19:48:52.109556  733385 cpu_manager.go:215] "Reconciling" reconcilePeriod="10s"
I0518 19:48:52.109566  733385 state_mem.go:36] "Initialized new in-memory state store"
I0518 19:48:52.110560  733385 state_mem.go:88] "Updated default CPUSet" cpuSet=""
I0518 19:48:52.110578  733385 state_mem.go:96] "Updated CPUSet assignments" assignments={}
I0518 19:48:52.110584  733385 policy_none.go:49] "None policy: Start"
I0518 19:48:52.110930  733385 memory_manager.go:170] "Starting memorymanager" policy="None"
I0518 19:48:52.110946  733385 state_mem.go:35] "Initializing new in-memory state store"
I0518 19:48:52.111060  733385 state_mem.go:75] "Updated machine memory state"
E0518 19:48:52.148026  733385 cgroup_manager_linux.go:476] cgroup manager.Set failed: openat2 /sys/fs/cgroup/kubepods.slice/cpu.weight: no such file or directory
E0518 19:48:52.148149  733385 kubelet.go:1542] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"
agaxprex commented 4 months ago

I do notice the following:

I0518 19:48:52.109545  733385 cpu_manager.go:214] "Starting CPU manager" policy="none"
I0518 19:48:52.109556  733385 cpu_manager.go:215] "Reconciling" reconcilePeriod="10s"
I0518 19:48:52.109566  733385 state_mem.go:36] "Initialized new in-memory state store"
I0518 19:48:52.110560  733385 state_mem.go:88] "Updated default CPUSet" cpuSet=""

cpuSet seems to be empty.

brandond commented 4 months ago

That is probably expected, since you're not using a CPU manager (policy="none").

I see that you're providing an rke2 config file with cni: none, presumably planning to deploy your own CNI. Are you by any chance also providing a custom containerd config template, or anything else along those lines?
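
For reference, RKE2 looks for a user-supplied containerd config template at a well-known path under its data directory; a quick way to check whether one is present (a sketch, assuming the default data directory /var/lib/rancher/rke2):

ls -l /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl 2>/dev/null \
  && echo "custom containerd template present" \
  || echo "no custom template; rke2 generates the containerd config itself"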

agaxprex commented 4 months ago

cni: none was set just to test whether the kubelet launches. Eventually I plan to run Multus with Canal or Cilium, but I'm holding off for now. I'm running with the default containerd configuration, and containerd is able to load the images, but the kubelet fails to run them. The problem is that the ContainerManager fails to start, which causes the kubelet to exit, so no containers launch. I'm not sure whether this is related to missing cgroups, or whether there is another way to debug what the following error refers to:

E0518 19:48:52.148149  733385 kubelet.go:1542] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"
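
One way to dig into that error is to look at the cgroup v2 interface files directly: if the cpu controller is not listed in cgroup.subtree_control at the root, systemd cannot create kubepods.slice with a cpu.weight file, which matches the openat2 error above. A minimal check, assuming a unified cgroup v2 hierarchy mounted at /sys/fs/cgroup:

cat /sys/fs/cgroup/cgroup.controllers      # controllers available on this kernel
cat /sys/fs/cgroup/cgroup.subtree_control  # controllers delegated to child cgroups
ls /sys/fs/cgroup/kubepods.slice 2>/dev/null || echo "kubepods.slice not present"
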
brandond commented 4 months ago

It seems like something is going on with the cgroups, but I'm not sure what: first the permission errors, and now complaints that expected cgroups don't exist. Are you sure everything has been cleaned off this node, and that nothing else is trying to manage the kubelet's expected cgroups? Are you able to reboot again to get things back to a known-good state?
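
A quick way to survey what is already managing cgroups on the node (podman, user sessions, leftover slices) before restarting rke2 could look like this (a sketch using standard systemd tooling):

systemd-cgls --no-pager | head -n 40   # show the current cgroup tree
ls /sys/fs/cgroup/                     # list top-level slices and controller files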

agaxprex commented 4 months ago

I do have podman running on the node, but that's the only other component on the system that uses cgroups. Would it be safe to clear out /sys/fs/cgroup and reboot? I'm not sure whether the directory would be recreated.

brandond commented 4 months ago

cgroups don't persist across boots, so you shouldn't need to manually clear anything out.

agaxprex commented 4 months ago

Wondering if I'm missing any kernel parameters that need to be enabled in grub. I'll take a look, then reboot.

agaxprex commented 4 months ago

Resolved: it turned out I was missing some cgroup controllers. Notably, when cgroup v2 (introduced around kernel 4.0) was enabled, the cpu controller was missing for the subtrees, meaning kubepods.slice was unable to use it. Switching the system back to cgroup v1 resolved all the issues. I disabled hybrid support as well, as that also caused problems.
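
For anyone hitting the same symptom, the cgroup mode can be confirmed and, on EL8-family systems, switched back to v1 roughly as follows. This is a sketch of the approach described above rather than commands taken from the thread, so verify the kernel arguments against your distro's documentation before applying them:

stat -fc %T /sys/fs/cgroup/    # "cgroup2fs" => unified v2, "tmpfs" => legacy/hybrid v1
grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"
reboot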