rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Static pods do not (re)start after fix issue 3725 #4775

Closed xhejtman closed 1 year ago

xhejtman commented 1 year ago

Environmental Info: RKE2 Version: 1.26.9-rke2r1

https://github.com/rancher/rke2/issues/3725

Node(s) CPU architecture, OS, and Version: Linux kub-b1.priv.cerit-sc.cz 6.2.0-32-generic #32~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 18 10:40:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 5 servers, 30 agents

Describe the bug: It seems that after the fix for issue https://github.com/rancher/rke2/issues/3725, static pods (kube-proxy) do not restart after an upgrade (from 1.26.7 in my case). It may be connected with the static CPU manager / topology manager.

config part:

kubelet-arg:
  - "v=5"
  - 'config=/var/lib/rancher/rke2/agent/etc/kubelet.config'
  - "system-reserved=memory=8Gi"
  - "reserved-cpus=64-127"
  - "max-pods=160"
  - "kube-api-qps=100"
  - "kube-api-burst=100"
  - "cpu-manager-policy=static"
  - "topology-manager-policy=best-effort"

syslog complains about the following, though it may be unrelated: kubepods-besteffort.slice: Failed to set 'cpuset.cpus' attribute on '/kubepods.slice/kubepods-besteffort.slice' to '': No space left on device

I removed /var/lib/kubelet/cpu_manager_state, deleted the kube-proxy pod, restarted rke2-agent, and it is working now. On some nodes I had to repeat this several times, otherwise kube-proxy stayed in Pending state and the kubelet complained that kube-proxy could not start yet.

Something similar happened on the control plane, where restarting rke2-server brought all static pods back.

I added AllowedCPUs=63,127 to the rke2-agent/rke2-server systemd units to limit rke2, the kubelet, and containerd to those two CPUs only, but that does not seem to be the cause of this issue.
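For context, the AllowedCPUs restriction is applied with a systemd drop-in roughly like the following (a sketch; the drop-in file name is arbitrary, and the same drop-in is created for rke2-server on server nodes):

# Sketch: restrict the rke2-agent service and everything it spawns (kubelet, containerd)
# to CPUs 63 and 127 via the cgroup cpuset controller.
mkdir -p /etc/systemd/system/rke2-agent.service.d
cat > /etc/systemd/system/rke2-agent.service.d/cpuset.conf <<'EOF'
[Service]
AllowedCPUs=63,127
EOF
systemctl daemon-reload
systemctl restart rke2-agent.service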

brandond commented 1 year ago

The issue you linked never prevented the static pods from actually being changed, it just made them appear unchanged if you queried the mirror pods from the apiserver.

It is documented upstream that you have to manually clean up the state files if you change the policy on a node after the kubelet has already started; this is not a defect in rke2.
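For reference, the upstream-documented cleanup amounts to roughly the following on an RKE2 node (a sketch; on RKE2 the kubelet is managed by the rke2-agent or rke2-server service rather than its own unit):

# Sketch of the documented procedure for changing the CPU manager policy on a node.
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data   # recommended before touching the kubelet
systemctl stop rke2-agent.service                                 # rke2-server on control-plane nodes
rm -f /var/lib/kubelet/cpu_manager_state
systemctl start rke2-agent.service
kubectl uncordon <node>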

xhejtman commented 1 year ago

I did not change the policy; it has been static the whole time. Actually, if the state is not cleaned, the kubelet won't start at all. I believe there is some issue introduced after 1.26.7.

xhejtman commented 1 year ago

@brandond we should reopen this issue.

The problem also occurs with the non-static CPU manager. If I upgrade rke2 (e.g. from 1.26.8 to 1.26.9) and reboot the node (without restarting rke2, so the upgrade takes effect after the reboot), it starts the old static pods, like etcd, and then tries to start another instance of etcd; however, the listening port is already taken by the first instance and the node won't start.
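For clarity, the reproduction sequence is roughly the following (a sketch, assuming the node was installed with the install script, which replaces the binaries but does not restart the running service):

# Sketch: upgrade the rke2 binaries in place, then reboot without restarting the service,
# so the new version only takes effect when the node comes back up.
curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION="v1.26.9+rke2r1" sh -
reboot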

Also, some other pods, e.g. the cloud controller manager, are not recreated; I am not sure whether they should be, though.

brandond commented 1 year ago

it starts the old static pods, like etcd, and then tries to start another instance of etcd; however, the listening port is already taken by the first instance and the node won't start.

This is already handled by startup checks that manually terminate old apiserver and etcd pods to ensure that there are no port conflicts. It is not possible that there is an old etcd pod left running. ref: https://github.com/rancher/rke2/blob/368ba42666c9664d58bd0a9f7d3d13cd38f6267d/pkg/rke2/spw.go#L26-L30

Can you post the complete logs (journald, and /var/log/pods) from RKE2 on your node? We specifically test upgrades prior to every release, and have not run into any of the issues you're describing.
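For example, something like this gathers both (a sketch; use rke2-agent as the unit name on agent-only nodes):

# Sketch: collect the RKE2 journald log and the kubelet's pod log directory.
journalctl -u rke2-server --no-pager > rke2-journal.log
tar czf rke2-pod-logs.tar.gz /var/log/pods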

xhejtman commented 1 year ago

I'm attaching the logs. This time it started, it only took about 30 minutes. Not sure if it discloses the real problem or not. The port issue with etcd I cannot reproduce right now. If these logs do not show anything, I can downgrade one cluster back to 1.26.7 and try to reproduce. logs.zip

brandond commented 1 year ago

It looks like originally you'd set things up so that the control-plane components were excluded from running on any of the first 16 cores (CpusetCpus:16-127) - is that correct?

time="2023-09-21T15:45:46.341755815+02:00" level=info msg="CreateContainer within sandbox \"d344b87da415e816753f41b0664360dac98751e635ea37e32f1f7becf2df9994\" for &ContainerMetadata{Name:etcd,Attempt:1,} returns container id \"27ff0f60750efdf78dac408dc71d887b5c3baea28ec80858d791d47a1cac0097\""`

time="2023-09-21T15:45:55.599918259+02:00" level=info msg="UpdateContainerResources for \"27ff0f60750efdf78dac408dc71d887b5c3baea28ec80858d791d47a1cac0097\" with Linux: &LinuxContainerResources{CpuPeriod:0,CpuQuota:0,CpuShares:0,MemoryLimitInBytes:0,OomScoreAdj:0,CpusetCpus:16-127,CpusetMems:,HugepageLimits:[]*HugepageLimit{},Unified:map[string]string{},MemorySwapLimitInBytes:0,} / Windows: nil"

It looks like this is removed later; all cpus are available:

time="2023-09-21T15:46:05.649003245+02:00" level=info msg="UpdateContainerResources for \"27ff0f60750efdf78dac408dc71d887b5c3baea28ec80858d791d47a1cac0097\" with Linux: &LinuxContainerResources{CpuPeriod:0,CpuQuota:0,CpuShares:0,MemoryLimitInBytes:0,OomScoreAdj:0,CpusetCpus:0-127,CpusetMems:,HugepageLimits:[]*HugepageLimit{},Unified:map[string]string{},MemorySwapLimitInBytes:0,} / Windows: nil"

That's a bit odd, but regardless, the pod appears to be running as it is writing logs during this time period. However, rke2 isn't able to see it as running via cri-api. It does show up briefly, but the apiserver static pod is not found, and then the next time we check it is missing again, and this goes on for a while:

Sep 21 15:45:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:45:57+02:00" level=info msg="Pod for etcd is synced"
Sep 21 15:45:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:45:57+02:00" level=info msg="Pod for kube-apiserver not synced (no current running pod found), retrying"
Sep 21 15:46:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:46:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:46:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:46:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:46:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:46:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:47:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:47:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:47:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:47:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:47:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:47:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:48:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:48:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:48:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:48:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:48:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:48:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:49:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:49:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:49:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:49:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:49:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:49:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:50:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:50:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:50:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:50:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:50:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:50:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:51:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:51:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:51:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:51:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:51:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:51:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:52:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:52:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:52:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:52:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:52:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:52:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:53:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:53:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:53:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:53:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:53:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:53:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:54:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:54:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:54:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:54:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:54:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:54:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:55:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:55:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:55:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:55:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:55:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:55:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:56:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:56:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:56:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:56:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:56:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:56:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:57:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:57:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:57:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:57:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:57:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:57:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:58:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:58:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:58:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:58:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:58:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:58:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:59:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:59:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 15:59:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:59:37+02:00" level=info msg="Pod for etcd is synced"
Sep 21 15:59:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:59:37+02:00" level=info msg="Pod for kube-apiserver not synced (no current running pod found), retrying"
Sep 21 15:59:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T15:59:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:00:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:00:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:00:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:00:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:00:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:00:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:01:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:01:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:01:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:01:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:01:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:01:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:02:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:02:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:02:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:02:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:02:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:02:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:03:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:03:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:03:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:03:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:03:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:03:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:04:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:04:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:04:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:04:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:04:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:04:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:05:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:05:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:05:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:05:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:05:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:05:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:06:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:06:17+02:00" level=info msg="Pod for etcd is synced"
Sep 21 16:06:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:06:17+02:00" level=info msg="Pod for kube-apiserver not synced (no current running pod found), retrying"
Sep 21 16:06:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:06:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:06:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:06:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:07:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:07:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:07:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:07:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:07:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:07:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:08:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:08:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:08:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:08:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:08:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:08:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:09:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:09:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:09:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:09:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:09:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:09:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:10:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:10:17+02:00" level=info msg="Pod for etcd is synced"
Sep 21 16:10:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:10:17+02:00" level=info msg="Pod for kube-apiserver not synced (no current running pod found), retrying"
Sep 21 16:10:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:10:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:10:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:10:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:11:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:11:17+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:11:37 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:11:37+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:11:57 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:11:57+02:00" level=info msg="Pod for etcd not synced (no current running pod found), retrying"
Sep 21 16:12:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:12:17+02:00" level=info msg="Pod for etcd is synced"
Sep 21 16:12:17 kub-a6.priv.cerit-sc.cz rke2[7220]: time="2023-09-21T16:12:17+02:00" level=info msg="Pod for kube-apiserver is synced"

I've not seen this before, and I'm not able to reproduce it locally. If you are able to reproduce this, would you mind gathering the output of crictl ps -a and crictl pods (run against the RKE2 containerd socket)?

xhejtman commented 1 year ago
root@kub-a9:~# CONTAINER_RUNTIME_ENDPOINT="unix:///var/run/k3s/containerd/containerd.sock" crictl ps -a
CONTAINER           IMAGE               CREATED              STATE               NAME                       ATTEMPT             POD ID              POD
92641967a527d       0450f377e3c61       About a minute ago   Running             scaphandre                 4                   289d380a10a32       scaphandre-8wb7t
a3d48375ec407       5131c4e1af289       About a minute ago   Running             fluent-bit                 6                   1778abd4391be       rancher-logging-root-fluentbit-wrvzh
d9b9ca5dbf391       09ebd98bbb217       About a minute ago   Running             image-cleaner-dind         8                   525e4acbadcc9       binderhub-dev-fleet-image-cleaner-dmv9l
6841404f87296       09ebd98bbb217       About a minute ago   Running             image-cleaner-dind         8                   2a6e71c1a2b10       binderhub-fleet-image-cleaner-qbh72
2800a6c26ec0d       19ae2c1c4f184       2 minutes ago        Running             nodeplugin                 8                   af442f1483f92       csi-cvmfs-cvmfs-csi-nodeplugin-crqfv
b320a51bd2980       720dcdb196378       2 minutes ago        Running             registrar                  8                   af442f1483f92       csi-cvmfs-cvmfs-csi-nodeplugin-crqfv
7b46d752ea1a1       d736edfbfb0cb       2 minutes ago        Running             dind                       8                   310cd798c3184       binderhub-dev-fleet-dind-tzxv6
4f3af453cf5bf       03aab484e2fdc       2 minutes ago        Running             oomkill-exporter           8                   5f8cfdf158236       oomkill-exporter-ntztm
3063f1d04b628       5131c4e1af289       2 minutes ago        Running             fluentbit                  6                   369ca1939f92f       rancher-logging-rke2-journald-aggregator-lh4dq
4d478980c5ab4       d736edfbfb0cb       2 minutes ago        Running             dind                       8                   a3b6bbb464cd5       binderhub-fleet-dind-kpd55
a0494fd2dacfd       8065b798a4d67       2 minutes ago        Running             calico-node                6                   ec558c637ac52       calico-node-kdb7q
f8441216df1d2       0dfc27eb82aec       2 minutes ago        Running             worker                     4                   6798fc9dc5038       gpu-operator-node-feature-discovery-worker-rrdbb
88843ce6122bb       219ee5171f800       2 minutes ago        Running             cleanup                    10                  bae13bc0cd2c9       democratic-nfs-temporary-democratic-csi-node-l6pjt
d13fcfe420fd7       219ee5171f800       2 minutes ago        Running             cleanup                    10                  164c7fa04fcad       democratic-nfs-backup-democratic-csi-node-rptqg
0a6f7a0472159       219ee5171f800       2 minutes ago        Running             cleanup                    10                  808856dc90494       democratic-manual-democratic-csi-node-j55lh
26a325555c448       219ee5171f800       2 minutes ago        Running             cleanup                    10                  4965e5e81fe54       democratic-nfs-democratic-csi-node-xpqbb
b4a6b21fd5293       cb03930a2bd42       2 minutes ago        Running             driver-registrar           10                  4965e5e81fe54       democratic-nfs-democratic-csi-node-xpqbb
277b709978677       cb03930a2bd42       2 minutes ago        Running             driver-registrar           10                  808856dc90494       democratic-manual-democratic-csi-node-j55lh
f8f2bbf40afd8       720dcdb196378       2 minutes ago        Running             driver-registrar           10                  bae13bc0cd2c9       democratic-nfs-temporary-democratic-csi-node-l6pjt
dfc61514e1555       720dcdb196378       2 minutes ago        Running             driver-registrar           10                  164c7fa04fcad       democratic-nfs-backup-democratic-csi-node-rptqg
25f090e66e0f1       ec8c0ae05939c       2 minutes ago        Running             kube-rke2-multus           2                   37a17a20a970b       rke2-multus-ds-m87vt
59af341b21bde       8e3dccd553adc       2 minutes ago        Running             csi-proxy                  10                  808856dc90494       democratic-manual-democratic-csi-node-j55lh
0dd8aefd31a10       8e3dccd553adc       2 minutes ago        Running             csi-proxy                  10                  4965e5e81fe54       democratic-nfs-democratic-csi-node-xpqbb
6c26d2165fa8f       24295910daf03       2 minutes ago        Running             csi-proxy                  10                  bae13bc0cd2c9       democratic-nfs-temporary-democratic-csi-node-l6pjt
124b80831bfaa       24295910daf03       2 minutes ago        Running             csi-proxy                  10                  164c7fa04fcad       democratic-nfs-backup-democratic-csi-node-rptqg
0b74b12a60c36       fded1d6b6dca9       2 minutes ago        Running             csi-driver                 10                  bae13bc0cd2c9       democratic-nfs-temporary-democratic-csi-node-l6pjt
44c8d3a92b0fe       fded1d6b6dca9       2 minutes ago        Running             csi-driver                 10                  164c7fa04fcad       democratic-nfs-backup-democratic-csi-node-rptqg
7b0f3d30dfb9b       fded1d6b6dca9       2 minutes ago        Running             csi-driver                 10                  808856dc90494       democratic-manual-democratic-csi-node-j55lh
b9d537b010ae4       fded1d6b6dca9       2 minutes ago        Running             csi-driver                 10                  4965e5e81fe54       democratic-nfs-democratic-csi-node-xpqbb
0b93394d1d06c       9dee260ef7f59       2 minutes ago        Exited              install-cni                0                   ec558c637ac52       calico-node-kdb7q
b71ea13743544       89c8eea805129       2 minutes ago        Running             liveness-prometheus        3                   02635e4684d50       ceph-csi-rbd-nodeplugin-sq287
4230242fae7c3       cbf6dc777a01e       2 minutes ago        Running             sshfs                      8                   a7cf8417958ad       csi-nodeplugin-sshfs-rsjlz
250257eedc04f       d73bd74b21c56       2 minutes ago        Running             webdav                     10                  41f43af9e7533       csi-nodeplugin-webdav-z8hd2
d7c0d4c9876b9       89c8eea805129       2 minutes ago        Running             csi-rbdplugin              3                   02635e4684d50       ceph-csi-rbd-nodeplugin-sq287
eea4fa09db732       d4055c8648fe3       2 minutes ago        Running             onedata                    7                   89efd1d26fcca       csi-nodeplugin-onedata-f2wzh
7c327da6c6770       57e977dd618a0       2 minutes ago        Running             rke2-whereabouts           2                   f82db99de8bdf       rke2-multus-rke2-whereabouts-vm56g
df5f39050def3       bbd91fd54b288       2 minutes ago        Running             pushprox-client            9                   fc703d03f8b79       pushprox-kube-scheduler-client-jp5vd
79898200e907d       f02acafbf968d       2 minutes ago        Exited              cni-plugins                2                   37a17a20a970b       rke2-multus-ds-m87vt
551de7dbe88bf       092a973bb20ee       2 minutes ago        Exited              flexvol-driver             4                   ec558c637ac52       calico-node-kdb7q
50fc3813a2eeb       bbd91fd54b288       2 minutes ago        Running             pushprox-client            8                   b30d0ff9f7bec       pushprox-kube-controller-manager-client-x4wnl
947325aeb5253       f45c8a305a0bb       2 minutes ago        Running             driver-registrar           3                   02635e4684d50       ceph-csi-rbd-nodeplugin-sq287
f860bbc2ab753       ed5bba5d71b95       2 minutes ago        Running             kube-vip                   12                  ea4424efed6a7       kube-vip-ds-h6l7x
40b5ad1d8cbed       cb03930a2bd42       2 minutes ago        Running             node-driver-registrar      8                   a7cf8417958ad       csi-nodeplugin-sshfs-rsjlz
7d3e35e8eea5c       cb03930a2bd42       2 minutes ago        Running             node-driver-registrar      7                   89efd1d26fcca       csi-nodeplugin-onedata-f2wzh
68212d4602be8       bbd91fd54b288       2 minutes ago        Running             pushprox-client            9                   f610191858119       pushprox-kube-proxy-client-4nlcr
72d80ecb04b79       eac98f8b0c07c       2 minutes ago        Running             node-driver-registrar      11                  41f43af9e7533       csi-nodeplugin-webdav-z8hd2
4f11d64824989       1dbe0e9319764       2 minutes ago        Running             node-exporter              7                   fe89bbb859a55       rancher-monitoring-prometheus-node-exporter-64cjw
b8fa00382f9a2       bbd91fd54b288       2 minutes ago        Running             pushprox-client            10                  4339b392df406       pushprox-kube-etcd-client-4zs4b
4cdf83e91f438       f906d1e7a5774       2 minutes ago        Running             cloud-controller-manager   4                   20a4c4d8c6a43       cloud-controller-manager-kub-a9.priv.cerit-sc.cz
50d7343feb779       8e07469479428       2 minutes ago        Running             kube-controller-manager    4                   d7d78e1c448e5       kube-controller-manager-kub-a9.priv.cerit-sc.cz
5c41b8f5c8007       8e07469479428       2 minutes ago        Running             kube-apiserver             2                   270b9eac4bb05       kube-apiserver-kub-a9.priv.cerit-sc.cz
8b9f4a7e79fbc       8e07469479428       2 minutes ago        Exited              kube-controller-manager    3                   d7d78e1c448e5       kube-controller-manager-kub-a9.priv.cerit-sc.cz
bfc8c0a995282       8e07469479428       2 minutes ago        Running             kube-scheduler             2                   e961bae0675ff       kube-scheduler-kub-a9.priv.cerit-sc.cz
dbce3e9d660fa       c6b7a4f2f79b2       2 minutes ago        Running             etcd                       2                   fd625c7991cbc       etcd-kub-a9.priv.cerit-sc.cz
91d770a6e5a28       f906d1e7a5774       2 minutes ago        Exited              cloud-controller-manager   3                   20a4c4d8c6a43       cloud-controller-manager-kub-a9.priv.cerit-sc.cz
7b7e3979b4dac       09ebd98bbb217       2 days ago           Exited              image-cleaner-dind         7                   cdcbe605fb40d       binderhub-fleet-image-cleaner-qbh72
adc844c5d4c47       19ae2c1c4f184       2 days ago           Exited              nodeplugin                 7                   623826567247d       csi-cvmfs-cvmfs-csi-nodeplugin-crqfv
d726a9d77c63d       720dcdb196378       2 days ago           Exited              registrar                  7                   623826567247d       csi-cvmfs-cvmfs-csi-nodeplugin-crqfv
1ec1e0d247960       5131c4e1af289       2 days ago           Exited              fluent-bit                 5                   5d1b38d7e85bc       rancher-logging-root-fluentbit-wrvzh
3baf10e004e8c       0450f377e3c61       2 days ago           Exited              scaphandre                 3                   c08123af4915a       scaphandre-8wb7t
f277d2a633de1       0dfc27eb82aec       2 days ago           Exited              worker                     3                   2868b72dfd51b       gpu-operator-node-feature-discovery-worker-rrdbb
a60efbfb94944       d736edfbfb0cb       2 days ago           Exited              dind                       7                   58168599fa361       binderhub-fleet-dind-kpd55
1e8e9b7aff56e       09ebd98bbb217       2 days ago           Exited              image-cleaner-dind         7                   6d9a25edb736d       binderhub-dev-fleet-image-cleaner-dmv9l
b1bfb32eb32b7       8065b798a4d67       2 days ago           Exited              calico-node                5                   3b4e3ef8b4807       calico-node-kdb7q
ab443add191b6       5131c4e1af289       2 days ago           Exited              fluentbit                  5                   3196231990855       rancher-logging-rke2-journald-aggregator-lh4dq
537c78efe151a       d736edfbfb0cb       2 days ago           Exited              dind                       7                   5e69a19212cfc       binderhub-dev-fleet-dind-tzxv6
f30bd0cfb4de6       03aab484e2fdc       2 days ago           Exited              oomkill-exporter           7                   28430fc628ad5       oomkill-exporter-ntztm
5fbdffec5ce50       219ee5171f800       2 days ago           Exited              cleanup                    9                   41818d5136d63       democratic-manual-democratic-csi-node-j55lh
acdd1748cb68f       219ee5171f800       2 days ago           Exited              cleanup                    9                   399d9279e8000       democratic-nfs-temporary-democratic-csi-node-l6pjt
c79ff830e147e       219ee5171f800       2 days ago           Exited              cleanup                    9                   61f79a2ef7b13       democratic-nfs-democratic-csi-node-xpqbb
714bb8554d8e7       219ee5171f800       2 days ago           Exited              cleanup                    9                   d85905e40d3ee       democratic-nfs-backup-democratic-csi-node-rptqg
82fea9f384731       720dcdb196378       2 days ago           Exited              driver-registrar           9                   399d9279e8000       democratic-nfs-temporary-democratic-csi-node-l6pjt
b5c6a449039e6       cb03930a2bd42       2 days ago           Exited              driver-registrar           9                   41818d5136d63       democratic-manual-democratic-csi-node-j55lh
18b1b655a9e8a       720dcdb196378       2 days ago           Exited              driver-registrar           9                   d85905e40d3ee       democratic-nfs-backup-democratic-csi-node-rptqg
3fd19dba4b8fc       cb03930a2bd42       2 days ago           Exited              driver-registrar           9                   61f79a2ef7b13       democratic-nfs-democratic-csi-node-xpqbb
311dbe969bf11       ec8c0ae05939c       2 days ago           Exited              kube-rke2-multus           1                   00fe7f03d2f3d       rke2-multus-ds-m87vt
c4fdc2c5d3c39       8e3dccd553adc       2 days ago           Exited              csi-proxy                  9                   41818d5136d63       democratic-manual-democratic-csi-node-j55lh
777f4f4e95298       24295910daf03       2 days ago           Exited              csi-proxy                  9                   d85905e40d3ee       democratic-nfs-backup-democratic-csi-node-rptqg
1facdef95e0f5       8e3dccd553adc       2 days ago           Exited              csi-proxy                  9                   61f79a2ef7b13       democratic-nfs-democratic-csi-node-xpqbb
ab2a2a36cf829       24295910daf03       2 days ago           Exited              csi-proxy                  9                   399d9279e8000       democratic-nfs-temporary-democratic-csi-node-l6pjt
3bcd8d62e027b       fded1d6b6dca9       2 days ago           Exited              csi-driver                 9                   d85905e40d3ee       democratic-nfs-backup-democratic-csi-node-rptqg
67bc34d6e5eb7       fded1d6b6dca9       2 days ago           Exited              csi-driver                 9                   399d9279e8000       democratic-nfs-temporary-democratic-csi-node-l6pjt
e7b04b8fa3639       fded1d6b6dca9       2 days ago           Exited              csi-driver                 9                   61f79a2ef7b13       democratic-nfs-democratic-csi-node-xpqbb
5012177c75d86       fded1d6b6dca9       2 days ago           Exited              csi-driver                 9                   41818d5136d63       democratic-manual-democratic-csi-node-j55lh
96e86a9d99ff5       89c8eea805129       2 days ago           Exited              liveness-prometheus        2                   6813cb0cd66ab       ceph-csi-rbd-nodeplugin-sq287
c21496b89f21e       d73bd74b21c56       2 days ago           Exited              webdav                     9                   a2bc1edf38d07       csi-nodeplugin-webdav-z8hd2
40d1ca6223011       cbf6dc777a01e       2 days ago           Exited              sshfs                      7                   d65efaeb9ec81       csi-nodeplugin-sshfs-rsjlz
6d0715f2cd348       89c8eea805129       2 days ago           Exited              csi-rbdplugin              2                   6813cb0cd66ab       ceph-csi-rbd-nodeplugin-sq287
809834db80376       d4055c8648fe3       2 days ago           Exited              onedata                    6                   a5acc1bd23fb1       csi-nodeplugin-onedata-f2wzh
d52ee4c3d3652       57e977dd618a0       2 days ago           Exited              rke2-whereabouts           1                   183505df60db5       rke2-multus-rke2-whereabouts-vm56g
9da639df63f2d       ed5bba5d71b95       2 days ago           Exited              kube-vip                   11                  4662c7670706a       kube-vip-ds-h6l7x
d9ff8b860a5c5       eac98f8b0c07c       2 days ago           Exited              node-driver-registrar      10                  a2bc1edf38d07       csi-nodeplugin-webdav-z8hd2
9d7eaccc26ab5       f45c8a305a0bb       2 days ago           Exited              driver-registrar           2                   6813cb0cd66ab       ceph-csi-rbd-nodeplugin-sq287
7559162840a49       cb03930a2bd42       2 days ago           Exited              node-driver-registrar      6                   a5acc1bd23fb1       csi-nodeplugin-onedata-f2wzh
f0f36d949972b       cb03930a2bd42       2 days ago           Exited              node-driver-registrar      7                   d65efaeb9ec81       csi-nodeplugin-sshfs-rsjlz
419b80e5e5f11       1dbe0e9319764       2 days ago           Exited              node-exporter              6                   e62de36a39b55       rancher-monitoring-prometheus-node-exporter-64cjw
0c3e76f73ae97       bbd91fd54b288       2 days ago           Exited              pushprox-client            8                   695a5d1f0ffdd       pushprox-kube-proxy-client-4nlcr
4fa670a6293f8       bbd91fd54b288       2 days ago           Exited              pushprox-client            7                   6cd557fe57993       pushprox-kube-controller-manager-client-x4wnl
d1c8e45fe0ddd       bbd91fd54b288       2 days ago           Exited              pushprox-client            9                   b84869ab9c899       pushprox-kube-etcd-client-4zs4b
5145ebb0c3da6       bbd91fd54b288       2 days ago           Exited              pushprox-client            8                   6220782339e27       pushprox-kube-scheduler-client-jp5vd
d81ebd4cfcaa6       8e07469479428       2 days ago           Exited              kube-apiserver             1                   dedaaee943f24       kube-apiserver-kub-a9.priv.cerit-sc.cz
c565e942ea23e       c6b7a4f2f79b2       2 days ago           Exited              etcd                       1                   d5f2838e71227       etcd-kub-a9.priv.cerit-sc.cz
c2af4fd971763       8e07469479428       2 days ago           Exited              kube-scheduler             1                   8e8388f760c52       kube-scheduler-kub-a9.priv.cerit-sc.cz
root@kub-a9:~# CONTAINER_RUNTIME_ENDPOINT="unix:///var/run/k3s/containerd/containerd.sock" crictl pods
POD ID              CREATED             STATE               NAME                                                 NAMESPACE                  ATTEMPT             RUNTIME
525e4acbadcc9       2 minutes ago       Ready               binderhub-dev-fleet-image-cleaner-dmv9l              binderhub-dev-ns           9                   (default)
2a6e71c1a2b10       2 minutes ago       Ready               binderhub-fleet-image-cleaner-qbh72                  binderhub-ns               9                   (default)
1778abd4391be       2 minutes ago       Ready               rancher-logging-root-fluentbit-wrvzh                 cattle-logging-system      6                   (default)
310cd798c3184       2 minutes ago       Ready               binderhub-dev-fleet-dind-tzxv6                       binderhub-dev-ns           8                   (default)
289d380a10a32       2 minutes ago       Ready               scaphandre-8wb7t                                     scaphandre                 4                   (default)
af442f1483f92       2 minutes ago       Ready               csi-cvmfs-cvmfs-csi-nodeplugin-crqfv                 csi-storage                9                   (default)
369ca1939f92f       2 minutes ago       Ready               rancher-logging-rke2-journald-aggregator-lh4dq       cattle-logging-system      6                   (default)
a3b6bbb464cd5       2 minutes ago       Ready               binderhub-fleet-dind-kpd55                           binderhub-ns               8                   (default)
5f8cfdf158236       2 minutes ago       Ready               oomkill-exporter-ntztm                               oom-detector               8                   (default)
6798fc9dc5038       2 minutes ago       Ready               gpu-operator-node-feature-discovery-worker-rrdbb     gpu-operator               4                   (default)
37a17a20a970b       2 minutes ago       Ready               rke2-multus-ds-m87vt                                 kube-system                2                   (default)
fc703d03f8b79       2 minutes ago       Ready               pushprox-kube-scheduler-client-jp5vd                 cattle-monitoring-system   8                   (default)
f82db99de8bdf       2 minutes ago       Ready               rke2-multus-rke2-whereabouts-vm56g                   kube-system                2                   (default)
ec558c637ac52       2 minutes ago       Ready               calico-node-kdb7q                                    calico-system              6                   (default)
b30d0ff9f7bec       2 minutes ago       Ready               pushprox-kube-controller-manager-client-x4wnl        cattle-monitoring-system   8                   (default)
02635e4684d50       2 minutes ago       Ready               ceph-csi-rbd-nodeplugin-sq287                        csi-storage                3                   (default)
ea4424efed6a7       2 minutes ago       Ready               kube-vip-ds-h6l7x                                    kube-system                8                   (default)
89efd1d26fcca       2 minutes ago       Ready               csi-nodeplugin-onedata-f2wzh                         csi-storage                7                   (default)
41f43af9e7533       2 minutes ago       Ready               csi-nodeplugin-webdav-z8hd2                          csi-storage                10                  (default)
a7cf8417958ad       2 minutes ago       Ready               csi-nodeplugin-sshfs-rsjlz                           csi-storage                8                   (default)
f610191858119       2 minutes ago       Ready               pushprox-kube-proxy-client-4nlcr                     cattle-monitoring-system   8                   (default)
164c7fa04fcad       2 minutes ago       Ready               democratic-nfs-backup-democratic-csi-node-rptqg      csi-storage                10                  (default)
fe89bbb859a55       2 minutes ago       Ready               rancher-monitoring-prometheus-node-exporter-64cjw    cattle-monitoring-system   7                   (default)
bae13bc0cd2c9       2 minutes ago       Ready               democratic-nfs-temporary-democratic-csi-node-l6pjt   csi-storage                10                  (default)
808856dc90494       2 minutes ago       Ready               democratic-manual-democratic-csi-node-j55lh          csi-storage                10                  (default)
4965e5e81fe54       2 minutes ago       Ready               democratic-nfs-democratic-csi-node-xpqbb             csi-storage                10                  (default)
4339b392df406       2 minutes ago       Ready               pushprox-kube-etcd-client-4zs4b                      cattle-monitoring-system   10                  (default)
270b9eac4bb05       2 minutes ago       Ready               kube-apiserver-kub-a9.priv.cerit-sc.cz               kube-system                2                   (default)
e961bae0675ff       2 minutes ago       Ready               kube-scheduler-kub-a9.priv.cerit-sc.cz               kube-system                2                   (default)
d7d78e1c448e5       2 minutes ago       Ready               kube-controller-manager-kub-a9.priv.cerit-sc.cz      kube-system                2                   (default)
fd625c7991cbc       2 minutes ago       Ready               etcd-kub-a9.priv.cerit-sc.cz                         kube-system                2                   (default)
20a4c4d8c6a43       2 minutes ago       Ready               cloud-controller-manager-kub-a9.priv.cerit-sc.cz     kube-system                2                   (default)
cdcbe605fb40d       2 days ago          NotReady            binderhub-fleet-image-cleaner-qbh72                  binderhub-ns               8                   (default)
623826567247d       2 days ago          NotReady            csi-cvmfs-cvmfs-csi-nodeplugin-crqfv                 csi-storage                8                   (default)
5d1b38d7e85bc       2 days ago          NotReady            rancher-logging-root-fluentbit-wrvzh                 cattle-logging-system      5                   (default)
6d9a25edb736d       2 days ago          NotReady            binderhub-dev-fleet-image-cleaner-dmv9l              binderhub-dev-ns           8                   (default)
c08123af4915a       2 days ago          NotReady            scaphandre-8wb7t                                     scaphandre                 3                   (default)
3196231990855       2 days ago          NotReady            rancher-logging-rke2-journald-aggregator-lh4dq       cattle-logging-system      5                   (default)
58168599fa361       2 days ago          NotReady            binderhub-fleet-dind-kpd55                           binderhub-ns               7                   (default)
5e69a19212cfc       2 days ago          NotReady            binderhub-dev-fleet-dind-tzxv6                       binderhub-dev-ns           7                   (default)
28430fc628ad5       2 days ago          NotReady            oomkill-exporter-ntztm                               oom-detector               7                   (default)
2868b72dfd51b       2 days ago          NotReady            gpu-operator-node-feature-discovery-worker-rrdbb     gpu-operator               3                   (default)
a2bc1edf38d07       2 days ago          NotReady            csi-nodeplugin-webdav-z8hd2                          csi-storage                9                   (default)
00fe7f03d2f3d       2 days ago          NotReady            rke2-multus-ds-m87vt                                 kube-system                1                   (default)
a5acc1bd23fb1       2 days ago          NotReady            csi-nodeplugin-onedata-f2wzh                         csi-storage                6                   (default)
6813cb0cd66ab       2 days ago          NotReady            ceph-csi-rbd-nodeplugin-sq287                        csi-storage                2                   (default)
4662c7670706a       2 days ago          NotReady            kube-vip-ds-h6l7x                                    kube-system                7                   (default)
d65efaeb9ec81       2 days ago          NotReady            csi-nodeplugin-sshfs-rsjlz                           csi-storage                7                   (default)
d85905e40d3ee       2 days ago          NotReady            democratic-nfs-backup-democratic-csi-node-rptqg      csi-storage                9                   (default)
183505df60db5       2 days ago          NotReady            rke2-multus-rke2-whereabouts-vm56g                   kube-system                1                   (default)
e62de36a39b55       2 days ago          NotReady            rancher-monitoring-prometheus-node-exporter-64cjw    cattle-monitoring-system   6                   (default)
3b4e3ef8b4807       2 days ago          NotReady            calico-node-kdb7q                                    calico-system              5                   (default)
695a5d1f0ffdd       2 days ago          NotReady            pushprox-kube-proxy-client-4nlcr                     cattle-monitoring-system   7                   (default)
6cd557fe57993       2 days ago          NotReady            pushprox-kube-controller-manager-client-x4wnl        cattle-monitoring-system   7                   (default)
b84869ab9c899       2 days ago          NotReady            pushprox-kube-etcd-client-4zs4b                      cattle-monitoring-system   9                   (default)
6220782339e27       2 days ago          NotReady            pushprox-kube-scheduler-client-jp5vd                 cattle-monitoring-system   7                   (default)
41818d5136d63       2 days ago          NotReady            democratic-manual-democratic-csi-node-j55lh          csi-storage                9                   (default)
61f79a2ef7b13       2 days ago          NotReady            democratic-nfs-democratic-csi-node-xpqbb             csi-storage                9                   (default)
399d9279e8000       2 days ago          NotReady            democratic-nfs-temporary-democratic-csi-node-l6pjt   csi-storage                9                   (default)
dedaaee943f24       2 days ago          NotReady            kube-apiserver-kub-a9.priv.cerit-sc.cz               kube-system                1                   (default)
8e8388f760c52       2 days ago          NotReady            kube-scheduler-kub-a9.priv.cerit-sc.cz               kube-system                1                   (default)
d5f2838e71227       2 days ago          NotReady            etcd-kub-a9.priv.cerit-sc.cz                         kube-system                1                   (default)
root@kub-a9:~#
xhejtman commented 1 year ago

It looks like originally you'd set things up so that the control-plane components were excluded from running on any of the first 16 cores (CpusetCpus:16-127) - is that correct?

no, not intentionally.

This is the full config (except server and token):

tls-san:
  - kub-a6.priv.cerit-sc.cz
  - kub-a.priv.cerit-sc.cz
tls-san-security: true
disable:
  - rke2-canal
cni: multus,calico

cluster-cidr: 10.42.0.0/16,xxx::8:2::/96

service-cidr: 10.43.0.0/16,xxx:8:3::/112

kube-controller-manager-arg:
        - "node-cidr-mask-size-ipv6=108"

audit-policy-file: "/etc/rancher/rke2/audit-policy.yaml"

kube-apiserver-arg:
        - "encryption-provider-config=/etc/rancher/rke2/encryption-provider-config.json"
        - "audit-log-maxage=30"
        - "audit-log-maxbackup=10"
        - "audit-log-path=/var/log/kube-audit/audit.log"
        - "audit-log-maxsize=100"
        - "audit-log-format=json"
        - "audit-policy-file=/etc/rancher/rke2/audit-policy.yaml"

etcd-s3-access-key: xxx
etcd-s3-secret-key: xxx
etcd-s3: true
etcd-s3-endpoint: xxx
etcd-s3-bucket: k8s-muni
etcd-s3-folder: kub-a-etcd
etcd-snapshot-schedule-cron: '0 */6 * * *'
etcd-snapshot-retention: 10
node-label:
  - "storage=local-ssd"

profile: cis-1.23

kube-scheduler-arg:
  - config=/var/lib/rancher/rke2/server/sched/scheduler-policy-config.yaml
  - bind-address=0.0.0.0

kubelet-arg:
  - "v=5"
  - 'config=/var/lib/rancher/rke2/agent/etc/kubelet.config'
  - "reserved-cpus=64-127"
  - "max-pods=160"
  - "kube-api-qps=100"
  - "kube-api-burst=100"
  - "cpu-manager-policy=static"
  - "topology-manager-policy=best-effort"

etcd-expose-metrics: "true"
kube-proxy-arg:
  - metrics-bind-address=0.0.0.0:10249

node-ip: '10.16.62.15,xxx:253:131'

control-plane-resource-requests: "kube-apiserver-memory=3G,etcd-memory=32G"

pod-security-admission-config-file: /etc/rancher/rke2/rke2-pss-cerit.yaml
xhejtman commented 1 year ago

If you are able to reproduce this, would you mind gathering the output of:

I gathered it at the time it was trying to start. It seems I am able to reproduce it; not on every single node, it varies, but I can do it. If I need to gather it at a different time, just tell me.

brandond commented 1 year ago
kubelet-arg:
  - "max-pods=160"
  - "reserved-cpus=64-127"
  - "cpu-manager-policy=static"

Did you change the reserved-cpus or cpu manager policy at some point? I will say that we don't do a lot of testing on nodes with 128 cores, or with more than the recommended number of pods per node.

brandond commented 1 year ago

Ohh, I see. There are two pod sandboxes with the same name, and two containers. The old one isn't being cleaned up for some reason.

dbce3e9d660fa       c6b7a4f2f79b2       2 minutes ago        Running             etcd                       2                   fd625c7991cbc       etcd-kub-a9.priv.cerit-sc.cz
c565e942ea23e       c6b7a4f2f79b2       2 days ago           Exited              etcd                       1                   d5f2838e71227       etcd-kub-a9.priv.cerit-sc.cz

fd625c7991cbc       2 minutes ago       Ready               etcd-kub-a9.priv.cerit-sc.cz       
d5f2838e71227       2 days ago          NotReady            etcd-kub-a9.priv.cerit-sc.cz                         

These should get cleaned up after they exit; I'm not sure what would cause them to linger around for 2 days. I've not seen this before, but the code that checks for pods clearly needs to handle this properly.
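Until that is handled, an interim manual cleanup would be removing the stale sandbox yourself (a sketch; d5f2838e71227 is the 2-day-old NotReady sandbox from the output above):

# Sketch: force-remove the stale NotReady etcd sandbox so only the running one remains.
export CONTAINER_RUNTIME_ENDPOINT="unix:///var/run/k3s/containerd/containerd.sock"
crictl rmp -f d5f2838e71227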

xhejtman commented 1 year ago
kubelet-arg:
  - "max-pods=160"
  - "reserved-cpus=64-127"
  - "cpu-manager-policy=static"

Did you change the reserved-cpus or cpu manager policy at some point? I will say that we don't do a lot of testing on nodes with 128 cores, or with more than the recommended number of pods per node.

No, I definitely did not change these settings; they have been the same the whole time.

brandond commented 1 year ago

No, I definitely did not change these settings; they have been the same the whole time.

Hmm, I'm curious why the kubelet is changing the allowed cpus then. It doesn't match what you've reserved, either.

Can you get info on the duplicate etcd pods?

CONTAINER_RUNTIME_ENDPOINT="unix:///var/run/k3s/containerd/containerd.sock" crictl pods -o yaml --label component=etcd
xhejtman commented 1 year ago

Ohh, I see. There are two pod sandboxes with the same name, and two containers. The old one isn't being cleaned up for some reason.

dbce3e9d660fa       c6b7a4f2f79b2       2 minutes ago        Running             etcd                       2                   fd625c7991cbc       etcd-kub-a9.priv.cerit-sc.cz
c565e942ea23e       c6b7a4f2f79b2       2 days ago           Exited              etcd                       1                   d5f2838e71227       etcd-kub-a9.priv.cerit-sc.cz

fd625c7991cbc       2 minutes ago       Ready               etcd-kub-a9.priv.cerit-sc.cz       
d5f2838e71227       2 days ago          NotReady            etcd-kub-a9.priv.cerit-sc.cz                         

These should get cleaned up after they exit; I'm not sure what would cause them to linger around for 2 days. I've not seen this before, but the code that checks for pods clearly needs to handle this properly.

In this particular case, I saw that kube-proxy was (re)started after service rke2-server start, but after a while it was deleted and only started again after about 20 minutes. Maybe it just waits for etcd to resolve first.

Also, it came to my mind: do you test with service or with systemctl? I have seen some services that insist on being started with systemctl; when started via service, they do not start properly.

xhejtman commented 1 year ago
CONTAINER_RUNTIME_ENDPOINT="unix:///var/run/k3s/containerd/containerd.sock" crictl pods -o yaml --label component=etcd
root@kub-a9:~# CONTAINER_RUNTIME_ENDPOINT="unix:///var/run/k3s/containerd/containerd.sock" crictl pods -o yaml --label component=etcd
items:
- annotations:
    etcd.k3s.io/initial: '{"initial-cluster":"kub-a7.priv.cerit-sc.cz-a8d35e88=https://10.16.62.16:2380,kub-a8.priv.cerit-sc.cz-50f5697e=https://10.16.62.17:2380,kub-a6.priv.cerit-sc.cz-93045237=https://10.16.62.15:2380,kub-a5.priv.cerit-sc.cz-1c560f32=https://10.16.62.14:2380,kub-a9.priv.cerit-sc.cz-e845e62d=https://10.16.62.18:2380","initial-cluster-state":"existing"}'
    kubernetes.io/config.hash: ebbe668bade9a11dd9c3bf2ca745b3cf
    kubernetes.io/config.seen: "2023-09-22T10:55:55.992366782+02:00"
    kubernetes.io/config.source: file
  createdAt: "1695372956804642423"
  id: fd625c7991cbc71917f6d3bf91504ec9be97d31bc6a78fa9bebe3d84cb5aae68
  labels:
    component: etcd
    io.kubernetes.pod.name: etcd-kub-a9.priv.cerit-sc.cz
    io.kubernetes.pod.namespace: kube-system
    io.kubernetes.pod.uid: ebbe668bade9a11dd9c3bf2ca745b3cf
    tier: control-plane
  metadata:
    attempt: 2
    name: etcd-kub-a9.priv.cerit-sc.cz
    namespace: kube-system
    uid: ebbe668bade9a11dd9c3bf2ca745b3cf
  runtimeHandler: ""
  state: SANDBOX_READY
- annotations:
    etcd.k3s.io/initial: '{"initial-cluster":"kub-a7.priv.cerit-sc.cz-a8d35e88=https://10.16.62.16:2380,kub-a8.priv.cerit-sc.cz-50f5697e=https://10.16.62.17:2380,kub-a6.priv.cerit-sc.cz-93045237=https://10.16.62.15:2380,kub-a5.priv.cerit-sc.cz-1c560f32=https://10.16.62.14:2380,kub-a9.priv.cerit-sc.cz-e845e62d=https://10.16.62.18:2380","initial-cluster-state":"existing"}'
    kubernetes.io/config.hash: ebbe668bade9a11dd9c3bf2ca745b3cf
    kubernetes.io/config.seen: "2023-09-20T10:39:46.035747727+02:00"
    kubernetes.io/config.source: file
  createdAt: "1695199186603452337"
  id: d5f2838e71227673be2c93bd69a3cda3c648e1e13b98b00c86432d6e3e2ca01d
  labels:
    component: etcd
    io.kubernetes.pod.name: etcd-kub-a9.priv.cerit-sc.cz
    io.kubernetes.pod.namespace: kube-system
    io.kubernetes.pod.uid: ebbe668bade9a11dd9c3bf2ca745b3cf
    tier: control-plane
  metadata:
    attempt: 1
    name: etcd-kub-a9.priv.cerit-sc.cz
    namespace: kube-system
    uid: ebbe668bade9a11dd9c3bf2ca745b3cf
  runtimeHandler: ""
  state: SANDBOX_NOTREADY
brandond commented 1 year ago

OK, that's interesting. I've not seen the kubelet create multiple sandboxes for the same static pod. We know they're the same because they have the same config hash and pod uid. I guess this will happen if etcd crashes, and the kubelet needs to create a new pod instead of just restarting the current one for whatever reason.

I'll take a look at how to better handle this for our next release.
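
In the meantime, if a stale sandbox gets in the way, one workaround (a sketch only, using the old sandbox id from your output; verify it is NotReady before removing anything) is to drop it manually:

export CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/k3s/containerd/containerd.sock
crictl inspectp d5f2838e71227 | grep -i state
crictl rmp d5f2838e71227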

xhejtman commented 1 year ago

OK, that's interesting. I've not seen the kubelet create multiple sandboxes for the same static pod. We know they're the same because they have the same config hash and pod uid.

I'll take a look at how to better handle this for our next release.

Thank you. Also for the support!

xhejtman commented 1 year ago

Not sure if it is still the same problem here, but this is also happening:

root@kub-b9:~# CONTAINER_RUNTIME_ENDPOINT="unix:///var/run/k3s/containerd/containerd.sock" crictl pods -o yaml --label io.kubernetes.pod.name=kube-proxy-kub-b9.priv.cerit-sc.cz
items:
- annotations:
    kubernetes.io/config.hash: ce051fc91f9e463593a1d45efa60be52
    kubernetes.io/config.seen: "2023-10-03T14:17:50.457343231+02:00"
    kubernetes.io/config.source: file
  createdAt: "1696336132298880783"
  id: e505595e4af3ccb8e46c9dda0a79ce997b89c33529600708a389b7874b386ad3
  labels:
    component: kube-proxy
    io.kubernetes.pod.name: kube-proxy-kub-b9.priv.cerit-sc.cz
    io.kubernetes.pod.namespace: kube-system
    io.kubernetes.pod.uid: ce051fc91f9e463593a1d45efa60be52
    tier: control-plane
  metadata:
    attempt: 1
    name: kube-proxy-kub-b9.priv.cerit-sc.cz
    namespace: kube-system
    uid: ce051fc91f9e463593a1d45efa60be52
  runtimeHandler: ""
  state: SANDBOX_READY

root@kub-b9:~#
root@kub-b9:~# CONTAINER_RUNTIME_ENDPOINT="unix:///var/run/k3s/containerd/containerd.sock" crictl stop e505595e4af3c
E1005 20:28:29.770933 1411714 remote_runtime.go:349] "StopContainer from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"e505595e4af3c\": not found" containerID="e505595e4af3c"
FATA[0000] stopping the container "e505595e4af3c": rpc error: code = NotFound desc = an error occurred when try to find container "e505595e4af3c": not found
root@kub-b9:~#
root@kub-b9:~# CONTAINER_RUNTIME_ENDPOINT="unix:///var/run/k3s/containerd/containerd.sock" crictl rmp -f e505595e4af3c
Stopped sandbox e505595e4af3c
Removed sandbox e505595e4af3c
root@kub-b9:~#

however, after service rke2-agent restart:

root@kub-b9:~# CONTAINER_RUNTIME_ENDPOINT="unix:///var/run/k3s/containerd/containerd.sock" crictl pods -o yaml --label io.kubernetes.pod.name=kube-proxy-kub-b9.priv.cerit-sc.cz
items: []

root@kub-b9:~#
root@kub-b9:~# CONTAINER_RUNTIME_ENDPOINT="unix:///var/run/k3s/containerd/containerd.sock" crictl ps -a | grep kube-proxy
root@kub-b9:~#

I really do not know how to start kube-proxy again. I tried switching the cpu manager from static to none and removing the /var/lib/kubelet/cpu* and memory state files. Still the same.
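
For completeness, the sequence I tried was roughly the following (a sketch; paths assume the default kubelet root under /var/lib/kubelet):

systemctl stop rke2-agent
rm -f /var/lib/kubelet/cpu_manager_state /var/lib/kubelet/memory_manager_state
systemctl start rke2-agent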

kubelet.log.gz

brandond commented 1 year ago

@xhejtman that does not appear to be related. We don't have anything that specifically checks kube-proxy. Can you open a new issue?

mdrahman-suse commented 1 year ago

Hi @xhejtman, can you please provide the steps to replicate and validate the issue, for example server and agent node count, setup, and configuration details? I am trying to validate the issue and am having some trouble replicating it. TIA!

BTW I do see some info here: https://github.com/rancher/rke2/issues/4775#issue-1903865289 but having some additional steps and details would be really helpful

I tried with a single node (1 server, 1 agent) and HA (3 servers, 1 agent), but I am not able to replicate the issue. Also, I was seeing the error below with the shared command:

$ sudo /var/lib/rancher/rke2/bin/crictl ps -a
WARN[0000] runtime connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
WARN[0000] image connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
E1011 20:14:56.072186   20008 remote_runtime.go:390] "ListContainers with filter from runtime service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory\"" filter="&ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},}"
FATA[0000] listing containers: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory"
brandond commented 1 year ago

@mdrahman-suse did you forget to export CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock before using crictl? This is always necessary on RKE2, as it does not wrap the crictl command for you as K3s does.
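
If you would rather not export it every time, a persistent alternative (a sketch; crictl reads /etc/crictl.yaml by default) is:

cat <<'EOF' | sudo tee /etc/crictl.yaml
runtime-endpoint: unix:///run/k3s/containerd/containerd.sock
image-endpoint: unix:///run/k3s/containerd/containerd.sock
EOF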

mdrahman-suse commented 1 year ago

@mdrahman-suse did you forget to export CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock before using crictl? This is always necessary on RKE2, as it does not wrap the crictl command for you as K3s does.

Yes, that is my bad; I will try again with that exported.

VestigeJ commented 1 year ago

Just commenting that @mdrahman-suse and I were able to reproduce this with a single-node t3.xlarge. We're going to re-run another pass to be sure; it seems to be related to two particular config flags.

The behavior seemed to manifest after adding these two particular config flags to the node and restarting rke2-server:

$ get_figs

write-kubeconfig-mode: 644
debug: true
token: YOUR_TOKEN_HERE
disable:
  - rke2-canal
cni: multus,calico
profile: cis-1.23
selinux: true
kubelet-arg:
  - alsologtostderr=true
  - feature-gates=MemoryManager=true
  - kube-reserved=cpu=400m,memory=1Gi
  - system-reserved=cpu=400m,memory=1Gi
  - memory-manager-policy=Static
  - reserved-memory=0:memory=2Gi
  - kube-api-qps=100
  - kube-api-burst=100

$ sudo /var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///var/run/k3s/containerd/containerd.sock pods -o yaml --label component=etcd

items:
- annotations:
    etcd.k3s.io/initial: '{"initial-advertise-peer-urls":"https://172.31.26.250:2380","initial-cluster":"ip-172-31-26-250-559323ab=https://172.31.26.250:2380","initial-cluster-state":"new"}'
    kubernetes.io/config.hash: c1a33c0e30bc4a1e98a36d5caa84749f
    kubernetes.io/config.seen: "2023-10-19T19:10:09.642072011Z"
    kubernetes.io/config.source: file
  createdAt: "1697742610850752607"
  id: 4333d9d4bde9a2202db09df671a587e2ad0271a060cba5bddccdfad2a4130c8a
  labels:
    component: etcd
    io.kubernetes.pod.name: etcd-ip-172-31-26-250
    io.kubernetes.pod.namespace: kube-system
    io.kubernetes.pod.uid: c1a33c0e30bc4a1e98a36d5caa84749f
    tier: control-plane
  metadata:
    attempt: 1
    name: etcd-ip-172-31-26-250
    namespace: kube-system
    uid: c1a33c0e30bc4a1e98a36d5caa84749f
  runtimeHandler: ""
  state: SANDBOX_READY
- annotations:
    etcd.k3s.io/initial: '{"initial-advertise-peer-urls":"https://172.31.26.250:2380","initial-cluster":"ip-172-31-26-250-559323ab=https://172.31.26.250:2380","initial-cluster-state":"new"}'
    kubernetes.io/config.hash: c1a33c0e30bc4a1e98a36d5caa84749f
    kubernetes.io/config.seen: "2023-10-19T19:03:30.168309885Z"
    kubernetes.io/config.source: file
  createdAt: "1697742210688067912"
  id: d961341b9d664fa21de6d2db7a7c89f0f3c98f0c5d39c9a199eab6e6f0aa0d63
  labels:
    component: etcd
    io.kubernetes.pod.name: etcd-ip-172-31-26-250
    io.kubernetes.pod.namespace: kube-system
    io.kubernetes.pod.uid: c1a33c0e30bc4a1e98a36d5caa84749f
    tier: control-plane
  metadata:
    attempt: 0
    name: etcd-ip-172-31-26-250
    namespace: kube-system
    uid: c1a33c0e30bc4a1e98a36d5caa84749f
  runtimeHandler: ""
  state: SANDBOX_NOTREADY

$ sudo /var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///var/run/k3s/containerd/containerd.sock ps -a | grep -i etcd
d58e110ce8108       c6b7a4f2f79b2       4 minutes ago       Running             etcd                               1                   4333d9d4bde9a       etcd-ip-172-31-26-250
0c6d03d537906       c6b7a4f2f79b2       11 minutes ago      Exited              etcd                               0                   d961341b9d664       etcd-ip-172-31-26-250
brandond commented 1 year ago

Possibly the same root cause as https://github.com/rancher/rke2/issues/4930 as that seems to be triggered by the timing of the kubelet connecting to the apiserver. Changing the client rate-limiting thresholds would definitely affect that.

The defaults are --kube-api-qps=50 --kube-api-burst=100; I'm curious if you can reproduce by specifying only the qps.

mdrahman-suse commented 1 year ago

@brandond I was able to replicate the same with providing only the qps. Here is the config that I used

write-kubeconfig-mode: 644
debug: true
token: YOUR_TOKEN_HERE
disable:
  - rke2-canal
cni: multus,calico
profile: cis-1.23
selinux: true
kubelet-arg:
  - alsologtostderr=true
  - feature-gates=MemoryManager=true
  - kube-reserved=cpu=400m,memory=1Gi
  - system-reserved=cpu=400m,memory=1Gi
  - memory-manager-policy=Static
  - reserved-memory=0:memory=2Gi
  - kube-api-qps=100

Also, I replicated it just by rebooting the node; I did not even go through the upgrade. Is that expected?
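
For reference, the check I ran after each reboot was just (a sketch):

sudo reboot
# after the node is back up:
sudo /var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///var/run/k3s/containerd/containerd.sock ps -a | grep -i etcd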

brandond commented 1 year ago

Is it stuck there, or do you just see that there's an old exited pod?

If this is reproducible just by changing the QPS (and without all the other args) we should have an easy time raising this as an upstream issue.

mdrahman-suse commented 1 year ago

I just see that there is an old exited pod... also, when I upgraded after the reboot, I saw that the exited pod went away:

$ sudo /var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///var/run/k3s/containerd/containerd.sock ps -a | grep -i etcd
5f0736c2a2e31       c6b7a4f2f79b2       2 minutes ago        Running             etcd                               0                   8d51d0006908e       etcd-ip-172-31-31-249

And then when I rebooted again, I see:

$ sudo /var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///var/run/k3s/containerd/containerd.sock ps -a | grep -i etcd
2e04efd53661f       c6b7a4f2f79b2       5 minutes ago       Running             etcd                               1                   fe0c51ff1071f       etcd-ip-172-31-31-249
5f0736c2a2e31       c6b7a4f2f79b2       10 minutes ago      Exited              etcd                               0                   8d51d0006908e       etcd-ip-172-31-31-249

Trying a basic setup without the args to see if I can replicate the issue.

UPDATE: I am able to replicate it with the basic hardened setup without even providing the kubelet-args:

$ sudo /var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///var/run/k3s/containerd/containerd.sock ps -a | grep -i etcd
49ebf21d1a687       c6b7a4f2f79b2       About a minute ago   Running             etcd                               1                   b0ee856a760e9       etcd-ip-172-31-31-249
1b69a787ce13d       c6b7a4f2f79b2       12 minutes ago       Exited              etcd                               0                   53d943b6507de       etcd-ip-172-31-31-249
brandond commented 1 year ago

Seeing old exited pods in the crictl output is fine and normal. The problem is just when the kubelet fails to start a new pod.
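
A quick way to tell the two situations apart (a sketch; the kubelet log path assumes a default RKE2 install) is to confirm a Running etcd container exists alongside the Exited one, and to look for pod-start errors in the kubelet log:

sudo /var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///var/run/k3s/containerd/containerd.sock ps | grep -i etcd
sudo tail -n 200 /var/lib/rancher/rke2/agent/logs/kubelet.log | grep -iE 'etcd|failed'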

mdrahman-suse commented 1 year ago

Seeing old exited pods in the crictl output is fine and normal. The problem is just when the kubelet fails to start a new pod.

Yeah, I am not seeing that happen.

mdrahman-suse commented 1 year ago

Unable to replicate the issue; testing was done as part of the Oct 2023 Patch Validation and no regression was observed. If this or a similar issue resurfaces, it can be tracked separately or this issue can be reopened.