Closed: RH-csaggin closed this issue 1 week ago
This issue comes from the configuration of the Linux cgroup version on the nodes:
oc get nodes.config/cluster -oyaml | grep cgroup
cgroupMode: v2
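For context, this is the setting that triggered the rollout. It is normally changed by editing or patching the cluster-scoped nodes.config object, roughly like this (a sketch; the actual change applied on this cluster isn't shown in the thread):
$ oc patch nodes.config cluster --type merge -p '{"spec":{"cgroupMode":"v2"}}'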
This change forces the creation of a new MachineConfig, which is pending application because the node cannot be drained. The drain failure has two different causes:
Two pods are stuck in a terminating state:
I0925 08:06:40.260894 1 drain_controller.go:182] node wrk-0: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: [error when waiting for pod "logs-backing-store-noobaa-pod-29539f6f" in namespace "openshift-storage" to terminate: global timeout reached: 1m30s, error when waiting for pod "logsarchive-backing-store-noobaa-pod-c62b9b95" in namespace "openshift-storage" to terminate: global timeout reached: 1m30s, error when evicting pods/"logging-loki-ingester-0" -n "openshift-logging": global timeout reached: 1m30s]
A typical PodDisruptionBudget (PDB) preventing pods from being evicted (the PDBs involved can be listed as sketched below):
E0925 07:58:56.289621 1 drain_controller.go:152] error when evicting pods/"logging-loki-ingester-0" -n "openshift-logging" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0925 07:58:56.489347 1 drain_controller.go:152] error when evicting pods/"postgres-postgres-5zr7-0" -n "keycloak" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
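To see which PodDisruptionBudgets are blocking the evictions and how many disruptions they allow, something like this should work (namespaces taken from the errors above):
$ oc get pdb -n openshift-logging
$ oc get pdb -n keycloak
A PDB showing ALLOWED DISRUPTIONS of 0 means the eviction API will keep refusing to evict the matching pod, which is exactly the retry loop shown in the drain_controller logs.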
This is affecting the other pods running on wrk-0, which are forced to be evicted in a loop, leaving the cluster unstable. To fix this we need to force the node to be drained; since ODF is running external to the cluster, there is no expected impact:
$ oc adm drain <node> --delete-emptydir-data --ignore-daemonsets --force --disable-eviction
The forced drain will allow the node to reboot with the new rendered config, and the rollout will then move on to the next node. It is suggested to monitor this process until all the nodes are updated with the new configuration, since other nodes may need to be force-drained as well.
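A minimal way to watch the rollout, assuming cluster-admin access (the annotation name below is the standard MCO one, but worth double-checking on your version):
$ oc get mcp worker -w
$ oc get node wrk-0 -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}'
Once every worker reports the new rendered-worker config and the state annotation shows Done, the pool goes back to UPDATED=True and no further forced drains should be needed.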
status: wrk-0 stuck in NotReady
@computate @schwesig @RH-csaggin
Is wrk-0 currently up and running? I know it is showing NotReady in OpenShift, but is the machine itself up? We had an issue on another cluster where the machine itself was failing to boot.
It seems like the machine is unreachable, so I would assume it's not up. @computate @schwesig @RH-csaggin
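If the BMC for wrk-0 is reachable, one quick way to check whether the machine is actually powered on is the Redfish system resource; this is only a sketch, and the iDRAC address and credentials are placeholders:
$ curl -sk -u '<user>:<password>' https://<idrac-for-wrk-0>/redfish/v1/Systems/System.Embedded.1 | jq -r '.PowerState'
PowerState should come back as On or Off; anything else, or no answer at all, points at a hardware or BMC problem rather than just a kubelet one.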
[tsalawu@tsalawu-thinkpadx1nanogen2 mocesi]$ oc debug node/wrk-1 --as system:admin
Starting pod/wrk-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.30.9.21
If you don't see a command prompt, try pressing enter.
sh-4.4# ping 10.30.9.20 <--- internal ip for wrk-0
PING 10.30.9.20 (10.30.9.20) 56(84) bytes of data.
From 10.30.9.21 icmp_seq=1 Destination Host Unreachable
From 10.30.9.21 icmp_seq=2 Destination Host Unreachable
From 10.30.9.21 icmp_seq=3 Destination Host Unreachable
From 10.30.9.21 icmp_seq=4 Destination Host Unreachable
^C
--- 10.30.9.20 ping statistics ---
8 packets transmitted, 0 received, +4 errors, 100% packet loss, time 7147ms
Just for contrast, I am able to ping other nodes in the cluster:
[tsalawu@tsalawu-thinkpadx1nanogen2 mocesi]$ oc debug node/wrk-1 --as system:admin
Starting pod/wrk-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.30.9.21
If you don't see a command prompt, try pressing enter.
sh-4.4# ping 10.30.9.22 <---- internal ip for wrk-2
PING 10.30.9.22 (10.30.9.22) 56(84) bytes of data.
64 bytes from 10.30.9.22: icmp_seq=1 ttl=64 time=1.99 ms
64 bytes from 10.30.9.22: icmp_seq=2 ttl=64 time=0.157 ms
64 bytes from 10.30.9.22: icmp_seq=3 ttl=64 time=0.149 ms
64 bytes from 10.30.9.22: icmp_seq=4 ttl=64 time=0.111 ms
^C
--- 10.30.9.22 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3079ms
@tssala23 yes, still Not Ready
https://console-openshift-console.apps.obs.nerc.mghpcc.org/k8s/cluster/core~v1~Node
@schwesig I know; I'm saying the issue might be with the machine itself, e.g. this issue caused one of the nodes on my cluster to not boot. OpenShift just says NotReady, but that's not really that useful. I don't have access to the interface to look at the machines; Hakan, Naved, or Lars would be able to check the state of that node for you.
@tssala23 ah, ok, got it now. OK, thanks for the connection to the other issue
/CC @RH-csaggin
Node wrk-0 is experiencing memory errors:
$ curl -sSk -u '...' https://10.30.0.86/redfish/v1/Managers/iDRAC.Embedded.1/Logs/Sel |
jq -r '.Members[]|select(.Severity == "Critical")|[.Created,.Message]|@tsv' |
grep 2024-09
2024-09-25T09:32:08-05:00 Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
2024-09-25T09:32:08-05:00 Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
This prevents the node from booting; it is waiting at what is effectively a "press a key to continue" prompt.
On a system that experienced a similar problem yesterday, we performed a cold boot and the problem didn't crop up again, but I requested memory replacement for the node. I'll try the same thing here.
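For reference, the cold boot can be driven through the same iDRAC via the Redfish ComputerSystem.Reset action. This is a sketch using the address already shown above; credentials are elided, and the set of accepted ResetType values depends on the iDRAC firmware:
$ curl -sk -u '...' -X POST \
    -H 'Content-Type: application/json' \
    -d '{"ResetType": "PowerCycle"}' \
    https://10.30.0.86/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset
If PowerCycle isn't accepted, a ForceOff followed by On achieves the same effect.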
It looks as if the node has booted successfully after a cold boot.
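Until the DIMM is replaced, the health of the suspect module can be polled through the Redfish Memory collection; the member ID below is a guess based on the DIMM_B1 location reported in the SEL, so list /Memory first if it doesn't resolve:
$ curl -sSk -u '...' https://10.30.0.86/redfish/v1/Systems/System.Embedded.1/Memory/DIMM.Socket.B1 | jq '.Status'
A Status.Health flipping back to Critical would be an early warning that the errors are recurring.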
wrk-0 now reports Ready and wrk-1 is in the process of updating and rebooting.
wrk-0 up and running, thanks @larsks
@larsks I DIDN'T TOUCH THAT NODE!!! So I can't be the commonality between breaking nodes any more!!! The "Taj breaks nodes" theory has been debunked
but United still breaks guitars
oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-924acdcf5c6090a539156dc7ff78a6e0 True False False 3 3 3 0 281d
worker rendered-worker-9926a853c2141d58f4bca57e8aea6904 False True False 3 2 2 0 281d
I0925 15:43:22.376014 1 drain_controller.go:152] evicting pod openshift-logging/logging-loki-ingester-0
E0925 15:43:22.383871 1 drain_controller.go:152] error when evicting pods/"logging-loki-ingester-0" -n "openshift-logging" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
wrk-1 will not proceed unless we force the deletion of logging-loki-ingester-0 or force the drain.
Pod deleted:
oc delete pod logging-loki-ingester-0 -n openshift-logging --as system:admin
pod "logging-loki-ingester-0" deleted
wrk-1 rebooted and now the cluster is stable and updated:
oc get node
NAME STATUS ROLES AGE VERSION
ctl-0 Ready control-plane,master 281d v1.28.10+a2c84a5
ctl-1 Ready control-plane,master 281d v1.28.10+a2c84a5
ctl-2 Ready control-plane,master 281d v1.28.10+a2c84a5
wrk-0 Ready worker 281d v1.28.10+a2c84a5
wrk-1 Ready worker 281d v1.28.10+a2c84a5
wrk-2 Ready worker 281d v1.28.10+a2c84a5
❯ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-924acdcf5c6090a539156dc7ff78a6e0 True False False 3 3 3 0 281d
worker rendered-worker-dc3df60ed4593cf6bf70da00e5ea83e8 True False False 3 3 3 0 281d
I think we can close this? @RH-csaggin @computate The issue for the bad memory is tracked in 1394; is that covering it?
OK @schwesig we can close.