nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

obs cluster in degraded state due to worker not able to be drained to apply machineconfig #746

Closed RH-csaggin closed 1 week ago

RH-csaggin commented 1 month ago
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config                             4.15.16   True        False         True       28d     Failed to resync 4.15.16 because: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 0)]]
oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-924acdcf5c6090a539156dc7ff78a6e0   True      False      False      3              3                   3                     0                      280d
worker   rendered-worker-9926a853c2141d58f4bca57e8aea6904   False     True       True       3              0                   0                     1                      280d
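The worker pool's NodeDegraded condition explains why (the conditions below are presumably from oc get mcp worker -o yaml):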
  - lastTransitionTime: "2024-08-26T19:19:06Z"
    message: 'Node wrk-0 is reporting: "failed to drain node: wrk-0 after 1 hour.
      Please see machine-config-controller logs for more information"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
RH-csaggin commented 1 month ago

This issue comes from the configuration of the Linux cgroup version on the nodes:

oc get nodes.config/cluster -oyaml | grep cgroup
  cgroupMode: v2
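For context, the cgroup mode is controlled by the cluster-scoped nodes.config object; a change like this would normally be applied with something along these lines (illustrative, and the exact spec may vary by OpenShift version):

$ oc patch nodes.config cluster --type merge -p '{"spec":{"cgroupMode":"v2"}}'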

This change forces the creation of a new MachineConfig, which is pending to be applied because the node cannot be drained. The drain failure has two different causes:

  1. Two pods are stuck in a terminating state:

    I0925 08:06:40.260894       1 drain_controller.go:182] node wrk-0: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: [error when waiting for pod "logs-backing-store-noobaa-pod-29539f6f" in namespace "openshift-storage" to terminate: global timeout reached: 1m30s, error when waiting for pod "logsarchive-backing-store-noobaa-pod-c62b9b95" in namespace "openshift-storage" to terminate: global timeout reached: 1m30s, error when evicting pods/"logging-loki-ingester-0" -n "openshift-logging": global timeout reached: 1m30s]
  2. PodDisruptionBudgets (PDBs) that prevent pods from being evicted (see the checks after this list):

    E0925 07:58:56.289621       1 drain_controller.go:152] error when evicting pods/"logging-loki-ingester-0" -n "openshift-logging" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    E0925 07:58:56.489347       1 drain_controller.go:152] error when evicting pods/"postgres-postgres-5zr7-0" -n "keycloak" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
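To confirm what is blocking the drain, the stuck pods and the PDBs in the affected namespaces can be inspected with standard commands (suggested checks, not taken from the original session):

$ oc get pods -n openshift-storage | grep Terminating
$ oc get pdb -n openshift-logging
$ oc get pdb -n keycloak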

This is affecting the other pods running on wrk-0, which are forced into an eviction loop and leave the cluster unstable. To fix this we need to force the node to be drained; since ODF is running externally to the cluster, there is no expected impact.

$ oc adm drain <node> --delete-emptydir-data --ignore-daemonsets --force --disable-eviction
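Note that --disable-eviction makes the drain delete pods directly instead of going through the eviction API, so PodDisruptionBudgets no longer block it, and --force also removes pods that are not managed by a controller; this is why it clears the two conditions above.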

The forced drain will allow the node to reboot with the new rendered config, after which the rollout moves on to the next node. It is suggested to monitor this process until all the nodes are updated with the new configuration, since other nodes may also need to be force-drained; a simple way to watch the rollout is sketched below.
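One possible way to follow the rollout until both pools report updated (standard oc commands, not from the original session):

$ watch oc get mcp
$ oc get nodes -w
$ oc get co machine-config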

schwesig commented 1 month ago

status: wrk-0 stuck in NotReady @computate @schwesig @RH-csaggin

tssala23 commented 1 month ago

Is wrk-0 currently up and running? I know it is showing NotReady in OpenShift, but is the machine itself up? We had an issue on another cluster where the machine itself was failing to boot. It seems like the machine is unreachable, so I would assume it's not up. @computate @schwesig @RH-csaggin

[tsalawu@tsalawu-thinkpadx1nanogen2 mocesi]$ oc debug node/wrk-1 --as system:admin
Starting pod/wrk-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.30.9.21
If you don't see a command prompt, try pressing enter.
sh-4.4# ping 10.30.9.20 <--- internal ip for wrk-0
PING 10.30.9.20 (10.30.9.20) 56(84) bytes of data.
From 10.30.9.21 icmp_seq=1 Destination Host Unreachable
From 10.30.9.21 icmp_seq=2 Destination Host Unreachable
From 10.30.9.21 icmp_seq=3 Destination Host Unreachable
From 10.30.9.21 icmp_seq=4 Destination Host Unreachable
^C
--- 10.30.9.20 ping statistics ---
8 packets transmitted, 0 received, +4 errors, 100% packet loss, time 7147ms
tssala23 commented 1 month ago

Just for contrast I am able to ping other nodes in the cluster:

[tsalawu@tsalawu-thinkpadx1nanogen2 mocesi]$ oc debug node/wrk-1 --as system:admin
Starting pod/wrk-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.30.9.21
If you don't see a command prompt, try pressing enter.
sh-4.4# ping 10.30.9.22 <---- internal ip for wrk-2
PING 10.30.9.22 (10.30.9.22) 56(84) bytes of data.
64 bytes from 10.30.9.22: icmp_seq=1 ttl=64 time=1.99 ms
64 bytes from 10.30.9.22: icmp_seq=2 ttl=64 time=0.157 ms
64 bytes from 10.30.9.22: icmp_seq=3 ttl=64 time=0.149 ms
64 bytes from 10.30.9.22: icmp_seq=4 ttl=64 time=0.111 ms
^C
--- 10.30.9.22 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3079ms
schwesig commented 1 month ago

@tssala23 yes, still NotReady https://console-openshift-console.apps.obs.nerc.mghpcc.org/k8s/cluster/core~v1~Node (screenshot attached)

tssala23 commented 1 month ago

@schwesig I know. I'm saying the issue might be with the machine itself, e.g. this issue caused one of the nodes on my cluster to not boot. OpenShift just says NotReady, but that's not really that useful. I don't have access to the interface to look at the machines; Hakan, Naved, or Lars would be able to check the state of that node for you.

schwesig commented 1 month ago

@tssala23 ah, ok, got it now. OK, thanks for the connection to the other issue

/CC @RH-csaggin

larsks commented 1 month ago

Node wrk-0 is experiencing memory errors:

$ curl -sSk -u '...' https://10.30.0.86/redfish/v1/Managers/iDRAC.Embedded.1/Logs/Sel |
  jq -r '.Members[]|select(.Severity == "Critical")|[.Created,.Message]|@tsv' |
  grep 2024-09

2024-09-25T09:32:08-05:00       Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
2024-09-25T09:32:08-05:00       Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.

This prevents the node from booting; it is waiting at what is effectively a "press a key to continue" prompt:

(screenshot: node halted at the boot error prompt)

On a system that experienced a similar problem yesterday, we performed a cold boot and the problem didn't crop up again, but I requested memory replacement for the node. I'll try the same thing here.
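For reference, a cold boot of this kind can be driven through the same Redfish API queried above; assuming the standard Dell system path System.Embedded.1, it would look roughly like this (illustrative only, not the exact commands that were run):

$ curl -sSk -u '...' -X POST -H 'Content-Type: application/json' \
    https://10.30.0.86/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset \
    -d '{"ResetType": "ForceOff"}'
  # wait for the power state to report Off, then power the node back on
$ curl -sSk -u '...' -X POST -H 'Content-Type: application/json' \
    https://10.30.0.86/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset \
    -d '{"ResetType": "On"}'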

larsks commented 1 month ago

It looks as if the node has booted successfully after a cold boot.

larsks commented 1 month ago

wrk-0 now reports ready and wrk-1 is in the process of updating and rebooting.

schwesig commented 1 month ago

wrk-0 up and running, thanks @larsks (screenshot attached)

tssala23 commented 1 month ago

@larsks I DIDN'T TOUCH THAT NODE!!! So I can't be the commonality between breaking nodes any more!!! The "Taj breaks nodes" theory has been debunked.

schwesig commented 1 month ago

but United still breaks guitars

RH-csaggin commented 1 month ago
oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-924acdcf5c6090a539156dc7ff78a6e0   True      False      False      3              3                   3                     0                      281d
worker   rendered-worker-9926a853c2141d58f4bca57e8aea6904   False     True       False      3              2                   2                     0                      281d
I0925 15:43:22.376014       1 drain_controller.go:152] evicting pod openshift-logging/logging-loki-ingester-0
E0925 15:43:22.383871       1 drain_controller.go:152] error when evicting pods/"logging-loki-ingester-0" -n "openshift-logging" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

wrk-1 will not proceed if we do not force the deletion of logging-loki-ingester-0 or force the drain.

RH-csaggin commented 1 month ago

Pod deleted:

oc delete pod logging-loki-ingester-0 -n openshift-logging --as system:admin
pod "logging-loki-ingester-0" deleted

wrk-1 rebooted and now the cluster is stable and updated:

oc get node
NAME    STATUS   ROLES                  AGE    VERSION
ctl-0   Ready    control-plane,master   281d   v1.28.10+a2c84a5
ctl-1   Ready    control-plane,master   281d   v1.28.10+a2c84a5
ctl-2   Ready    control-plane,master   281d   v1.28.10+a2c84a5
wrk-0   Ready    worker                 281d   v1.28.10+a2c84a5
wrk-1   Ready    worker                 281d   v1.28.10+a2c84a5
wrk-2   Ready    worker                 281d   v1.28.10+a2c84a5

❯ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-924acdcf5c6090a539156dc7ff78a6e0   True      False      False      3              3                   3                     0                      281d
worker   rendered-worker-dc3df60ed4593cf6bf70da00e5ea83e8   True      False      False      3              3                   3                     0                      281d
schwesig commented 2 weeks ago

I think we can close this? @RH-csaggin @computate The bad memory issue is tracked in 1394; is that covering it?

computate commented 1 week ago

OK @schwesig we can close.