nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

bug: nerc-ocp-obs: Node wrk-0 Low Memory and Not Ready Status #761

Closed: schwesig closed this 2 weeks ago

schwesig commented 1 month ago

bug: nerc-ocp-obs: Node wrk-0 Low Memory and Not Ready Status

Node wrk-0 in the nerc-ocp-obs cluster has been experiencing low memory and is now in a "NotReady" state. There have been warnings from NERC OCP Messenger about this issue, and further investigation details are collected below.
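
For reference, a minimal sketch of how the memory pressure could be confirmed from the CLI (assuming cluster metrics are available and the kubelet still reports conditions; not necessarily the checks used here):

oc adm top node wrk-0
oc describe node wrk-0 | grep -A 7 -i conditions

A MemoryPressure=True condition or very high memory usage would line up with the NERC OCP Messenger warnings.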

Details

Action Items

  1. Reboot node wrk-0 and observe the outcome (see the sketch after this list).
  2. Investigate how this memory issue can be permanently resolved so it does not recur.
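
A minimal sketch for item 1, assuming standard OpenShift tooling; the drain and debug steps only work while the API and kubelet still respond, and the BMC address and credentials are placeholders:

# Cordon and drain so workloads reschedule elsewhere:
oc adm cordon wrk-0
oc adm drain wrk-0 --ignore-daemonsets --delete-emptydir-data --force

# If the kubelet still responds, reboot through a debug pod:
oc debug node/wrk-0 -- chroot /host systemctl reboot

# Otherwise, power-cycle the node out of band, e.g. via IPMI:
ipmitool -I lanplus -H <wrk-0-bmc> -U <user> -P <password> chassis power cycle

# Once the node reports Ready again:
oc adm uncordon wrk-0
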
schwesig commented 1 month ago

Thanks @RH-csaggin for collecting this information:

wrk-0 seems dead

oc get node
NAME    STATUS     ROLES                  AGE    VERSION
ctl-0   Ready      control-plane,master   289d   v1.28.10+a2c84a5
ctl-1   Ready      control-plane,master   289d   v1.28.10+a2c84a5
ctl-2   Ready      control-plane,master   289d   v1.28.10+a2c84a5
wrk-0   NotReady   worker                 289d   v1.28.10+a2c84a5
wrk-1   Ready      worker                 289d   v1.28.10+a2c84a5
wrk-2   Ready      worker                 289d   v1.28.10+a2c84a5

It is replying to ping

sh-5.1# ping 10.30.9.20 -c 3
PING 10.30.9.20 (10.30.9.20) 56(84) bytes of data.
64 bytes from 10.30.9.20: icmp_seq=1 ttl=64 time=2.26 ms
64 bytes from 10.30.9.20: icmp_seq=2 ttl=64 time=0.409 ms
64 bytes from 10.30.9.20: icmp_seq=3 ttl=64 time=0.202 ms

--- 10.30.9.20 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2047ms
rtt min/avg/max/mdev = 0.202/0.955/2.256/0.923 ms

BUT ssh gets stuck

sh-5.1# ssh -v core@10.30.9.20
OpenSSH_8.7p1, OpenSSL 3.0.7 1 Nov 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to 10.30.9.20 [10.30.9.20] port 22.
debug1: Connection established.
debug1: identity file /root/.ssh/id_rsa type -1
debug1: identity file /root/.ssh/id_rsa-cert type -1
debug1: identity file /root/.ssh/id_dsa type -1
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa_sk type -1
debug1: identity file /root/.ssh/id_ecdsa_sk-cert type -1
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: identity file /root/.ssh/id_ed25519_sk type -1
debug1: identity file /root/.ssh/id_ed25519_sk-cert type -1
debug1: identity file /root/.ssh/id_xmss type -1
debug1: identity file /root/.ssh/id_xmss-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_8.7
^C
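
Since SSH hangs, one way to read what the node last reported is through the cluster API (a sketch; the conditions may be stale while the kubelet is down):

oc get node wrk-0 -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'
oc get events -A --field-selector involvedObject.kind=Node,involvedObject.name=wrk-0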
schwesig commented 1 month ago

/CC @computate @schwesig @RH-csaggin @larsks

schwesig commented 1 month ago

wrk-2 also got into an error state

Timeline of Findings for Nodes wrk-0 and wrk-2 in the nerc-ocp-obs Cluster

2024-10-03

2024-10-04

2024-10-05

2024-10-06

schwesig commented 1 month ago
Oct 06 00:10:51 wrk-2 kubenswrapper[4729]: I1006 00:10:51.446108    4729 kubelet_node_status.go:718] "Recording event message for node" node="wrk-2" event="NodeNotReady"
Oct 06 00:11:01 wrk-2 kubenswrapper[4729]: I1006 00:11:01.814895    4729 kubelet_node_status.go:718] "Recording event message for node" node="wrk-2" event="NodeReady"
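
For reference, flap events like these can usually be pulled without SSH through the node-logs API (a sketch, assuming the node is reachable again from the kubelet side):

oc adm node-logs wrk-2 -u kubelet | grep -E 'NodeNotReady|NodeReady'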
schwesig commented 1 month ago

From Slack, Cristiano:

I think this is the flap you got warned about by the monitoring, but I see some things which don't sound right:

  1. Before this flap the logs show a huge number of delays getting volume information (high CPU??), but after the flap it stops:
    Oct 06 00:10:51 wrk-2 kubenswrapper[4729]: I1005 23:51:28.180184
    4729 fsHandler.go:133] fs: disk usage and inodes count on following dirs took 3h40m4.255331138s: 
    [/var/lib/containers/storage/overlay/e9fb10280eda113cc48018896e02ace74a14571af6b644c6f042f6b7c00aa70b/diff ];
    will not log again for this container unless duration exceeds 2s
  2. "PLEG is not healthy" logs, which usually relate to the kubelet not being able to ask CRI-O for the status of the running containers (see the sketch after this list).
    Oct 06 00:10:51 wrk-2 kubenswrapper[4729]: E1006 00:10:48.824690
    4729 kubelet.go:2341] "Skipping pod synchronization" err="[container runtime is down,
    PLEG is not healthy: pleg was last seen active 4h23m10.34855386s ago; threshold is 3m0s]"
  3. Coredump of haproxy, which might be a consequence of the node status but is one more clue of instability:

    Oct 06 00:10:53 wrk-2 systemd-coredump[2009246]:
    Process 1978213 (haproxy) of user 1000620000 dumped core.
    
                                                 Stack trace of thread 120013:
                                                 #0  0x00007fe1a0a6aa9f n/a (/usr/lib64/libc-2.28.so + 0x4ea9f)
                                                 ELF object binary architecture: AMD x86-64
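
A minimal sketch of follow-up checks for points 2 and 3, assuming the node accepts a debug pod again after the flap:

# Does CRI-O answer at all? PLEG health depends on the kubelet reaching it.
oc debug node/wrk-2 -- chroot /host sh -c 'systemctl is-active crio && crictl ps'

# Any recorded coredumps (e.g. the haproxy one) in the node journal?
oc debug node/wrk-2 -- chroot /host coredumpctl list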
schwesig commented 1 month ago

More info about the "PLEG is not healthy" condition: https://access.redhat.com/articles/4528671

schwesig commented 2 weeks ago

Couldn't reproduce it anymore. Closing it. Reopen if needed.