nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

bug: nerc-ocp-obs: Node wrk-0 Low Memory and Not Ready Status #761

Closed: schwesig closed this 2 weeks ago

schwesig commented 1 month ago

bug: nerc-ocp-obs: Node wrk-0 Low Memory and Not Ready Status

Node wrk-0 in the nerc-ocp-obs cluster has been experiencing low memory and is now in a "NotReady" state. There have been warnings from NERC OCP Messenger about this issue, and further investigation details are collected below.
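
For reference, a minimal sketch of how the memory pressure could be confirmed from the CLI (assuming cluster metrics are available and the kubelet still reports conditions; not necessarily the checks used here):

oc adm top node wrk-0
oc describe node wrk-0 | grep -A 7 -i conditions

A MemoryPressure=True condition or very high memory usage would line up with the NERC OCP Messenger warnings.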

Details

Action Items

  1. Reboot node wrk-0 and observe the outcome (see the sketch after this list).
  2. Investigate how this memory issue can be permanently resolved so it does not recur.
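
A minimal sketch for item 1, assuming standard OpenShift tooling; the drain and debug steps only work while the API and kubelet still respond, and the BMC address and credentials are placeholders:

# Cordon and drain so workloads reschedule elsewhere:
oc adm cordon wrk-0
oc adm drain wrk-0 --ignore-daemonsets --delete-emptydir-data --force

# If the kubelet still responds, reboot through a debug pod:
oc debug node/wrk-0 -- chroot /host systemctl reboot

# Otherwise, power-cycle the node out of band, e.g. via IPMI:
ipmitool -I lanplus -H <wrk-0-bmc> -U <user> -P <password> chassis power cycle

# Once the node reports Ready again:
oc adm uncordon wrk-0
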
schwesig commented 1 month ago

Thanks @RH-csaggin for collecting this information:

wrk-0 seems dead

oc get node
NAME    STATUS     ROLES                  AGE    VERSION
ctl-0   Ready      control-plane,master   289d   v1.28.10+a2c84a5
ctl-1   Ready      control-plane,master   289d   v1.28.10+a2c84a5
ctl-2   Ready      control-plane,master   289d   v1.28.10+a2c84a5
wrk-0   NotReady   worker                 289d   v1.28.10+a2c84a5
wrk-1   Ready      worker                 289d   v1.28.10+a2c84a5
wrk-2   Ready      worker                 289d   v1.28.10+a2c84a5

It is replying to ping

sh-5.1# ping 10.30.9.20 -c 3
PING 10.30.9.20 (10.30.9.20) 56(84) bytes of data.
64 bytes from 10.30.9.20: icmp_seq=1 ttl=64 time=2.26 ms
64 bytes from 10.30.9.20: icmp_seq=2 ttl=64 time=0.409 ms
64 bytes from 10.30.9.20: icmp_seq=3 ttl=64 time=0.202 ms

--- 10.30.9.20 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2047ms
rtt min/avg/max/mdev = 0.202/0.955/2.256/0.923 ms

BUT ssh gets stuck

sh-5.1# ssh -v core@10.30.9.20
OpenSSH_8.7p1, OpenSSL 3.0.7 1 Nov 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to 10.30.9.20 [10.30.9.20] port 22.
debug1: Connection established.
debug1: identity file /root/.ssh/id_rsa type -1
debug1: identity file /root/.ssh/id_rsa-cert type -1
debug1: identity file /root/.ssh/id_dsa type -1
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa_sk type -1
debug1: identity file /root/.ssh/id_ecdsa_sk-cert type -1
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: identity file /root/.ssh/id_ed25519_sk type -1
debug1: identity file /root/.ssh/id_ed25519_sk-cert type -1
debug1: identity file /root/.ssh/id_xmss type -1
debug1: identity file /root/.ssh/id_xmss-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_8.7
^C
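
Since SSH hangs, one way to read what the node last reported is through the cluster API (a sketch; the conditions may be stale while the kubelet is down):

oc get node wrk-0 -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'
oc get events -A --field-selector involvedObject.kind=Node,involvedObject.name=wrk-0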
schwesig commented 1 month ago

/CC @computate @schwesig @RH-csaggin @larsks

schwesig commented 1 month ago

wrk-2 also got into an error state

Timeline of Findings for Nodes wrk-0 and wrk-2 in the nerc-ocp-obs Cluster

2024-10-03

2024-10-04

2024-10-05

2024-10-06

schwesig commented 1 month ago
Oct 06 00:10:51 wrk-2 kubenswrapper[4729]: I1006 00:10:51.446108    4729 kubelet_node_status.go:718] "Recording event message for node" node="wrk-2" event="NodeNotReady"
Oct 06 00:11:01 wrk-2 kubenswrapper[4729]: I1006 00:11:01.814895    4729 kubelet_node_status.go:718] "Recording event message for node" node="wrk-2" event="NodeReady"
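
For reference, flap events like these can usually be pulled without SSH through the node-logs API (a sketch, assuming the node is reachable again from the kubelet side):

oc adm node-logs wrk-2 -u kubelet | grep -E 'NodeNotReady|NodeReady'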
schwesig commented 1 month ago

From Slack, Cristiano:

I think this is the flap you got warned about by the monitoring, but I see some things which don't sound right:

  1. Before this flap the logs show a huge number of delays getting volume information (high CPU??), but after the flap it stops:
    Oct 06 00:10:51 wrk-2 kubenswrapper[4729]: I1005 23:51:28.180184
    4729 fsHandler.go:133] fs: disk usage and inodes count on following dirs took 3h40m4.255331138s: 
    [/var/lib/containers/storage/overlay/e9fb10280eda113cc48018896e02ace74a14571af6b644c6f042f6b7c00aa70b/diff ];
    will not log again for this container unless duration exceeds 2s
  2. "PLEG is not healthy" logs, which usually relate to the kubelet not being able to ask CRI-O for the status of the running containers (see the sketch after this list).
    Oct 06 00:10:51 wrk-2 kubenswrapper[4729]: E1006 00:10:48.824690
    4729 kubelet.go:2341] "Skipping pod synchronization" err="[container runtime is down,
    PLEG is not healthy: pleg was last seen active 4h23m10.34855386s ago; threshold is 3m0s]"
  3. Coredump of haproxy, which might be a consequence of the node status but is one more clue of instability:

    Oct 06 00:10:53 wrk-2 systemd-coredump[2009246]:
    Process 1978213 (haproxy) of user 1000620000 dumped core.
    
                                                 Stack trace of thread 120013:
                                                 #0  0x00007fe1a0a6aa9f n/a (/usr/lib64/libc-2.28.so + 0x4ea9f)
                                                 ELF object binary architecture: AMD x86-64
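
A minimal sketch of follow-up checks for points 2 and 3, assuming the node accepts a debug pod again after the flap:

# Does CRI-O answer at all? PLEG health depends on the kubelet reaching it.
oc debug node/wrk-2 -- chroot /host sh -c 'systemctl is-active crio && crictl ps'

# Any recorded coredumps (e.g. the haproxy one) in the node journal?
oc debug node/wrk-2 -- chroot /host coredumpctl list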
schwesig commented 1 month ago

More info about the "PLEG is not healthy" condition: https://access.redhat.com/articles/4528671

schwesig commented 2 weeks ago

Couldn't reproduce it anymore. Closing it. Reopen if needed.