nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

bug: nerc-ocp-obs: Node wrk-0 Low Memory and Not Ready Status #761

Open schwesig opened 2 days ago

schwesig commented 2 days ago


Node wrk-0 in the nerc-ocp-obs cluster has been running low on memory and is now in a "NotReady" state. NERC OCP Messenger has sent warnings about this issue, and further investigations have shown that

Details

Action Items

  1. Reboot node wrk-0 and check the outcome.
  2. Investigate how this memory issue can be permanently resolved so that it does not recur.

schwesig commented 2 days ago

Thanks @RH-csaggin for collecting this information:

wrk-0 seems dead

oc get node
NAME    STATUS     ROLES                  AGE    VERSION
ctl-0   Ready      control-plane,master   289d   v1.28.10+a2c84a5
ctl-1   Ready      control-plane,master   289d   v1.28.10+a2c84a5
ctl-2   Ready      control-plane,master   289d   v1.28.10+a2c84a5
wrk-0   NotReady   worker                 289d   v1.28.10+a2c84a5
wrk-1   Ready      worker                 289d   v1.28.10+a2c84a5
wrk-2   Ready      worker                 289d   v1.28.10+a2c84a5
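
For repeated checks, the table above can be filtered automatically. The following is a minimal sketch, assuming the plain-text column layout shown in this comment; the function name `not_ready_nodes` and the `SAMPLE` constant are hypothetical and not part of any NERC tooling.

```python
from typing import List

# Output of `oc get node` as pasted in this issue.
SAMPLE = """\
NAME    STATUS     ROLES                  AGE    VERSION
ctl-0   Ready      control-plane,master   289d   v1.28.10+a2c84a5
ctl-1   Ready      control-plane,master   289d   v1.28.10+a2c84a5
ctl-2   Ready      control-plane,master   289d   v1.28.10+a2c84a5
wrk-0   NotReady   worker                 289d   v1.28.10+a2c84a5
wrk-1   Ready      worker                 289d   v1.28.10+a2c84a5
wrk-2   Ready      worker                 289d   v1.28.10+a2c84a5
"""


def not_ready_nodes(table: str) -> List[str]:
    """Return the names of nodes whose STATUS column does not include 'Ready'."""
    lines = [ln for ln in table.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    result = []
    for ln in lines[1:]:
        fields = ln.split()
        # STATUS can be a comma-separated list such as
        # "Ready,SchedulingDisabled", so compare each component exactly.
        if "Ready" not in fields[status_col].split(","):
            result.append(fields[name_col])
    return result
```

Calling `not_ready_nodes(SAMPLE)` on the table from this issue returns only `wrk-0`.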

It replies to ping

sh-5.1# ping 10.30.9.20 -c 3
PING 10.30.9.20 (10.30.9.20) 56(84) bytes of data.
64 bytes from 10.30.9.20: icmp_seq=1 ttl=64 time=2.26 ms
64 bytes from 10.30.9.20: icmp_seq=2 ttl=64 time=0.409 ms
64 bytes from 10.30.9.20: icmp_seq=3 ttl=64 time=0.202 ms

--- 10.30.9.20 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2047ms
rtt min/avg/max/mdev = 0.202/0.955/2.256/0.923 ms

But ssh gets stuck

sh-5.1# ssh -v core@10.30.9.20
OpenSSH_8.7p1, OpenSSL 3.0.7 1 Nov 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to 10.30.9.20 [10.30.9.20] port 22.
debug1: Connection established.
debug1: identity file /root/.ssh/id_rsa type -1
debug1: identity file /root/.ssh/id_rsa-cert type -1
debug1: identity file /root/.ssh/id_dsa type -1
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa_sk type -1
debug1: identity file /root/.ssh/id_ecdsa_sk-cert type -1
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: identity file /root/.ssh/id_ed25519_sk type -1
debug1: identity file /root/.ssh/id_ed25519_sk-cert type -1
debug1: identity file /root/.ssh/id_xmss type -1
debug1: identity file /root/.ssh/id_xmss-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_8.7
^C
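
The log above stops right after the client sends its own version string, which is the point where it waits for the server's `SSH-` identification line. One possible reading, not confirmed in this issue, is that the kernel on wrk-0 still accepts TCP connections while sshd is no longer responding, which would be consistent with a node starved of memory. The sketch below separates these cases without a manual `^C`; `probe_ssh_banner` and its return strings are hypothetical names chosen for illustration.

```python
import socket


def probe_ssh_banner(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Connect to host:port and wait up to `timeout` seconds for an SSH banner.

    Returns the banner line if one arrives, otherwise one of:
    "unreachable", "connected-no-banner", "closed-by-peer", "unexpected-reply".
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, timed out, or no route: the TCP handshake never completed.
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(256)
        except socket.timeout:
            # Handshake completed but the server stayed silent, as in the log above.
            return "connected-no-banner"
        except OSError:
            return "unreachable"
    if not data:
        return "closed-by-peer"
    if data.startswith(b"SSH-"):
        return data.decode("ascii", "replace").strip()
    return "unexpected-reply"
```

Running `probe_ssh_banner("10.30.9.20")` from the same debug shell would be expected to return `connected-no-banner` if the behaviour in the log persists, and a line starting with `SSH-2.0-` once sshd on the node responds again.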

schwesig commented 2 days ago

/CC @computate @schwesig @RH-csaggin @larsks