Thanks @RH-csaggin for collecting this information:
wrk-0 seems dead
oc get node
NAME    STATUS     ROLES                  AGE    VERSION
ctl-0   Ready      control-plane,master   289d   v1.28.10+a2c84a5
ctl-1   Ready      control-plane,master   289d   v1.28.10+a2c84a5
ctl-2   Ready      control-plane,master   289d   v1.28.10+a2c84a5
wrk-0   NotReady   worker                 289d   v1.28.10+a2c84a5
wrk-1   Ready      worker                 289d   v1.28.10+a2c84a5
wrk-2   Ready      worker                 289d   v1.28.10+a2c84a5
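To see which kubelet condition is actually failing on wrk-0, something like this should work (sketch only, assuming cluster-admin access):

oc describe node wrk-0 | grep -A 10 Conditions
oc get node wrk-0 -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'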
It is replying to ping
sh-5.1# ping 10.30.9.20 -c 3
PING 10.30.9.20 (10.30.9.20) 56(84) bytes of data.
64 bytes from 10.30.9.20: icmp_seq=1 ttl=64 time=2.26 ms
64 bytes from 10.30.9.20: icmp_seq=2 ttl=64 time=0.409 ms
64 bytes from 10.30.9.20: icmp_seq=3 ttl=64 time=0.202 ms
--- 10.30.9.20 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2047ms
rtt min/avg/max/mdev = 0.202/0.955/2.256/0.923 ms
BUT ssh gets stuck
sh-5.1# ssh -v core@10.30.9.20
OpenSSH_8.7p1, OpenSSL 3.0.7 1 Nov 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to 10.30.9.20 [10.30.9.20] port 22.
debug1: Connection established.
debug1: identity file /root/.ssh/id_rsa type -1
debug1: identity file /root/.ssh/id_rsa-cert type -1
debug1: identity file /root/.ssh/id_dsa type -1
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa_sk type -1
debug1: identity file /root/.ssh/id_ecdsa_sk-cert type -1
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: identity file /root/.ssh/id_ed25519_sk type -1
debug1: identity file /root/.ssh/id_ed25519_sk-cert type -1
debug1: identity file /root/.ssh/id_xmss type -1
debug1: identity file /root/.ssh/id_xmss-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_8.7
^C
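The debug output shows the TCP connection being established and the client sending its own version string, but the server's banner never arrives. A quick way to confirm that sshd on the node is not answering at all (sketch; nc may not be present in every debug image):

timeout 5 nc 10.30.9.20 22
# a responsive host would print its SSH banner (e.g. SSH-2.0-OpenSSH_...) almost immediately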
/CC @computate @schwesig @RH-csaggin @larsks
wrk-2 also got into an error state
wrk-0 and wrk-2 in nerc-ocp-obs Cluster:
- wrk-0 low on memory
- wrk-0 not in ready status
- oc get node: wrk-0 still NotReady, others Ready
- wrk-0 responded to ping, SSH stuck
- wrk-0 and potential memory DIMM replacement?
- wrk-2 not in ready status
- wrk-2 briefly entered "NodeNotReady" state
- wrk-2 returned to "NodeReady" status after 10 seconds
- wrk-2 status: significant delays in volume info retrieval (3h40m), possibly due to high CPU usage
- haproxy process coredump, suggesting instability possibly due to resource exhaustion
Oct 06 00:10:51 wrk-2 kubenswrapper[4729]: I1006 00:10:51.446108 4729 kubelet_node_status.go:718] "Recording event message for node" node="wrk-2" event="NodeNotReady"
Oct 06 00:11:01 wrk-2 kubenswrapper[4729]: I1006 00:11:01.814895 4729 kubelet_node_status.go:718] "Recording event message for node" node="wrk-2" event="NodeReady"
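For reference, the node-level events and the kubelet journal can be pulled without SSH (sketch, assuming cluster-admin access):

oc get events -A --field-selector involvedObject.kind=Node,involvedObject.name=wrk-2
oc adm node-logs wrk-2 -u kubelet | grep -E 'NodeNotReady|NodeReady|PLEG'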
From Slack, Cristiano:
I think this is the flap the monitoring warned you about, but I see some things that don't look right:
Oct 06 00:10:51 wrk-2 kubenswrapper[4729]: I1005 23:51:28.180184
4729 fsHandler.go:133] fs: disk usage and inodes count on following dirs took 3h40m4.255331138s:
[/var/lib/containers/storage/overlay/e9fb10280eda113cc48018896e02ace74a14571af6b644c6f042f6b7c00aa70b/diff ];
will not log again for this container unless duration exceeds 2s
Oct 06 00:10:51 wrk-2 kubenswrapper[4729]: E1006 00:10:48.824690
4729 kubelet.go:2341] "Skipping pod synchronization" err="[container runtime is down,
PLEG is not healthy: pleg was last seen active 4h23m10.34855386s ago; threshold is 3m0s]"
Coredump of haproxy; it might be a consequence of the node status, but it is one more clue of instability:
Oct 06 00:10:53 wrk-2 systemd-coredump[2009246]:
Process 1978213 (haproxy) of user 1000620000 dumped core.
Stack trace of thread 120013:
#0 0x00007fe1a0a6aa9f n/a (/usr/lib64/libc-2.28.so + 0x4ea9f)
ELF object binary architecture: AMD x86-64
more info about PLEG not healthy: https://access.redhat.com/articles/4528671
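A PLEG that was last active more than 4 hours ago usually means the container runtime (CRI-O) has stopped responding, and the haproxy dump can be inspected with the systemd-coredump tooling. If a debug shell on the node still works, a rough check could look like this (sketch only; the PID comes from the coredump message above):

oc debug node/wrk-2
chroot /host
systemctl status crio
crictl ps | head          # does the runtime answer at all?
coredumpctl list haproxy
coredumpctl info 1978213  # the haproxy process from the coredump message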
Couldn't recreate it anymore. Closing it. Reopen if needed.
bug: nerc-ocp-obs: Node wrk-0 Low Memory and Not Ready Status
Node wrk-0 in the nerc-ocp-obs cluster has been experiencing low memory and is now in a "NotReady" state. There have been warnings from NERC OCP Messenger about this issue, and further investigation has shown the following.

Details

Warnings from NERC OCP Messenger:
- wrk-0 node low on memory.
- wrk-0 node not in ready status.

Current Node Status (reported by Cristiano Saggin, 5:09 AM, 2024-10-04):
- Ping Test Results (5:15 AM): Node wrk-0 is replying to pings, with no packet loss. However, SSH attempts get stuck during the connection establishment phase.

Next Steps:
- … wrk-0.

Action Items
- … wrk-0 and look at the outcome.
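Once wrk-0 is reachable again, the low-memory warning can be cross-checked from the cluster side (sketch; oc adm top needs the metrics API to be available):

oc adm top node wrk-0
oc get node wrk-0 -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}'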