noi-techpark / infrastructure-v2

Open Data Hub Infrastructure v2 Repository

Simulate CPU and Memory load and try to provoke an outage #46

Closed clezag closed 1 year ago

clezag commented 1 year ago

Trying to replicate the mysterious outages we had with the cluster. At some point nodes kept failing in a somewhat random manner, entering a cycle of new nodes being recreated and failing again. This presents as a node becoming "NodeNotReady", with the error "Kubelet stopped posting status".
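For reference, the node status and the underlying condition can be inspected with standard kubectl commands (the node name below is a placeholder):

# List nodes and their Ready/NotReady status
kubectl get nodes

# Inspect the node's conditions; the kubelet status message shows up under the Ready condition
kubectl describe node <node-name>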

In the past, scaling up has solved the issue, but it's not clear whether that was only incidental. Our current theory is that this was caused by memory pressure on the nodes.

We should try to reproduce the issue in a controlled manner so that we can understand how to mitigate it before moving to production.

clezag commented 1 year ago

1st attempt:

The cluster is composed of two r7i.large nodes with 2 vCPU and 16 GB RAM each.

Using the stress-ng tool on a random pod, I first simulated a CPU overload, which had no noticeable effect. Using the same tool, I then simulated a memory overload. After some time, the node became somewhat unresponsive and went into status "NodeNotReady", but k8s was able to evict the offending pod (it says so in the logs) and free up memory, recovering on its own.
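Roughly, the two overloads can be generated with stress-ng like this; the worker counts, percentages and durations here are illustrative, not the exact values used in the test:

# CPU overload: one worker per core, run for 10 minutes
stress-ng --cpu 2 --timeout 600s

# Memory overload: two workers allocating up to ~80% of available memory
stress-ng --vm 2 --vm-bytes 80% --timeout 600s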

Using kubectl top nodes, though, it's clear that a substantial amount of memory is used even when the system is running fine, supporting our memory hypothesis: after recreating both nodes, they sit at 6.7 and 4.2 GB of memory, respectively.
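The per-node and per-pod usage referenced above comes from the metrics API; the sort flag is just a convenient way to find the heaviest pods:

# Current CPU/memory usage per node
kubectl top nodes

# Heaviest pods across all namespaces, sorted by memory
kubectl top pods -A --sort-by=memory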

To reproduce the previous failure condition, we're scaling the instances down to two m6i.large, which only have 8 GB RAM each, and will repeat the memory pressure test.
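A quick way to confirm what the smaller nodes actually expose to the scheduler after the resize (the custom-columns output is just a sketch):

# Allocatable memory per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOCATABLE_MEM:.status.allocatable.memory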

clezag commented 1 year ago

On the two 8 GB instances I was able to reproduce the failure state, with the same error messages as when it occurred naturally.

I think it's confirmed to be a memory issue now.

To reproduce the issue:

❯ kubectl exec --stdin --tty dc-echarging-alperia-5dbf994477-tlspv -- /bin/bash
dc-echarging-alperia-5dbf994477-tlspv:/app# apk add stress-ng
dc-echarging-alperia-5dbf994477-tlspv:/app# stress-ng --brk 2 --stack 2 --bigheap 2
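The three stressors above (--brk, --stack, --bigheap) all keep growing memory (data segment, stack, heap) until allocation fails, which is what drives the node into memory pressure. While the stress runs, the node transition can be watched from another shell, for example:

# Watch the node status flip to NotReady
kubectl get nodes --watch

# Recent node-related events (memory pressure, evictions, kubelet status)
kubectl get events -A --field-selector involvedObject.kind=Node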

As a lesson learned, we should monitor the memory usage of our nodes more closely, provision memory and CPU limits, and follow up on pods that use unreasonable amounts of memory (of which there are already a number).
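As a minimal sketch of the limits mentioned above: requests/limits can be set per deployment, e.g. for the data collector used in the test. The namespace and the request/limit values here are placeholders to be tuned per workload:

# Set requests/limits on a deployment; values are placeholders
kubectl -n <namespace> set resources deployment dc-echarging-alperia \
  --requests=cpu=100m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi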