1st attempt:
The cluster is composed of two r7i.large nodes with 2 vCPUs and 16 GB RAM each.
Using the stress-ng tool on a random pod, I simulated a CPU overload, which had no noticeable effect. Using the same tool, I also simulated a memory overload. After some time the node became somewhat unresponsive and went into status "NodeNotReady", but it seems that k8s is able to evict the offending pod (it says so in the logs), free up memory, and recover on its own.
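For reference, the eviction and the node's condition can be checked from the outside with standard kubectl commands, roughly along these lines (the node name is a placeholder):
❯ kubectl describe node <node-name> | grep -A 10 Conditions:   # shows the MemoryPressure / Ready conditions
❯ kubectl get events -A --field-selector reason=Evicted        # lists recent pod evictions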
Using kubectl top nodes, though, it's clear that a substantial amount of memory is in use even when the system is running fine, supporting our memory hypothesis:
After recreating both nodes, they sit at 6.7 and 4.2 GB of memory used, respectively.
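To see which pods account for that usage, something like this should work (assuming a reasonably recent kubectl; metrics-server is already required by kubectl top):
❯ kubectl top nodes --sort-by=memory
❯ kubectl top pods -A --sort-by=memory | head -20   # heaviest memory consumers across all namespaces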
To try to reproduce the previous failure condition, we're scaling the instances down to two m6i.large nodes, which have only 8 GB RAM each, and will repeat the memory pressure test.
On the two 8 GB instances I was able to reproduce the failure state, with the same error messages as when it occurred naturally.
I think it's confirmed to be a memory issue now.
To reproduce the issue:
❯ kubectl exec --stdin --tty dc-echarging-alperia-5dbf994477-tlspv -- /bin/bash
dc-echarging-alperia-5dbf994477-tlspv:/app# apk add stress-ng
dc-echarging-alperia-5dbf994477-tlspv:/app# stress-ng --brk 2 --stack 2 --bigheap 2
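While the stress test runs, the node going NotReady can be watched from a second terminal, e.g.:
❯ kubectl get nodes -w   # watch node status transitions
❯ kubectl top nodes      # watch memory climb on the affected node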
As a lesson learned, we should monitor the memory usage of our nodes more closely, provision memory and CPU limits, and follow up on pods that use unreasonable amounts of memory (of which there are already a number).
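As a rough sketch of the limits part, they could be set directly on a deployment (values are only illustrative and would need tuning per workload; the deployment name is assumed from the pod used above):
❯ kubectl set resources deployment dc-echarging-alperia --requests=cpu=100m,memory=256Mi --limits=cpu=500m,memory=512Mi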
Trying to replicate the mysterious outages we had with the cluster: at some point nodes kept failing in a somewhat random manner, entering a cycle of being recreated and failing again. This presents as a node becoming "NodeNotReady" with the error "Kubelet stopped posting status".
In the past, scaling up has solved the issue, but it's not clear whether that was merely incidental. Our current theory is that this was caused by an out-of-memory condition on the nodes.
We should try to reproduce the issue in a controlled manner so that we can understand how to mitigate it before moving to production.