NevinDry opened this issue 1 month ago
Can you please report the output of free -m within the containers when the memory usage is high?
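For reference, one way to run this inside a running Fargate task is via ECS Exec. This is only a sketch: it assumes ECS Exec (enableExecuteCommand) is turned on for the task/service, and the cluster, task and container names below are placeholders.

```sh
# Sketch: run free -m inside a running Fargate task via ECS Exec.
# Assumes enableExecuteCommand is set on the task/service; names are placeholders.
aws ecs execute-command \
  --cluster my-nats-cluster \
  --task <task-id> \
  --container nats \
  --interactive \
  --command "free -m"
```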
Thanks for your answer @neilalexander. Our stg1 containers restarted yesterday, so the memory has not increased much yet; I will keep you updated when the memory is high. Here is the free command output in stg1 at the moment (the environment where the memory increases over time):
STG1

NODE (CPU | Memory: 1 vCPU | 2 GB)
/ # free -h
              total        used        free      shared  buff/cache   available
Mem:           3.6G      565.7M      292.2M      540.0K        2.8G        2.8G
Swap:             0           0           0

SEED (CPU | Memory: 1 vCPU | 2 GB)
/ # free -h
              total        used        free      shared  buff/cache   available
Mem:           3.8G      580.9M      275.3M      540.0K        2.9G        2.9G
Swap:             0           0           0
On dev, where the memory is stable, here is the free command output (note that the CPU/memory provisioning is not the same; could this have an impact?):

DEV
NODE (CPU | Memory: .25 vCPU | .5 GB)
/ # free -h
              total        used        free      shared  buff/cache   available
Mem:         927.8M      547.5M       67.0M      540.0K      313.3M      240.1M
Swap:             0           0           0

SEED (CPU | Memory: .25 vCPU | .5 GB)
/ # free -h
              total        used        free      shared  buff/cache   available
Mem:         927.8M      559.9M       80.8M      544.0K      287.1M      229.2M
Swap:             0           0           0
(Note that the total/used memory reported in both environments is higher than the memory we provisioned for our containers.)
Thank you for your help.
What stands out to me is the buff/cache utilisation, which makes me think you're falling victim to kubernetes/kubernetes#43916. In short, Kubernetes is considering the kernel page cache when deciding whether a pod is under memory pressure. I suspect that if you look at the RSS size (as reported by nats server ls, for example) you'd see the process utilisation itself is stable.
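As a quick alternative, the server's resident set size can also be read from the HTTP monitoring endpoint, assuming monitoring is enabled on the server (port 8222 below is the usual default, and jq is just used for readability):

```sh
# Sketch: read the nats-server resident set size from the monitoring endpoint.
# Assumes the server was started with monitoring enabled (e.g. -m 8222).
curl -s http://localhost:8222/varz | jq '.mem'   # resident set size in bytes
```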
Do you set both a memory request and a memory limit, or just one or the other?
Hi @neilalexander, we did not have memory soft/hard limits set on our Fargate containers. We are going to configure them and see what happens; I will keep you updated. On another note, we observed that only containers with more than the default memory value (512 MB) have their memory increasing over time. Thanks for your guidance!
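For reference, a sketch of how the current limits can be inspected: in an ECS container definition, memory is the hard limit and memoryReservation is the soft limit. The "nats" task definition family name below is a placeholder.

```sh
# Sketch: inspect the soft (memoryReservation) and hard (memory) limits
# currently set on each container in the task definition.
# "nats" is a placeholder task definition family name.
aws ecs describe-task-definition \
  --task-definition nats \
  --query 'taskDefinition.containerDefinitions[].{name:name,hardLimit:memory,softLimit:memoryReservation}' \
  --output table
```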
Observed behavior
We observe that our AWS Fargate containers serving clustered NATS have their memory usage increasing over time (metric: ecs.fargate.mem.usage).
What is weird is that we can observe this behavior in our STG1 environment but not in our DEV environment. These two environments have no significant activity and are deployed the same way through IaC. NATS is deployed using clustering on AWS Fargate.
The difference in memory usage between the two environments is very significant:
dev: (container memory usage graph)
stg1: (container memory usage graph)
If we look at the NATS memory metrics, both environments are stable:
dev: (NATS memory metrics graph)
stg1: (NATS memory metrics graph)
There must be a leak somewhere, but we are unable to identify it.
Expected behavior
The Fargate containers' memory shouldn't increase over time; it should follow the NATS memory metrics and stay stable.
Server and client version
Server version: 2.10.18
Go: go1.22.5
Host environment
NATS clustering inside AWS Fargate, without JetStream.
Operating system/Architecture: Linux/x86_64
CPU | Memory: 2 vCPU | 4 GB
Platform version: 1.4.0
Launch type: FARGATE
Log-router and Datadog run as sidecar containers.
Steps to reproduce
No response