nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

AWS Fargate Memory slowly increasing/leak over time #6006

Open NevinDry opened 5 days ago

NevinDry commented 5 days ago

Observed behavior

We observe that our AWS Fargate containers serving clustered NATS have their memory usage increasing over time:

image (metric: ecs.fargate.mem.usage)

What is odd is that we can observe this behavior in our STG1 environment but not in our DEV environment. The two environments have no significant activity and are deployed the same way through IaC. NATS is deployed with clustering on AWS Fargate.

The difference in memory usage between the two environments is very significant:

dev: image

stg1: image

If we look at the NATS memory metrics, both environments are stable:

dev: image

stg1: image

There must be a leak somewhere, but we are unable to identify it.

Expected behavior

The Fargate containers' memory shouldn't increase over time; it should follow the NATS memory metrics and stay stable.

Server and client version

Server version: 2.10.18 (Go: go1.22.5)

Host environment

NATS clustering inside AWS Fargate, without JetStream.
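
For reference, a minimal sketch of the kind of server configuration this setup implies (a cluster block, no jetstream block); the cluster name, route address, and ports below are hypothetical placeholders, not the actual deployment:

    # nats-server.conf (hypothetical sketch)
    port: 4222        # client connections
    http_port: 8222   # HTTP monitoring endpoint (/varz, /connz, ...)

    cluster {
      name: nats-fargate
      port: 6222
      routes: [
        # seed node route address (placeholder)
        nats-route://nats-seed.internal:6222
      ]
    }

    # no jetstream {} block, so JetStream remains disabled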

Operating system/Architecture: Linux/X86_64
CPU | Memory: 2 vCPU | 4 GB
Platform version: 1.4.0
Launch type: FARGATE

A log router and the Datadog agent run as sidecar containers.

Steps to reproduce

No response

neilalexander commented 4 days ago

Can you please report the output of free -m within the containers when the memory usage is high?

NevinDry commented 4 days ago

Thanks for your answer @neilalexander. Our stg1 containers restarted yesterday, so the memory has not increased much yet. I will keep you updated when the memory is high. Here is the free command output in stg1 at the moment (the env where the memory increases over time):

STG1, NODE (CPU | Memory: 1 vCPU | 2 GB):

    / # free -h
                  total        used        free      shared  buff/cache   available
    Mem:           3.6G      565.7M      292.2M      540.0K        2.8G        2.8G
    Swap:             0           0           0

SEED (CPU | Memory: 1 vCPU | 2 GB):

    / # free -h
                  total        used        free      shared  buff/cache   available
    Mem:           3.8G      580.9M      275.3M      540.0K        2.9G        2.9G
    Swap:             0           0           0

On dev, where the memory is stable, here is the free command output (note that the CPU/memory provisioning is not the same; could this have an impact?):

DEV

NODE (CPU | Memory: .25 vCPU | .5 GB):

    / # free -h
                  total        used        free      shared  buff/cache   available
    Mem:         927.8M      547.5M       67.0M      540.0K      313.3M      240.1M
    Swap:             0           0           0

SEED (CPU | Memory: .25 vCPU | .5 GB):

    / # free -h
                  total        used        free      shared  buff/cache   available
    Mem:         927.8M      559.9M       80.8M      544.0K      287.1M      229.2M
    Swap:             0           0           0

(Note that the total/used memory reported for both environments is higher than the memory we provisioned for our containers.)

Thank you for your help.

neilalexander commented 4 days ago

What stands out to me is the buff/cache utilisation, which makes me think you're falling victim to kubernetes/kubernetes#43916. In short, Kubernetes is considering the kernel page cache when deciding whether a pod is under memory pressure. I suspect if you look at the RSS size (as is reported by nats server ls for example) that you'd see the process utilisation itself is stable.
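
If the HTTP monitoring port is enabled (an assumption; 8222 is the common default), the process RSS can also be read directly from the /varz endpoint inside the container and compared against ecs.fargate.mem.usage:

    / # curl -s http://127.0.0.1:8222/varz | grep '"mem"'

The mem field there is the nats-server process's resident set size in bytes, so it should track the flat NATS memory dashboards rather than the page cache counted by the container-level metric.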

Do you set both a memory request and a memory limit, or just one or the other?
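
For ECS/Fargate, the closest equivalents are the container-level memoryReservation (soft limit) and memory (hard limit) fields of the task definition, both in MiB; the fragment below is a hypothetical illustration, not the reporter's actual task definition:

    {
      "containerDefinitions": [
        {
          "name": "nats",
          "image": "nats:2.10.18",
          "memoryReservation": 1024,
          "memory": 2048
        }
      ]
    }

Without container-level limits, the containers are only capped by the task-level memory size.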

NevinDry commented 3 days ago

Hi @neilalexander, we did not have memory soft/hard limits set on our Fargate containers. We are going to configure them and see what happens; I will keep you updated. On another note, we observed that only containers with more than the default memory value (512M) have their memory increasing over time. Thanks for your guidance!