Kaarel opened this issue 4 months ago
How are you reporting memory usage? Does this include buffers/caches, or is this the resident set size of the NATS Server process?
Please fetch memory profiles from your NATS system using nats server request profile allocs under the system account.
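For example, something like the following should pull an allocation profile from each server (the context name sys is a placeholder here; use whatever nats CLI context or credentials you have for the system account):

```sh
# "sys" is a placeholder context name; it must point at system account ($SYS) credentials.
nats --context sys server request profile allocs
```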
The stats on the diagram are container_memory_usage_bytes, which per https://prometheus.io/docs/guides/cadvisor/ is "the cgroup's total memory usage".
This is how we run NATS: docker run --network host --pid host nats
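Spelled out as a block (a real deployment would typically also pin an image tag and pass JetStream/storage configuration; those details aren't in the command above, so they are omitted here):

```sh
# Command as reported; image tag and server config flags omitted.
docker run --network host --pid host nats
```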
Do you have a Synadia contact (Slack/email) we could send the profile files to?
container_memory_usage_bytes will include the kernel buffers/caches. Can you please check whether container_memory_working_set_bytes shows the same pattern?
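For example, both series can be compared side by side via the Prometheus HTTP API (the Prometheus address and the name label matcher below are assumptions; adjust them to your environment):

```sh
# Hypothetical Prometheus address and container label; adjust to your setup.
PROM=http://localhost:9090

# cgroup total memory usage (includes kernel page cache / buffers)
curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=container_memory_usage_bytes{name="nats"}'

# working set (usage minus inactive file cache), which should track the
# server process footprint more closely
curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=container_memory_working_set_bytes{name="nats"}'
```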
Feel free to send profiles directly to me at neil@nats.io.
Yes, container_memory_working_set_bytes mirrors container_memory_usage_bytes. On the screenshot, the sharp drop is where we restarted the nodes.
Observed behavior
I haven't got much more than INFO-level NATS logs and RAM usage at this stage. The memory leak is slow enough that it never gets to a point where it hurts, because we upgrade to new versions as they are released. But I wanted to bring this to your attention anyway, as the memory profile looks concerning. Or maybe this is expected until a system memory limit is reached?
We have a 3-node cluster, with mostly JS and .NET clients and some PHP. All streams are WorkQueue or Interest retention, plus a couple of small KV buckets. There is traffic, but messages are consumed from the streams immediately, so they are empty most of the time.
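For reference, a representative stream here looks roughly like this (the stream name, subject, and replica count below are made up for illustration, not our exact definitions):

```sh
# Illustrative only - real stream names, subjects and limits differ.
# The nats CLI prompts for any settings not given on the command line.
nats stream add JOBS \
  --subjects "jobs.>" \
  --storage file \
  --retention work \
  --replicas 3
```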
My primary concern is the general steady memory increase.
For whatever reason, the last drop on one of the nodes around the 07/02 mark coincides with creating and deleting a temporary stream for testing purposes. Nodes n1 and n2 had no server logs for that time period. Node n3 had just these two lines in its server logs:
Also notable, maybe: the drop in memory usage occurred not on the node with these log lines but on another node...
If there is any other info I can provide, let me know.
Expected behavior
I guess I would naively expect memory consumption to plateau at some point, something like what one of the cluster nodes demonstrates during the period from 2024/07/02 to 2024/07/12 (see attached screenshot). But that too started to grow after 07/12.
Server and client version
2.10.14 and 2.10.16. See attached screenshot for details.
Host environment
No response
Steps to reproduce
No response