Kaarel opened this issue 4 months ago
How are you reporting memory usage? Does this include buffers/caches, or is this the resident set size of the NATS Server process?
Please fetch memory profiles from your NATS system using nats server request profile allocs under the system account.
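For example, something like the following should pull an allocation profile from each server (the context name sys is a placeholder here; use whatever nats CLI context or credentials you have for the system account):

```sh
# "sys" is a placeholder context name; it must point at system account ($SYS) credentials.
nats --context sys server request profile allocs
```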
The stats on the diagram are container_memory_usage_bytes, which per https://prometheus.io/docs/guides/cadvisor/ is "the cgroup's total memory usage".
This is how we run NATS: docker run --network host --pid host nats
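Spelled out as a block (a real deployment would typically also pin an image tag and pass JetStream/storage configuration; those details aren't in the command above, so they are omitted here):

```sh
# Command as reported; image tag and server config flags omitted.
docker run --network host --pid host nats
```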
Do you have a Synadia contact (Slack/email) we could send the profile files to?
container_memory_usage_bytes will include the kernel buffers/caches. Can you please check whether container_memory_working_set_bytes shows the same pattern?
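For example, both series can be compared side by side via the Prometheus HTTP API (the Prometheus address and the name label matcher below are assumptions; adjust them to your environment):

```sh
# Hypothetical Prometheus address and container label; adjust to your setup.
PROM=http://localhost:9090

# cgroup total memory usage (includes kernel page cache / buffers)
curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=container_memory_usage_bytes{name="nats"}'

# working set (usage minus inactive file cache), which should track the
# server process footprint more closely
curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=container_memory_working_set_bytes{name="nats"}'
```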
Feel free to send profiles directly to me at neil@nats.io.
Yes, container_memory_working_set_bytes mirrors container_memory_usage_bytes. On the screenshot, the sharp drop is where we restarted the nodes.
Observed behavior
I haven't got much more than INFO-level NATS logs and RAM usage at this stage. The memory leak is slow enough that it never gets to a point where it hurts, because we upgrade to new versions as they are released. But I wanted to bring this to your attention anyway, as the memory profile looks concerning. Or maybe this is expected until a system memory limit is reached?
We have a 3-node cluster, with mostly JS and .NET clients and some PHP. All streams are WorkQueue or Interest retention, plus a couple of small KV buckets. There is traffic, but messages are consumed from the streams immediately, so they are empty most of the time.
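For reference, a representative stream here looks roughly like this (the stream name, subject, and replica count below are made up for illustration, not our exact definitions):

```sh
# Illustrative only - real stream names, subjects and limits differ.
# The nats CLI prompts for any settings not given on the command line.
nats stream add JOBS \
  --subjects "jobs.>" \
  --storage file \
  --retention work \
  --replicas 3
```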
My primary concern is the general steady memory increase.
For whatever reason, the last drop on one of the nodes around the 07/02 mark coincides with creating and deleting a temporary stream for testing purposes. Nodes n1 and n2 had no server logs for that time period. Node n3 had just these two lines in its server logs:
Also notable, maybe: the drop in memory usage occurred not on the node with these log lines but on another node...
If there is any other info I can provide, let me know.
Expected behavior
I guess I would naively expect memory consumption to plateau at some point, something like what one of the cluster nodes demonstrates during the period from 2024/07/02 to 2024/07/12 (see attached screenshot). But that too started to grow after 07/12.
Server and client version
2.10.14 and 2.10.16. See attached screenshot for details.
Host environment
No response
Steps to reproduce
No response