Open valeraBr opened 1 month ago
Something's not right with the memory profile you attached; can you please take another?
~/Downloads/heap_files/mem.prof: parsing profile: unrecognized profile format
Hello @neilalexander, attached: heap_files.tar.gz
Sorry for the late response; it took time for the memory and CPU values to get high enough to take the profile files again.
I have taken them once again and validated this time that I can open them using the command:
go tool pprof -http=:8080 ~/Downloads/heap_files/mem.prof
Hope it's better this time.
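For anyone trying to reproduce this, a minimal sketch of one way such a profile can be captured, assuming profiling is enabled with prof_port: 65432 in nats.conf (the port here is only an example):
# Fetch an in-use heap profile from the server's standard Go pprof endpoint
curl -o ~/Downloads/heap_files/mem.prof http://localhost:65432/debug/pprof/heap
# Then open it locally as above
go tool pprof -http=:8080 ~/Downloads/heap_files/mem.prof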
This one's better, thanks, but it's only showing ~9MB heap. How are you measuring the memory usage?
I thought it would not show up in the heap because it doesn't look like Go itself is using this memory. Also, the nats server ls output shown before doesn't correlate with the numbers I see in the Kubernetes metrics, both via kubectl and in the Grafana dashboards.
k top pods -n prod-server
NAME CPU(cores) MEMORY(bytes)
nats-0 365m 6234Mi
nats-1 507m 6037Mi
nats-2 559m 6809Mi
This feels a lot like a duplicate of #5870 and #5881. Can you please verify whether the memory usage inside the pod is buff/cache, using free -m?
Yes, #5870 looks similar, but are you sure about #5881?
/ # free -m
total used free shared buff/cache available
Mem: 31686 2900 13603 7 15183 28402
Swap: 0 0 0
The memory value is lower now because I did a restart yesterday.
k top pods -n prod-server
NAME CPU(cores) MEMORY(bytes)
nats-0 236m 1472Mi
nats-1 270m 1476Mi
nats-2 364m 1477Mi
Can you also do nats server ls from the system account, and also find the RSS of the nats-server process inside the container?
We have found a number of times recently, with users in Kubernetes environments, that K8s misreports the memory usage because it includes the kernel page cache, which can cause problems with pod scheduling.
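If it helps, a rough way to see the split from inside the container is to read the cgroup accounting directly; this sketch assumes cgroup v2 (on cgroup v1 the file is /sys/fs/cgroup/memory/memory.stat and the page-cache field is called cache):
# "anon" is memory owned by processes, "file" is page cache charged to the pod
grep -E '^(anon|file) ' /sys/fs/cgroup/memory.stat
# Total charged memory, roughly what the Kubernetes metrics report
cat /sys/fs/cgroup/memory.current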
Have you noticed the same behaviour in 2.10.22?
@neilalexander Current status:
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Server Overview │
├────────┬─────────┬──────┬─────────┬─────┬───────┬───────┬────────┬─────┬─────────┬───────┬───────┬──────┬────────────┬─────┤
│ Name │ Cluster │ Host │ Version │ JS │ Conns │ Subs │ Routes │ GWs │ Mem │ CPU % │ Cores │ Slow │ Uptime │ RTT │
├────────┼─────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼─────┤
│ nats-1 │ nats │ 0 │ 2.10.20 │ yes │ 17 │ 1,474 │ 8 │ 0 │ 42 MiB │ 10 │ 8 │ 1 │ 2d6h50m29s │ 2ms │
│ nats-0 │ nats │ 0 │ 2.10.20 │ yes │ 18 │ 1,481 │ 8 │ 0 │ 52 MiB │ 8 │ 8 │ 0 │ 2d6h48m56s │ 2ms │
│ nats-2 │ nats │ 0 │ 2.10.20 │ yes │ 21 │ 1,477 │ 8 │ 0 │ 40 MiB │ 12 │ 8 │ 0 │ 2d6h52m22s │ 2ms │
├────────┼─────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼─────┤
│ │ 1 │ 3 │ │ 3 │ 56 │ 4,432 │ │ │ 134 MIB │ │ │ 1 │ │ │
╰────────┴─────────┴──────┴─────────┴─────┴───────┴───────┴────────┴─────┴─────────┴───────┴───────┴──────┴────────────┴─────╯
╭────────────────────────────────────────────────────────────────────────────╮
│ Cluster Overview │
├─────────┬────────────┬───────────────────┬───────────────────┬─────────────┤
│ Cluster │ Node Count │ Outgoing Gateways │ Incoming Gateways │ Connections │
├─────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ nats │ 3 │ 0 │ 0 │ 56 │
├─────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ │ 3 │ 0 │ 0 │ 56 │
╰─────────┴────────────┴───────────────────┴───────────────────┴─────────────╯
k top pods -n prod-server
NAME CPU(cores) MEMORY(bytes)
nats-0 103m 3606Mi
nats-1 86m 3609Mi
nats-2 307m 3623Mi
/ # ps aux
PID USER TIME COMMAND
1 65535 0:00 /pause
7 root 14h33 nats-server --config /etc/nats-config/nats.conf
25 root 0:00 /nats-server-config-reloader -pid /var/run/nats/nats.pid -config /etc/nats-config/nats.conf -config /etc/nats-certs/nats/tls.crt -config /etc/nats-ce
38 root 1:53 /prometheus-nats-exporter -port=7777 -connz -routez -subz -varz -prefix=nats -use_internal_server_id -jsz=all http://localhost:8222/
58 root 0:00 sh -c clear; (bash || ash || sh)
65 root 0:00 ash
77 root 0:00 sh -c clear; (bash || ash || sh)
84 root 0:00 ash
85 root 0:00 ps aux
/ # ps aux -eo pid,ppid,rss,vsz
PID PPID RSS VSZ
1 0 4 760
7 0 42m 1.2g
25 0 2688 1.1g
38 0 15m 1.1g
58 0 4 1720
65 58 1212 1816
77 0 4 1720
84 77 1264 1788
86 84 4 1712
Hi @neilalexander, do you have any insights on the stats I last shared? Thanks.
Hi @valeraBr, I think this is a duplicate of the above issues. It's kernel page cache, not NATS resident set usage.
I don't know that we can do anything about it. This is really a bug in how Kubernetes reports memory usage.
Thanks for answering. If this is the issue, I don't see any reason why it happens only to the NATS pods. I have other applications in the same namespace, and it never happens with them.
The JetStream filestore is block-based where each block is only at most a few MB, often smaller. As you continuously publish to streams, more blocks get created, which results in more files on disk, which results in more entries in the kernel page cache (because there's memory available to cache them, which is a performance optimisation).
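You can see that build-up reflected on disk; a rough check, assuming the JetStream store directory is mounted at /data (adjust the path to your deployment):
# Count the filestore message blocks and their total size; every block that has been
# read or written recently is a candidate entry in the kernel page cache
find /data/jetstream -type f -name '*.blk' | wc -l
du -sh /data/jetstream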
In normal kernel operation, the page cache entries for the least recently used files would get purged automatically in response to memory demand from applications, so this build-up wouldn't actually matter; if something else needed the memory, it would be reclaimed. This is why it's not right that Kubernetes counts this as pod memory usage: it's effectively volatile and can be reclaimed at any time.
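If you want to convince yourself that this memory really is reclaimable, you can drop the clean page cache from the node itself (this needs root on the Kubernetes node, not the pod, and is harmless beyond a temporary loss of cache warmth):
# Run on the node hosting the pod: flush dirty pages, then drop the clean page cache.
# The pod's reported memory usage should fall while the nats-server RSS stays unchanged.
sync
echo 1 > /proc/sys/vm/drop_caches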
I expect your other applications either aren't growing files at a continuous rate, or aren't creating lots of new files, otherwise they would likely exhibit the same issue.
Thanks for the detailed explanation; it's much clearer now. Indeed, our stream deals with a lot of small messages, so based on your answer I can only assume that NATS is not the ideal solution for us, or is just not ready to run in a production environment on top of K8s. Any advice on how we can solve it? Running NATS on VMs is unfortunately not an option for us right now.
I think you can probably work around this by reducing the pods' memory limit, and then setting both the memory reservation and the memory limit for the pods to the same value. That should keep the size of the page cache down to a manageable amount.
Also just want to stress that this isn't really a case of NATS not being ready for Kubernetes, but rather that this is a classic resource allocation issue. If you tell a pod it can have X GB memory, you have to expect that it will use anything up to that.
Setting the memory limit and the memory reservation to the same value will "ringfence" the memory and prevent overprovisioning, but then you have to expect that the available memory in the pod is also eligible to be used for the page cache, as we're seeing here.
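As a concrete sketch, assuming the StatefulSet is named nats and that 2Gi is a sensible ceiling for your workload (in practice you would set this through your Helm values rather than imperatively):
# Set the memory request and limit to the same value so the pod's memory is ringfenced
kubectl -n prod-server set resources statefulset/nats --requests=memory=2Gi --limits=memory=2Gi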
Observed behavior
Recently we encountered a strange memory leak in our NATS brokers. Memory consumption grows continuously without any corresponding workload that we have noticed, and keeps growing until the pod is OOM-killed (the limit is 8 GB). GOMEMLIMIT is configured to 6GiB. We have had this issue across several NATS versions already, and upgrading didn't help:
2.10.9 -> 2.10.17 -> 2.10.20.
The nats server ls command shows that the consumption is low, as seen below:
But the Kubernetes metrics show the following:
NAME CPU(cores) MEMORY(bytes)
nats-0 127m 5977Mi
nats-1 288m 5994Mi
nats-2 219m 5971Mi
We know the overall trigger for this behaviour but not the root cause: it started to happen after we began to use the clients_async_tasks stream heavily.
We have two workarounds for this issue for now:
Expected behavior
The stream in question handles a lot of small messages and is empty almost all of the time. We would expect the memory consumption not to grow by ~1.5 GiB per day until the OOM.
Server and client version
The issue appears in the versions listed above (2.10.9, 2.10.17, 2.10.20).
Host environment
EKS cluster 1.30; nodes: 8 CPU, 32 GB RAM.
Steps to reproduce
We have several staging environments with the same configuration (less traffic), and it's not reproducible there.