Open valeraBr opened 1 month ago
Something's not right with the memory profile you attached; can you please take another?
~/Downloads/heap_files/mem.prof: parsing profile: unrecognized profile format
Hello @neilalexander, attached: heap_files.tar.gz
Sorry for the late response; it took time for the memory and CPU values to get high enough to take the profile files again.
I have taken them once again and validated this time that I can open them using the command:
go tool pprof -http=:8080 ~/Downloads/heap_files/mem.prof
Hope it's better this time.
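For anyone trying to reproduce this, a minimal sketch of one way such a profile can be captured, assuming profiling is enabled with prof_port: 65432 in nats.conf (the port here is only an example):
# Fetch an in-use heap profile from the server's standard Go pprof endpoint
curl -o ~/Downloads/heap_files/mem.prof http://localhost:65432/debug/pprof/heap
# Then open it locally as above
go tool pprof -http=:8080 ~/Downloads/heap_files/mem.prof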
This one's better, thanks, but it's only showing ~9MB heap. How are you measuring the memory usage?
I thought it would not show up in the heap because it doesn't look like Go itself is using this memory. Also, the nats server ls output shown before doesn't correlate with the numbers I see in the Kubernetes metrics, both via kubectl and in the Grafana dashboards.
k top pods -n prod-server
NAME CPU(cores) MEMORY(bytes)
nats-0 365m 6234Mi
nats-1 507m 6037Mi
nats-2 559m 6809Mi
This feels a lot like a duplicate of #5870 and #5881. Can you please verify whether the memory usage inside the pod is buff/cache, using free -m?
Yes, #5870 looks similar, but are you sure about #5881?
/ # free -m
total used free shared buff/cache available
Mem: 31686 2900 13603 7 15183 28402
Swap: 0 0 0
The memory value is lower now because I did a restart yesterday.
k top pods -n prod-server
NAME CPU(cores) MEMORY(bytes)
nats-0 236m 1472Mi
nats-1 270m 1476Mi
nats-2 364m 1477Mi
Can you also do nats server ls from the system account, and also find the RSS of the nats-server process inside the container?
We have found a number of times recently, with users in Kubernetes environments, that K8s misreports the memory usage because it includes the kernel page cache, which can cause problems with pod scheduling.
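If it helps, a rough way to see the split from inside the container is to read the cgroup accounting directly; this sketch assumes cgroup v2 (on cgroup v1 the file is /sys/fs/cgroup/memory/memory.stat and the page-cache field is called cache):
# "anon" is memory owned by processes, "file" is page cache charged to the pod
grep -E '^(anon|file) ' /sys/fs/cgroup/memory.stat
# Total charged memory, roughly what the Kubernetes metrics report
cat /sys/fs/cgroup/memory.current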
Have you noticed the same behaviour in 2.10.22?
@neilalexander Current status:
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Server Overview │
├────────┬─────────┬──────┬─────────┬─────┬───────┬───────┬────────┬─────┬─────────┬───────┬───────┬──────┬────────────┬─────┤
│ Name │ Cluster │ Host │ Version │ JS │ Conns │ Subs │ Routes │ GWs │ Mem │ CPU % │ Cores │ Slow │ Uptime │ RTT │
├────────┼─────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼─────┤
│ nats-1 │ nats │ 0 │ 2.10.20 │ yes │ 17 │ 1,474 │ 8 │ 0 │ 42 MiB │ 10 │ 8 │ 1 │ 2d6h50m29s │ 2ms │
│ nats-0 │ nats │ 0 │ 2.10.20 │ yes │ 18 │ 1,481 │ 8 │ 0 │ 52 MiB │ 8 │ 8 │ 0 │ 2d6h48m56s │ 2ms │
│ nats-2 │ nats │ 0 │ 2.10.20 │ yes │ 21 │ 1,477 │ 8 │ 0 │ 40 MiB │ 12 │ 8 │ 0 │ 2d6h52m22s │ 2ms │
├────────┼─────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼─────┤
│ │ 1 │ 3 │ │ 3 │ 56 │ 4,432 │ │ │ 134 MIB │ │ │ 1 │ │ │
╰────────┴─────────┴──────┴─────────┴─────┴───────┴───────┴────────┴─────┴─────────┴───────┴───────┴──────┴────────────┴─────╯
╭────────────────────────────────────────────────────────────────────────────╮
│ Cluster Overview │
├─────────┬────────────┬───────────────────┬───────────────────┬─────────────┤
│ Cluster │ Node Count │ Outgoing Gateways │ Incoming Gateways │ Connections │
├─────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ nats │ 3 │ 0 │ 0 │ 56 │
├─────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ │ 3 │ 0 │ 0 │ 56 │
╰─────────┴────────────┴───────────────────┴───────────────────┴─────────────╯
k top pods -n prod-server
NAME CPU(cores) MEMORY(bytes)
nats-0 103m 3606Mi
nats-1 86m 3609Mi
nats-2 307m 3623Mi
/ # ps aux
PID USER TIME COMMAND
1 65535 0:00 /pause
7 root 14h33 nats-server --config /etc/nats-config/nats.conf
25 root 0:00 /nats-server-config-reloader -pid /var/run/nats/nats.pid -config /etc/nats-config/nats.conf -config /etc/nats-certs/nats/tls.crt -config /etc/nats-ce
38 root 1:53 /prometheus-nats-exporter -port=7777 -connz -routez -subz -varz -prefix=nats -use_internal_server_id -jsz=all http://localhost:8222/
58 root 0:00 sh -c clear; (bash || ash || sh)
65 root 0:00 ash
77 root 0:00 sh -c clear; (bash || ash || sh)
84 root 0:00 ash
85 root 0:00 ps aux
/ # ps aux -eo pid,ppid,rss,vsz
PID PPID RSS VSZ
1 0 4 760
7 0 42m 1.2g
25 0 2688 1.1g
38 0 15m 1.1g
58 0 4 1720
65 58 1212 1816
77 0 4 1720
84 77 1264 1788
86 84 4 1712
Hi @neilalexander, do you have any insights on the stats I last shared? Thanks.
Hi @valeraBr, I think this is a duplicate of the above issues. It's kernel page cache, not NATS resident set usage.
I don't know that we can do anything about it. This is really a bug in how Kubernetes reports memory usage.
Thanks for answering. If this is the issue, I don't see any reason why it happens only to the NATS pods. I have other applications in the same namespace, and it never happens with them.
The JetStream filestore is block-based where each block is only at most a few MB, often smaller. As you continuously publish to streams, more blocks get created, which results in more files on disk, which results in more entries in the kernel page cache (because there's memory available to cache them, which is a performance optimisation).
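You can see that build-up reflected on disk; a rough check, assuming the JetStream store directory is mounted at /data (adjust the path to your deployment):
# Count the filestore message blocks and their total size; every block that has been
# read or written recently is a candidate entry in the kernel page cache
find /data/jetstream -type f -name '*.blk' | wc -l
du -sh /data/jetstream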
In normal kernel operation, the page cache entries for the least recently used files would get purged automatically in response to memory demand from applications, so this build-up wouldn't actually matter; if something else needed the memory, it would be reclaimed. This is why it's not right that Kubernetes counts this as pod memory usage: it's effectively volatile and can be reclaimed at any time.
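If you want to convince yourself that this memory really is reclaimable, you can drop the clean page cache from the node itself (this needs root on the Kubernetes node, not the pod, and is harmless beyond a temporary loss of cache warmth):
# Run on the node hosting the pod: flush dirty pages, then drop the clean page cache.
# The pod's reported memory usage should fall while the nats-server RSS stays unchanged.
sync
echo 1 > /proc/sys/vm/drop_caches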
I expect your other applications either aren't growing files at a continuous rate, or aren't creating lots of new files, otherwise they would likely exhibit the same issue.
Thanks for the detailed explanation; it's much clearer now. Indeed, our stream deals with a lot of small messages, so based on your answer I can only assume that NATS is not the ideal solution for us, or is just not ready to run in a production environment on top of K8s. Any advice on how we can solve it? Running NATS on VMs is unfortunately not an option for us right now.
I think you can probably work around this by reducing the pods' memory limit, and then setting both the memory reservation and the memory limit for the pods to the same value. That should keep the size of the page cache down to a manageable amount.
Also just want to stress that this isn't really a case of NATS not being ready for Kubernetes, but rather that this is a classic resource allocation issue. If you tell a pod it can have X GB memory, you have to expect that it will use anything up to that.
Setting the memory limit and the memory reservation to the same value will "ringfence" the memory and prevent overprovisioning, but then you have to expect that the available memory in the pod is also eligible to be used for the page cache, as we're seeing here.
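As a concrete sketch, assuming the StatefulSet is named nats and that 2Gi is a sensible ceiling for your workload (in practice you would set this through your Helm values rather than imperatively):
# Set the memory request and limit to the same value so the pod's memory is ringfenced
kubectl -n prod-server set resources statefulset/nats --requests=memory=2Gi --limits=memory=2Gi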
Observed behavior
Recently we encountered a strange memory leak in our NATS brokers. Memory consumption grows continuously without any corresponding workload that we have noticed, and keeps growing until the pod is OOM-killed (the limit is 8 GB). GOMEMLIMIT is configured to 6GiB. We have had this issue across several NATS versions already, and upgrading didn't help:
2.10.9 -> 2.10.17 -> 2.10.20.
The nats server ls command shows that the consumption is low, as seen below:
But the Kubernetes metrics show the following:
NAME CPU(cores) MEMORY(bytes)
nats-0 127m 5977Mi
nats-1 288m 5994Mi
nats-2 219m 5971Mi
We know the overall trigger for this behaviour but not the root cause: it started to happen after we began to use the clients_async_tasks stream heavily.
We have two workarounds for this issue for now:
Expected behavior
The stream in question handles a lot of small messages and is empty almost all of the time. We would expect the memory consumption not to grow by ~1.5 GiB per day until the OOM.
Server and client version
The issue appears in the versions listed above (2.10.9, 2.10.17, 2.10.20).
Host environment
EKS cluster 1.30; nodes: 8 CPU, 32 GB RAM.
Steps to reproduce
We have several staging environments with the same configuration (less traffic), and it's not reproducible there.