nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io

OOM killed when using JetStream WorkQueue [v2.10.18] #5739

Open kohlisid opened 3 months ago

kohlisid commented 3 months ago

Observed behavior

Running 2 streams on a JetStream cluster with 3 nodes.

The server config is as follows:


Process Details:

         Version: 2.10.18
      Git Commit: 57d23ac
      Go Version: go1.22.5
      Start Time: 2024-08-02 01:00:45.487109016 +0000 UTC
          Uptime: 34m59s

Connection Details:

   Auth Required: true
    TLS Required: false
            Host: 0.0.0.0:4222
     Client URLs: 10.244.0.91:4222
                  10.244.0.92:4222
                  10.244.0.90:4222

JetStream:

              Domain: 
   Storage Directory: /data/jetstream/store/jetstream
          Max Memory: 768 MiB
            Max File: 1.0 TiB
     Active Accounts: 1
       Memory In Use: 0 B
         File In Use: 738 B
        API Requests: 7,948
          API Errors: 0

Limits:

        Max Conn: 65536
        Max Subs: 0
     Max Payload: 64 MiB
     TLS Timeout: 2s
  Write Deadline: 10s

Statistics:

       CPU Cores: 32 218.00%
          Memory: 162 MiB
     Connections: 8
   Subscriptions: 540
            Msgs: 682,490 in 730,334 out
           Bytes: 1.0 TiB in 1.2 TiB out
  Slow Consumers: 0

Cluster:

            Name: default
            Host: 0.0.0.0:6222
            URLs: isbsvc-default-js-0.isbsvc-default-js-svc.default.svc:6222
                  isbsvc-default-js-1.isbsvc-default-js-svc.default.svc:6222
                  isbsvc-default-js-2.isbsvc-default-js-svc.default.svc:6222

The stream configs are as follows:

~ # js s info default-simple-pipeline-out-0
Information for Stream default-simple-pipeline-out-0 created 2024-08-02 01:02:04

             Subjects: default-simple-pipeline-out-0
             Replicas: 3
              Storage: File

Options:

            Retention: WorkQueue
     Acknowledgements: true
       Discard Policy: New
     Duplicate Window: 1m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: 10,000
  Maximum Per Subject: unlimited
        Maximum Bytes: 100 MiB
          Maximum Age: 3d0h0m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited

Cluster Information:

                 Name: default
               Leader: isbsvc-default-js-2
              Replica: isbsvc-default-js-0, current, seen 0.01s ago
              Replica: isbsvc-default-js-1, outdated, seen 0.00s ago, 1 operation behind

State:

             Messages: 7
                Bytes: 44 MiB
             FirstSeq: 9,422 @ 2024-08-02T01:23:27 UTC
              LastSeq: 9,428 @ 2024-08-02T01:23:27 UTC
     Active Consumers: 1
   Number of Subjects: 1

Information for Stream default-simple-pipeline-cat-0 created 2024-08-02 01:02:03

             Subjects: default-simple-pipeline-cat-0
             Replicas: 3
              Storage: File

Options:

            Retention: WorkQueue
     Acknowledgements: true
       Discard Policy: New
     Duplicate Window: 1m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: 10,000
  Maximum Per Subject: unlimited
        Maximum Bytes: 100 MiB
          Maximum Age: 3d0h0m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited

Cluster Information:

                 Name: default
               Leader: isbsvc-default-js-0
              Replica: isbsvc-default-js-1, current, seen 0.00s ago, 2 operations behind
              Replica: isbsvc-default-js-2, current, seen 0.00s ago

State:

             Messages: 5
                Bytes: 32 MiB
             FirstSeq: 3,897 @ 2024-08-02T01:10:42 UTC
              LastSeq: 4,220 @ 2024-08-02T01:11:26 UTC
     Deleted Messages: 319
     Active Consumers: 1
   Number of Subjects: 1

I am running a test where I publish messages with a payload size of 5 MB to the streams, but I saw that the servers got terminated because they were OOM killed.

    State:        Running
    Started:      Fri, 02 Aug 2024 01:05:07 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 02 Aug 2024 01:00:45 +0000
      Finished:     Fri, 02 Aug 2024 01:05:06 +0000
    Ready:          True
    Restart Count:  1
    Limits:
      memory:  1Gi
    Requests:
      cpu:     1
      memory:  1Gi

    State:          Running
    Started:      Fri, 02 Aug 2024 01:04:24 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 02 Aug 2024 01:00:45 +0000
      Finished:     Fri, 02 Aug 2024 01:04:23 +0000
    Ready:          True
    Restart Count:  1

Expected behavior

I am running a test with a payload size of 5 MB at 50 msgs/sec.

[55] 2024/08/02 01:00:45.460331 [INF] Starting nats-server
[55] 2024/08/02 01:00:45.460447 [INF]   Version:  2.10.18

[55] 2024/08/02 01:00:45.462256 [INF] ---------------- JETSTREAM ----------------
[55] 2024/08/02 01:00:45.462259 [INF]   Max Memory:      768.00 MB
[55] 2024/08/02 01:00:45.462261 [INF]   Max Storage:     1.00 TB
[55] 2024/08/02 01:00:45.462262 [INF]   Store Directory: "/data/jetstream/store/jetstream"
[55] 2024/08/02 01:00:45.462263 [INF] -------------------------------------------

There are 2 streams with a replication factor of 3.

I see this OOMKilled error on 2 servers. Why would I be seeing this error when I'm using filestore as the storage mode? Will the streams be pulled into memory at runtime?

Is this high memory requirement expected? If so, what would be a good way to determine the max resource allocation for the JetStream servers?

Server and client version

Version: 2.10.18

Host environment

No response

Steps to reproduce

No response

kohlisid commented 3 months ago

Perhaps related to #5673, but we are setting stream limits of maxBytes and maxMessages.
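
For reference, a minimal nats.go sketch of how stream limits like these get applied when the streams are created (the URL here is a placeholder; the values mirror the stream info above):

// Sketch only: create a WorkQueue stream with the limits reported above.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://isbsvc-default-js-svc:4222") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Limits correspond to: Maximum Messages 10,000, Maximum Bytes 100 MiB,
	// Maximum Age 3 days, Discard New, 1m duplicate window, R3 file storage.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:       "default-simple-pipeline-out-0",
		Subjects:   []string{"default-simple-pipeline-out-0"},
		Retention:  nats.WorkQueuePolicy,
		Storage:    nats.FileStorage,
		Replicas:   3,
		Discard:    nats.DiscardNew,
		MaxMsgs:    10_000,
		MaxBytes:   100 * 1024 * 1024,
		MaxAge:     3 * 24 * time.Hour,
		Duplicates: time.Minute,
	}); err != nil {
		log.Fatal(err)
	}
}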

neilalexander commented 3 months ago

How much memory are you giving each NATS node?

kohlisid commented 3 months ago

@neilalexander

[55] 2024/08/02 01:00:45.462256 [INF] ---------------- JETSTREAM ----------------
[55] 2024/08/02 01:00:45.462259 [INF]   Max Memory:      768.00 MB
[55] 2024/08/02 01:00:45.462261 [INF]   Max Storage:     1.00 TB

neilalexander commented 3 months ago

I asked specifically because the reporting doesn't always match cgroup restrictions. Do you OOM as quickly if you set the GOMEMLIMIT environment variable to some floor below the available memory, e.g. GOMEMLIMIT=512MiB?

If you can capture some memory profiles when memory usage is up, but before the OOM, that would be useful: either run nats server request profile allocs from the system account, or see https://docs.nats.io/running-a-nats-service/nats_admin/profiling.
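
If the profiling port is enabled on the server (prof_port in the server config), a quick sketch like the one below can pull the allocation profile over HTTP; the host and port are placeholders for your deployment:

// Sketch only: fetch an allocation profile from a nats-server with the
// profiling port enabled (prof_port in the server config).
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Placeholder host/port; use the pod address and your configured prof_port.
	resp, err := http.Get("http://isbsvc-default-js-0:65432/debug/pprof/allocs")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	f, err := os.Create("mem_allocs.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote mem_allocs.pprof; inspect with: go tool pprof mem_allocs.pprof")
}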

ripienaar commented 3 months ago

The Max Memory shown does not limit how much memory NATS uses. It limits how much memory storage a stream may use and has no impact on NATS overall memory usage for its normal running needs.

What are you giving the process from your OS perspective?

kohlisid commented 3 months ago

I'm running the cluster on K8s, with each JS server as a pod with a memory limit of 1Gi. @ripienaar

ripienaar commented 3 months ago

Huge payload messages will consume more RAM. In my experience I wouldn't run NATS with less than 3GB of memory when using JetStream, and your big messages will make matters worse.

kohlisid commented 3 months ago

But on the flip side, how would we estimate how much memory to allocate in such a scenario? Is there some guideline? If there is a surge in message rate, we need to ensure that it doesn't OOM. @ripienaar

ripienaar commented 3 months ago

It varies a lot by use case, message rates, etc., and changes from version to version; usually the needs go down. A workload that uses about 1GB of memory for me today used to use 6GB some time ago.

You should set up monitoring and use nats bench to simulate various scenarios. There is some guidance on the website about memory usage, but it's definitely wrong, so doing your own tests would be best.

kohlisid commented 3 months ago

@ripienaar I agree with trying to simulate the scenarios, but under high load the memory was spiking up a lot. I cannot run my nodes without memory limit guardrails, so I was trying to mimic the scenario here using limited resources to see when it would OOM. Even assuming the worst case, where the whole stream is pulled into the cache (despite using filestore storage), a maxBytes of 100 MB for 2 streams with replication=3 gives 600 MB. But this ceiling is getting hit more often than not.

For high workloads, giving unbounded memory isn't possible. I want to understand where and why the server would be using the excess memory, and thus how to decide the upper bound in that case.

kohlisid commented 3 months ago

Here is the debug/pprof/allocs profile: mem_pprof.zip @neilalexander

derekcollison commented 3 months ago

A good practice when limiting memory via cgroups and containers is to set the env GOMEMLIMIT to ~75% of the actual limit. Sometimes the Go runtime only sees the host's memory value and does not feel pressure to GC and clean things up, but the container and Linux kernel will OOM it.
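
As an illustration (not nats-server code), the Go runtime picks GOMEMLIMIT up at startup as its soft memory limit, which is what gives the GC the pressure to collect earlier:

// Illustration only: GOMEMLIMIT becomes the runtime's soft memory limit.
// Passing a negative value to SetMemoryLimit reads the effective limit
// back without changing it.
package main

import (
	"fmt"
	"os"
	"runtime/debug"
)

func main() {
	fmt.Println("GOMEMLIMIT env:", os.Getenv("GOMEMLIMIT"))
	fmt.Println("effective soft limit (bytes):", debug.SetMemoryLimit(-1))
}

With a 1Gi container limit, ~75% works out to roughly GOMEMLIMIT=768MiB.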

kohlisid commented 3 months ago

That could help in terms of the Go GC, but in terms of NATS, what would be a good upper limit in the first place? How should we calculate that? Even when I consider the worst case where all streams reside in memory, I saw the OOM as stated above. For filestore mode we would not expect that to be the case. @derekcollison

derekcollison commented 3 months ago

For any program written in Go that is to be run in a container, it's best practice these days to set GOMEMLIMIT.

We do plan on introducing an upper bound on how much buffer memory all filestore-backed streams in a server can use. That will probably come in 2.12.

kohlisid commented 3 months ago

I agree on GOMEMLIMIT and will definitely go ahead with changing that. But as a practice at this point, can we estimate the upper-bound memory for the container based on the TPS, message size, etc.? Or are there some experimental settings that can help measure that? @derekcollison

derekcollison commented 3 months ago

Memory usage is quite dynamic: for core NATS it depends on connections and the number of subscriptions, and for JetStream on the number of streams and consumers, access patterns, etc.

The best way is to model the upper bounds of what you expect on a larger system and monitor RSS peaks.

kohlisid commented 2 months ago

On a recent exploration with the Limits retention policy, I saw memory spiking up considerably again.

│                                                                                                      Stream Report                                                                                                      │
├──────────────────────────────────────────────────────────────────────────────┬─────────┬───────────┬───────────┬──────────┬─────────┬──────┬─────────┬──────────────────────────────────────────────────────────────────┤
│ Stream                                                                       │ Storage │ Placement │ Consumers │ Messages │ Bytes   │ Lost │ Deleted │ Replicas                                                         │
├──────────────────────────────────────────────────────────────────────────────┼─────────┼───────────┼───────────┼──────────┼─────────┼──────┼─────────┼──────────────────────────────────────────────────────────────────┤
│ KV_xxxx-simple-pipeline-out_SINK_OT          │ File    │           │ 0         │ 0        │ 0 B     │ 0    │ 0       │ isbsvc-default-js-1*, isbsvc-default-js-3, isbsvc-default-js-4!  │
│ KV_xxxx-simple-pipeline_SIDE_INPUTS          │ File    │           │ 0         │ 0        │ 0 B     │ 0    │ 0       │ isbsvc-default-js-1, isbsvc-default-js-2*, isbsvc-default-js-3   │
│ KV_xxxx-simple-pipeline-out_SINK_PROCESSORS  │ File    │           │ 0         │ 8        │ 1.1 KiB │ 0    │ 2002    │ isbsvc-default-js-1*, isbsvc-default-js-2, isbsvc-default-js-3   │
│ KV_xxxx-simple-pipeline-in-out_OT            │ File    │           │ 10        │ 10       │ 1.3 KiB │ 0    │ 9       │ isbsvc-default-js-0, isbsvc-default-js-1, isbsvc-default-js-3*   │
│ KV_xxxx-simple-pipeline-in-out_PROCESSORS    │ File    │           │ 7         │ 10       │ 1.3 KiB │ 0    │ 0       │ isbsvc-default-js-2, isbsvc-default-js-3*, isbsvc-default-js-4!  │
│ KV_xxxx-simple-pipeline-in_SOURCE_OT         │ File    │           │ 24        │ 10       │ 1.3 KiB │ 0    │ 1       │ isbsvc-default-js-0!, isbsvc-default-js-1!, isbsvc-default-js-4! │
│ KV_xxxx-simple-pipeline-in_SOURCE_PROCESSORS │ File    │           │ 10        │ 10       │ 1.3 KiB │ 0    │ 0       │ isbsvc-default-js-0!, isbsvc-default-js-2, isbsvc-default-js-3*  │
│ xxxx-simple-pipeline-out-1                   │ File    │           │ 1         │ 100,000  │ 6.2 GiB │ 0    │ 0       │ isbsvc-default-js-0!, isbsvc-default-js-2!, isbsvc-default-js-4! │
│ xxxx-simple-pipeline-out-0                   │ File    │           │ 1         │ 100,000  │ 6.2 GiB │ 0    │ 0       │ isbsvc-default-js-0!, isbsvc-default-js-1*, isbsvc-default-js-4  │
╰──────────────────────────────────────────────────────────────────────────────┴─────────┴───────────┴───────────┴──────────┴─────────┴──────┴─────────┴──────────────────────────────────────────────────────────────────╯

In this case I have two streams of 6.2 GiB each running in filestore mode, but the memory usage on the servers was considerably high: more than both streams held completely in memory, even though we use filestore.

~ # njs server ls
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                           Server Overview                                                           │
├─────────────────────┬─────────┬──────┬─────────┬─────┬───────┬───────┬────────┬─────┬─────────┬───────┬───────┬──────┬────────┬─────┤
│ Name                │ Cluster │ Host │ Version │ JS  │ Conns │ Subs  │ Routes │ GWs │ Mem     │ CPU % │ Cores │ Slow │ Uptime │ RTT │
├─────────────────────┼─────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────┼─────┤
│ isbsvc-default-js-2 │ default │ 0    │ 2.10.18 │ yes │ 13    │ 663   │ 16     │ 0   │ 447 MiB │ 3     │ 32    │ 7    │ 41m35s │ 1ms │
│ isbsvc-default-js-3 │ default │ 0    │ 2.10.18 │ yes │ 9     │ 663   │ 16     │ 0   │ 32 MiB  │ 2     │ 32    │ 0    │ 41m33s │ 1ms │
│ isbsvc-default-js-0 │ default │ 0    │ 2.10.18 │ yes │ 0     │ 663   │ 16     │ 0   │ 16 GiB  │ 18    │ 32    │ 0    │ 4m20s  │ 1ms │
│ isbsvc-default-js-4 │ default │ 0    │ 2.10.18 │ yes │ 0     │ 663   │ 16     │ 0   │ 16 GiB  │ 100   │ 32    │ 0    │ 2m49s  │ 2ms │
│ isbsvc-default-js-1 │ default │ 0    │ 2.10.18 │ yes │ 12    │ 663   │ 16     │ 0   │ 434 MiB │ 5     │ 32    │ 9    │ 41m31s │ 2ms │
├─────────────────────┼─────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────┼─────┤
│                     │ 1       │ 5    │         │ 5   │ 34    │ 3,315 │        │     │ 33 GiB  │       │       │ 16   │        │     │
╰─────────────────────┴─────────┴──────┴─────────┴─────┴───────┴───────┴────────┴─────┴─────────┴───────┴───────┴──────┴────────┴─────╯

Here are the memory profiles for the same: mem_prof.zip

@derekcollison @neilalexander

neilalexander commented 2 months ago

This looks like a build-up of Raft append entries. How are you publishing messages? Are you using core NATS publishes or JetStream publishes?

kohlisid commented 2 months ago

@neilalexander We use the JetStream publish:

// PublishMsg publishes a Msg to JetStream.
PublishMsg(m *Msg, opts ...PubOpt) (*PubAck, error)

https://github.com/nats-io/nats.go/blob/main/js.go#L45
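
For reference, the publish path is roughly the following (a minimal sketch; the URL, subject, and payload are placeholders):

// Sketch only: synchronous JetStream publish, which blocks until the
// stream leader returns a PubAck (not a fire-and-forget core NATS publish).
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://isbsvc-default-js-svc:4222") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	msg := &nats.Msg{
		Subject: "default-simple-pipeline-out-0",
		Data:    make([]byte, 60*1024), // placeholder payload (~60 KB average in our test)
	}

	ack, err := js.PublishMsg(msg)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("stored in stream %s at seq %d", ack.Stream, ack.Sequence)
}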

kohlisid commented 2 months ago

I see a buildup during writeMsgRecord as well: mem_prof_2.zip @neilalexander @derekcollison

kohlisid commented 1 month ago

@derekcollison @neilalexander Any pointers on the above?

neilalexander commented 1 month ago

Max Payload: 64 MiB

What is the actual size of the messages you are publishing to the stream?

This max payload setting is abnormally high (our default is 1MB), and it looks like the allocations in the above profile are due to storing very large messages. I would recommend you bring the max payload back down.

If you need to store large volumes of data, you may be better served by the object store instead, which chunks larger blobs down into smaller messages.
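
Something along these lines (a sketch; the bucket and object names are placeholders), where the client splits each object into chunks well below the max payload:

// Sketch only: store and retrieve a large blob via the JetStream object
// store, which chunks it into smaller messages on the client side.
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	obs, err := js.CreateObjectStore(&nats.ObjectStoreConfig{
		Bucket:  "large-payloads", // placeholder bucket name
		Storage: nats.FileStorage,
	})
	if err != nil {
		log.Fatal(err)
	}

	blob := make([]byte, 5*1024*1024) // e.g. a 5 MB payload
	if _, err := obs.PutBytes("payload-0001", blob); err != nil {
		log.Fatal(err)
	}

	data, err := obs.GetBytes("payload-0001")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("retrieved %d bytes", len(data))
}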

kohlisid commented 1 month ago

@neilalexander Even though the max payload size is set at 64 MiB, the average payload size is 60 KB.

│ xxxx-simple-pipeline-out-1                   │ File    │           │ 1         │ 100,000  │ 6.2 GiB │ 0    │ 0       │ isbsvc-default-js-0!, isbsvc-default-js-2!, isbsvc-default-js-4! │
│ xxxx-simple-pipeline-out-0                   │ File    │           │ 1         │ 100,000  │ 6.2 GiB │ 0    │ 0       │ isbsvc-default-js-0!, isbsvc-default-js-1*, isbsvc-default-js-4  │

You can see that each stream has 100,000 messages with a total size of 6.2 GiB, so the calculation works out to ~60 KB per message as expected.

I could lower the max payload size, but the payload size is fixed coming from a controlled data source.

Are there any other things that could be tried?

cc @derekcollison