nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Server crash loop on jetstream catchup #4866

Closed tylerschoppe closed 8 months ago

tylerschoppe commented 9 months ago

Observed behavior

When recovering a replicated 300GB, ~700 million message stream over the network onto a new server and volume, catchup finishes and the server is almost immediately OOM'd, entering a crash loop. This behavior has been observed on 2.9.16, 2.9.24, and 2.10.7; upgrading was done in hopes that memory performance improvements would resolve the issue. Across all versions, the readiness probe timeout was increased, memory was set to an 8GB request with a 14GB limit, the existing volume was replaced with a clean volume for the failing server, and an additional server was added, but the same end result persisted. Specifically for version 2.10, the cluster route connection pool was expanded and compression was enabled to speed up the process, also without success.

During catchup, health check logs show Raft is falling behind and the stream is not current. No other logs are present before the restart.

[WRN] RAFT [DkfQicS9 - S-R3F-BcE6CEMM] Falling behind in health check, commit 6574900126 != applied 0
[WRN] JetStream stream '<account-key> > scada' is not current

Server configuration:

# NATS Clients Port
port: 4222

# PID file shared with configuration reloader.
pid_file: "/var/run/nats/nats.pid"

###############
#             #
# Monitoring  #
#             #
###############
http: 8222
server_name: $SERVER_NAME
###################################
#                                 #
# NATS JetStream                  #
#                                 #
###################################
jetstream {
  max_mem: 1Gi
  store_dir: /data

  max_file: 500Gi
}
###################################
#                                 #
# NATS Full Mesh Clustering Setup #
#                                 #
###################################
cluster {
  port: 6222
  name: nats

  routes = [
    nats://nats-0.nats.data-platform.svc.cluster.local:6222,
    nats://nats-1.nats.data-platform.svc.cluster.local:6222,
    nats://nats-2.nats.data-platform.svc.cluster.local:6222
  ]
  cluster_advertise: $CLUSTER_ADVERTISE

  connect_retries: 120
  pool_size: 12
  compression: {
    mode: s2_best
  }
}
lame_duck_grace_period: 10s
lame_duck_duration: 30s
##################
#                #
# Authorization  #
#                #
##################
operator: <operator-jwt>
system_account: "<system-account-key>"

resolver: {
  type: full
  dir: "/etc/nats-config/accounts/jwt"

  allow_delete: true

  interval: "2m"
}

Slack post discussing the issue for reference: https://natsio.slack.com/archives/CM3T6T7JQ/p1702059984084359

Hopefully this is enough information, but please let me know if there is any additional context I can provide. Thanks in advance!

Expected behavior

A server should be able to successfully recover a stream over the network without restarting and entering a crash loop.

Server and client version

Server versions: 2.10.7, 2.9.24, and 2.9.16. Client: 0.0.35

Host environment

NATS is running as a 3-server cluster on k8s with helm chart version 0.19.13, with pooling and compression manually applied via a config map. The NATS pods are the only workload scheduled on their nodes, which run in different availability zones.

Steps to reproduce

No response

tylerschoppe commented 9 months ago

Tried both increasing and decreasing jetstream.max_outstanding_catchup, but ultimately got the same result. It seems to be something about the transition from catchup back to streaming live again. The server is currently stuck in a restart loop in which, on each start, it recovers from disk the progress it had made before the crash loop began. Logs show:

[WRN] RAFT [DkfQicS9 - _meta_] Falling behind in health check, commit 60834 != applied 60734
[WRN] Healthcheck failed: "JetStream is not current with the meta leader"

The applied value never increases, while commit does occasionally make progress. It seems these health checks are correct and something under the hood in Raft is unable to start advancing the applied commits again.

derekcollison commented 9 months ago

With so many moving pieces we would most likely need to jump on a Zoom call to triage.

derekcollison commented 9 months ago

Does your setup have GOMEMLIMIT set?

tylerschoppe commented 9 months ago

Thanks for the response! GOMEMLIMIT was not set, but we observed the same behavior when setting it to 12GiB on the crash-looping server and on a server with a new volume. It did not appear to make a difference when monitoring pod memory consumption, as we still pegged the resource limit at 14GiB.
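
For reference, a minimal sketch of the runtime settings used for this test, assuming they are passed as environment variables to the nats-server container (the values here are illustrative):

GOMEMLIMIT=12GiB   # soft memory limit for the Go runtime
GODEBUG=gctrace=1  # print a summary line to stderr for each GC cycle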

Enabling gctrace showed immediate runaway behavior on the problematic server, while the others looked healthy.

Problematic server:

gc 10 @1.888s 3%: 0.034+67+0.012 ms clock, 0.068+0.20/33/46+0.025 ms cpu, 623->647->479 MB, 707 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 11 @2.590s 3%: 0.034+79+0.011 ms clock, 0.068+0.68/46/58+0.023 ms cpu, 862->896->662 MB, 959 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 12 @3.579s 3%: 0.033+114+0.002 ms clock, 0.067+0.45/56/81+0.005 ms cpu, 1210->1257->930 MB, 1326 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 13 @5.080s 2%: 0.037+133+0.010 ms clock, 0.074+3.8/76/101+0.021 ms cpu, 1746->1803->1313 MB, 1860 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 14 @7.195s 2%: 0.037+222+0.012 ms clock, 0.075+1.1/115/151+0.024 ms cpu, 2478->2576->1884 MB, 2626 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 15 @10.136s 2%: 0.039+308+0.036 ms clock, 0.078+1.1/160/214+0.073 ms cpu, 3536->3682->2696 MB, 3769 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 16 @14.383s 2%: 0.041+367+0.003 ms clock, 0.082+0.97/192/269+0.006 ms cpu, 5043->5213->3810 MB, 5393 MB goal, 0 MB stacks, 0 MB globals, 2 P
[51] 2023/12/12 17:21:33.238248 [WRN] RAFT [DkfQicS9 - _meta_] Falling behind in health check, commit 62401 != applied 62317
[51] 2023/12/12 17:21:33.238262 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
gc 17 @20.253s 2%: 0.038+573+0.011 ms clock, 0.076+1.6/301/367+0.023 ms cpu, 7145->7424->5447 MB, 7621 MB goal, 0 MB stacks, 0 MB globals, 2 P
[51] 2023/12/12 17:21:43.238365 [WRN] RAFT [DkfQicS9 - _meta_] Falling behind in health check, commit 62401 != applied 62317
[51] 2023/12/12 17:21:43.238382 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
gc 18 @28.911s 2%: 0.040+903+0.003 ms clock, 0.080+1.9/463/610+0.007 ms cpu, 10243->10627->7768 MB, 10895 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 19 @34.238s 2%: 0.041+988+0.012 ms clock, 0.082+2.9/499/751+0.024 ms cpu, 10645->11059->9217 MB, 11563 MB goal, 0 MB stacks, 0 MB globals, 2 P
[51] 2023/12/12 17:21:53.238181 [WRN] RAFT [DkfQicS9 - _meta_] Falling behind in health check, commit 62401 != applied 62317
[51] 2023/12/12 17:21:53.238193 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
gc 20 @37.807s 3%: 0.040+1119+0.012 ms clock, 0.080+1.8/565/768+0.025 ms cpu, 10816->11343->10211 MB, 11521 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 21 @40.252s 4%: 0.064+1146+0.010 ms clock, 0.12+181/583/725+0.020 ms cpu, 11070->11524->10737 MB, 11494 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 22 @42.131s 4%: 0.079+975+0.003 ms clock, 0.15+403/501/631+0.007 ms cpu, 11205->11471->10953 MB, 11463 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 23 @43.647s 5%: 0.064+943+0.011 ms clock, 0.12+605/479/491+0.022 ms cpu, 11261->11437->11108 MB, 11435 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 24 @44.933s 6%: 0.055+866+0.023 ms clock, 0.11+547/442/520+0.046 ms cpu, 11312->11429->11212 MB, 11427 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 25 @46.046s 7%: 0.062+817+0.011 ms clock, 0.12+709/408/380+0.023 ms cpu, 11346->11419->11273 MB, 11419 MB goal, 0 MB stacks, 0 MB globals, 2 P
[51] 2023/12/12 17:22:03.359464 [WRN] RAFT [DkfQicS9 - _meta_] Falling behind in health check, commit 62401 != applied 62317
[51] 2023/12/12 17:22:03.359486 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
gc 26 @47.036s 8%: 0.054+789+0.002 ms clock, 0.10+442/395/647+0.005 ms cpu, 11361->11410->11317 MB, 11411 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 27 @47.949s 9%: 0.045+808+0.002 ms clock, 0.090+236/410/849+0.005 ms cpu, 11376->11432->11368 MB, 11415 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 28 @48.876s 9%: 0.047+1243+0.002 ms clock, 0.094+192/636/735+0.005 ms cpu, 11395->11883->11830 MB, 11368 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 29 @50.199s 10%: 0.74+1430+0.020 ms clock, 1.4+15/727/984+0.041 ms cpu, 11830->12416->12140 MB, 11830 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 30 @51.769s 10%: 0.042+1392+0.018 ms clock, 0.084+30/704/977+0.036 ms cpu, 12140->12734->12400 MB, 12140 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 31 @53.329s 10%: 0.045+1462+0.010 ms clock, 0.090+27/740/1019+0.021 ms cpu, 12400->13040->12705 MB, 12400 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 32 @54.915s 11%: 0.060+1476+0.003 ms clock, 0.12+5.7/740/1110+0.006 ms cpu, 12705->13309->12952 MB, 12705 MB goal, 0 MB stacks, 0 MB globals, 2 P
[51] 2023/12/12 17:22:13.420795 [WRN] RAFT [DkfQicS9 - _meta_] Falling behind in health check, commit 62401 != applied 62317
[51] 2023/12/12 17:22:13.420815 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
gc 33 @56.530s 11%: 0.048+1573+0.011 ms clock, 0.096+31/799/1026+0.022 ms cpu, 12952->13630->13280 MB, 12952 MB goal, 0 MB stacks, 0 MB globals, 2 P

healthy server:

gc 530 @1191.719s 0%: 0.094+29+0.002 ms clock, 0.18+0/14/33+0.005 ms cpu, 185->185->93 MB, 193 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 531 @1194.239s 0%: 0.030+29+0.002 ms clock, 0.060+0.60/16/31+0.005 ms cpu, 179->182->93 MB, 187 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 532 @1196.139s 0%: 0.030+34+0.012 ms clock, 0.061+0.30/18/38+0.025 ms cpu, 177->179->96 MB, 186 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 533 @1198.923s 0%: 0.10+29+0.011 ms clock, 0.20+0.15/17/39+0.023 ms cpu, 184->184->94 MB, 193 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 534 @1201.180s 0%: 0.034+35+0.002 ms clock, 0.069+0.48/13/35+0.004 ms cpu, 180->182->100 MB, 189 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 535 @1203.166s 0%: 0.065+26+0.002 ms clock, 0.13+0.96/14/28+0.005 ms cpu, 192->195->97 MB, 201 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 536 @1205.201s 0%: 0.029+19+0.002 ms clock, 0.058+0.95/10/22+0.005 ms cpu, 188->188->95 MB, 196 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 537 @1207.095s 0%: 0.030+28+0.002 ms clock, 0.060+0.14/15/29+0.004 ms cpu, 183->186->93 MB, 191 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 538 @1209.422s 0%: 0.034+29+0.003 ms clock, 0.068+0.21/13/39+0.006 ms cpu, 180->181->93 MB, 187 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 539 @1212.180s 0%: 0.055+25+0.002 ms clock, 0.11+0.10/11/34+0.004 ms cpu, 179->180->93 MB, 187 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 540 @1214.583s 0%: 0.032+26+0.002 ms clock, 0.064+0.37/13/30+0.005 ms cpu, 179->181->88 MB, 187 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 541 @1216.284s 0%: 0.033+21+0.011 ms clock, 0.067+0.22/4.3/38+0.022 ms cpu, 169->169->95 MB, 177 MB goal, 0 MB stacks, 0 MB globals, 2 P
gc 542 @1219.198s 0%: 0.031+31+0.010 ms clock, 0.062+0.10/17/29+0.021 ms cpu, 183->187->95 MB, 191 MB goal, 0 MB stacks, 0 MB globals, 2 P

Profiling was enabled, and two allocation dumps taken during the crash loop show consistent growth from msgFromBuf:

go tool pprof -base allocs\(1\) allocs\(2\)
File: nats-server
Type: alloc_space
Time: Dec 12, 2023 at 10:37am (MST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 7.21GB, 98.60% of 7.32GB total
Dropped 47 nodes (cum <= 0.04GB)
Showing top 10 nodes out of 26
      flat  flat%   sum%        cum   cum%
    5.97GB 81.58% 81.58%     5.98GB 81.72%  github.com/nats-io/nats-server/v2/server.(*msgBlock).msgFromBuf
    0.36GB  4.90% 86.49%     0.36GB  4.90%  github.com/nats-io/nats-server/v2/server.glob..func7
    0.27GB  3.75% 90.24%     0.75GB 10.24%  github.com/nats-io/nats-server/v2/server.(*raft).decodeAppendEntry
    0.26GB  3.51% 93.75%     3.64GB 49.75%  github.com/nats-io/nats-server/v2/server.(*raft).applyCommit
    0.21GB  2.86% 96.61%     6.78GB 92.59%  github.com/nats-io/nats-server/v2/server.(*raft).loadEntry
    0.12GB  1.59% 98.20%     0.12GB  1.59%  github.com/nats-io/nats-server/v2/server.glob..func8
    0.03GB  0.41% 98.60%     3.67GB 50.15%  github.com/nats-io/nats-server/v2/server.(*raft).processAppendEntry
         0     0% 98.60%     7.12GB 97.37%  github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine.func1
         0     0% 98.60%     7.12GB 97.36%  github.com/nats-io/nats-server/v2/server.(*Server).startRaftNode
         0     0% 98.60%     5.82GB 79.50%  github.com/nats-io/nats-server/v2/server.(*fileStore).LoadMsg

It seems like there could be two issues at hand, or one leading to the other: the inability to start streaming again after the network recovery, and the quick crash loop once recovery fails. I agree there are a lot of moving pieces and setting up a time to chat would be super. Hopefully some of this information is helpful; please let me know if there is anything else I can provide in advance. Thank you so much!
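
For anyone following along, a sketch of one way such allocation dumps can be collected, assuming the server exposes Go's pprof endpoints on a profiling port (the port number and file names are illustrative):

# in the server config (or start nats-server with -profile 65432)
prof_port: 65432

# take two snapshots a little while apart, then diff them
curl -so allocs1 http://127.0.0.1:65432/debug/pprof/allocs
curl -so allocs2 http://127.0.0.1:65432/debug/pprof/allocs
go tool pprof -base allocs1 allocs2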

derekcollison commented 9 months ago

Can you share the memory profile with us?

tylerschoppe commented 9 months ago

I will share it with you privately via Slack to be mindful of potentially sensitive data.

tylerschoppe commented 9 months ago

Following up here for continuity. In short, this issue arose because, in the time it took to replicate the stream, enough incoming data built up that when replication completed, the previously blocked data was pulled entirely into memory, which exceeded server resources and ultimately caused the OOM. This should be resolved by https://github.com/nats-io/nats-server/pull/4895#event-11285567706. Will confirm and close once 2.10.8 is released.

ioanluca commented 8 months ago

Hi guys 👋 I gave this issue a read, thanks for investigating.

Is it possible that this could be the cause of streams not achieving quorum? (Slack thread for context: https://natsio.slack.com/archives/CM3T6T7JQ/p1704735112191599)

I'm seeing the below when doing various admin operations from the CLI or the Node.js / Go SDK:

JetStream system temporarily unavailable (10008)

I am seeing these errors on the server as well:

JetStream cluster consumer '$G > streamname > consumername' has NO quorum, stalled.

JetStream cluster stream '$G > streamname' has NO quorum, stalled

The setup is 3 GCP VMs in an instance group; the servers use config files and the nats-server binaries are started with -js.

The version is v2.9.16.

The NATS servers seem to report healthy.

derekcollison commented 8 months ago

Please upgrade to the latest server, which is now 2.10.9.

steffenmllr commented 8 months ago

@derekcollison we see this happen on prod with 2.10.9-alpine:

A NATS JetStream node comes out of sync, tries to catch up, and gets OOMKilled by Kubernetes.


Edit: Increasing the memory helped in my case.

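  # nats helm chart container values (presumably from values.yaml):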
  container:
    env:
      GOMEMLIMIT: 10GiB
    merge:
      resources:
        limits:
          cpu: "8"
          memory: 16Gi
        requests:
          cpu: "6"
          memory: 14Gi

tylerschoppe commented 8 months ago

Closing this issue, as the changes in 2.10.8 should resolve it. We recovered the system on our own so can't verify, but the changes appear to cover it. Thanks for the effort here @derekcollison

derekcollison commented 7 months ago

Please move to 2.10.9, which is the latest, if you use sources.

derekcollison commented 7 months ago

@steffenmllr if it's more important to keep memory low, you could limit the max catchup rate by setting max_outstanding_catchup, which is in bytes, in the jetstream block in the server configs.
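
A minimal sketch of that setting in the jetstream block of the server config (the 32MiB value is purely illustrative; tune it to your memory budget):

jetstream {
  store_dir: /data
  # Cap how many bytes of catchup data may be outstanding at once (32MiB shown here).
  max_outstanding_catchup: 33554432
}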

steffenmllr commented 7 months ago

@derekcollison thank you for the hint. I'll add this to our configuration.

It might be beneficial for others to include this in the docs for the helm chart, where you provide sample limits and the GOMEMLIMIT. That is, in case of recovery you can limit the memory usage with max_outstanding_catchup to prevent getting OOMKilled.

derekcollison commented 7 months ago

Agree, looping in @wallyqs and @caleblloyd.