nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Unrecoverable JetStream cluster #3906

Closed: matkam closed this issue 1 year ago

matkam commented 1 year ago

Defect

First reported on Slack: https://natsio.slack.com/archives/C069GSYFP/p1676329368394969


/ # nats-server -DV
[84] 2023/02/24 04:17:32.753080 [INF] Starting nats-server
[84] 2023/02/24 04:17:32.753117 [INF]   Version:  2.9.14
[84] 2023/02/24 04:17:32.753119 [INF]   Git:      [74ae59a]
[84] 2023/02/24 04:17:32.753120 [DBG]   Go build: go1.19.5
[84] 2023/02/24 04:17:32.753122 [INF]   Name:     NCL3ETWJKRUBPYCG5IT2NQC7DKKVQUZFJDZANXLDOBJNIHQ56OKPGMHN
[84] 2023/02/24 04:17:32.753128 [INF]   ID:       NCL3ETWJKRUBPYCG5IT2NQC7DKKVQUZFJDZANXLDOBJNIHQ56OKPGMHN
[84] 2023/02/24 04:17:32.753141 [DBG] Created system account: "$SYS"
[84] 2023/02/24 04:17:32.754627 [FTL] Error listening on port: 0.0.0.0:4222, "listen tcp 0.0.0.0:4222: bind: address already in use"

Versions of nats-server and affected client libraries used:

NATS server: v2.9.11 and up, tested up to v2.9.14; v2.9.10 is the last known-good release. Helm chart: 0.19.9

OS/Container environment:

Kubernetes, deployed with Helm. values.yaml configuration:

nats:
  image:
    tag: 2.9.14-alpine

  serverNamePrefix: "${region}-"

  healthcheck:
    startup:
      initialDelaySeconds: 180

  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet

  resources:
    requests:
      cpu: 3500m
      memory: 7250Mi

  limits:
    maxPings: 4
    pingInterval: 1m

  logging:
    debug: false
    trace: false

  jetstream:
    enabled: true
    memStorage:
      enabled: true
    fileStorage:
      enabled: false

cluster:
  enabled: true
  name: ${name}
  replicas: 3

gateway:
  enabled: true
  name: ${name}
  ...

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
                - nats
        topologyKey: "kubernetes.io/hostname"

In our environment, we have a NATS super cluster, with 3 NATS servers per data center. We have multiple streams, each configured with 3 replicas and 1 minute MaxAge/TTL.
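
For reference, a stream of that shape can be created from the Go client roughly as follows. This is a minimal sketch: the stream and subject names are placeholders, and memory storage is assumed because fileStorage is disabled in the Helm values above.

package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    // Connect to the local NATS server; adjust the URL for your deployment.
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Drain()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // 3 replicas and a 1-minute MaxAge, matching the setup described above.
    // "testStream" and "test.>" are placeholder names for this sketch.
    if _, err := js.AddStream(&nats.StreamConfig{
        Name:     "testStream",
        Subjects: []string{"test.>"},
        Storage:  nats.MemoryStorage,
        Replicas: 3,
        MaxAge:   time.Minute,
    }); err != nil {
        log.Fatal(err)
    }
}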

Steps or code to reproduce the issue:

We've been able to consistently reproduce the unrecoverable state using some test code (a rough sketch of the kind of workload involved is shown after the steps below).

  1. Apply this tester deployment in one of the K8s clusters where NATS is running with JetStream enabled.
  2. Scale up the deployment to ~5 replicas.
  3. Wait a few minutes (less than 5) while checking to see if nats stream ls is still working.
  4. If JetStream is still responsive, force-kill one of the NATS servers: kubectl delete pod nats-x --grace-period=0 --force
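
The linked tester deployment itself is not reproduced here. As a rough illustration only, a workload of the general kind discussed later in this thread (a memory-backed stream with many ephemeral push consumers and no flow control, plus a steady publish load) could be generated with the Go client like this; the subject names and counts are made up for the sketch, and it assumes a stream covering test.> already exists (for example, as created in the earlier sketch).

package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Drain()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Hold many ephemeral push consumers open. A plain js.Subscribe without
    // nats.EnableFlowControl() creates an ephemeral push consumer with no
    // flow control. The count here is arbitrary for the sketch.
    for i := 0; i < 150; i++ {
        if _, err := js.Subscribe("test.>", func(m *nats.Msg) {}); err != nil {
            log.Printf("subscribe: %v", err)
        }
    }

    // Publish continuously so the 1-minute MaxAge keeps the stream churning.
    for {
        if _, err := js.Publish("test.data", []byte("payload")); err != nil {
            log.Printf("publish: %v", err)
        }
        time.Sleep(time.Millisecond)
    }
}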

Expected result:

JetStream continues to work, with a few errors about a single node being down.

Actual result:

JetStream is completely down, and cannot be recovered without destroying and recreating the entire super cluster.

The NATS servers constantly print warnings like this:

[WRN] Consumer assignment for '$G > testStream:14 > 2AAcUVbn' not cleaned up, retrying

Screenshot from our NATS Grafana dashboard:

wallyqs commented 1 year ago

@matkam What are the resources for the nodes in the test? Is the nats-server limited to a maximum of 2 GB of memory? Are you seeing any MemoryPressure events in kubectl get events? What I see is an increasing number of slow consumers; if these are between the routes (rid), that might be why there is no full mesh and thus JetStream is not available.

matkam commented 1 year ago

@wallyqs Each NATS pod has almost an entire 4 vCPU / 8 GB memory VM to itself. I've updated the Helm values above with our resource requests and affinity settings. There are no MemoryPressure events.

It makes sense that there would be no full mesh, but what has changed since v2.9.10 that prevents the mesh from recovering?

derekcollison commented 1 year ago

We will get it fixed before 2.9.15 is released. Thanks for the info and the test case, much appreciated.

esemeniuc commented 1 year ago

Is this fixed now that 2.9.15 is released?

wallyqs commented 1 year ago

@esemeniuc Most of it was addressed in https://github.com/nats-io/nats-server/pull/3922, so the server now handles this scenario (50 memory streams, each with 150 ephemeral push consumers and no flow control) much better. There are still some follow-ups for when this condition is hit in k8s, for example the startup probe causing extra restarts in some scenarios during recovery. A workaround is to extend the startup period in k8s as shown below, but we will likely revisit the heuristics around the startup probe soon to avoid this:

nats:
  healthcheck:
    startup:
      enabled: true
      failureThreshold: 300