@matkam what are the resources for the nodes in the test? Is the nats-server limited to a maximum of 2 GB of memory? Are you seeing any MemoryPressure events in kubectl get events in your case? What I see is an increasing number of slow consumers; if these are between the routes (rid), I think that might be the reason why there is no full mesh and thus why JetStream is not available.
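One way to check whether the slow consumers are between the routes is the server's HTTP monitoring endpoint. A minimal sketch, assuming monitoring is enabled on port 8222 (the Helm chart default) and a pod named nats-0:

```sh
# Forward the monitoring port from one of the NATS pods (pod name is an assumption).
kubectl port-forward pod/nats-0 8222:8222

# In another shell, check the slow_consumers counter and inspect the routes.
curl -s http://localhost:8222/varz | grep slow_consumers
curl -s http://localhost:8222/routez
```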
@wallyqs Each NATS pod has almost the entire 4 vCPU, 8 GB memory VM to itself. I've updated the Helm values with our resource requests and affinity settings. There are no MemoryPressure events.
It makes sense that there is no full mesh, but what has changed since v2.9.10 that prevents the mesh from recovering?
We will get it fixed before 2.9.15 is released. Thanks for the info and the test case, much appreciated.
Is this fixed now that 2.9.15 is released?
@esemeniuc Most of it was addressed in https://github.com/nats-io/nats-server/pull/3922, so the server now handles this scenario much better (50 memory streams, each with 150 ephemeral push consumers and no flow control). There are still some follow-ups for when this condition is hit in k8s, such as the startup probe causing extra restarts in some scenarios during the recovery. A workaround is to extend the startup period in k8s as shown below, but we will likely revisit the heuristics around the startup probe soon to avoid this:
nats:
  healthcheck:
    startup:
      enabled: true
      failureThreshold: 300
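A minimal way to apply that override, assuming a Helm release named nats installed from the nats/nats chart (both names are placeholders; substitute your own):

```sh
# Apply the extended startup probe threshold to the existing release.
# "nats" (release) and "nats/nats" (chart) are assumptions for this example.
helm upgrade nats nats/nats -f values.yaml
```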
Defect
First reported on Slack: https://natsio.slack.com/archives/C069GSYFP/p1676329368394969
Make sure that these boxes are checked before submitting your issue -- thank you!

- nats-server -DV output

Versions of nats-server and affected client libraries used:
NATS server: v2.9.11 and up, tested up to v2.9.14. NATS v2.9.10 is the last "good" release. Helm chart: 0.19.9
OS/Container environment:
Kubernetes, deployed with helm.
values.yaml configuration:

In our environment, we have a NATS super cluster with 3 NATS servers per data center. We have multiple streams, each configured with 3 replicas and a 1 minute MaxAge/TTL.
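For illustration, a stream of the kind described above could be created with the nats CLI roughly as follows (the stream name and subjects are placeholders, and the CLI will prompt for any settings not supplied on the command line):

```sh
# Hypothetical example of one of the streams described above:
# 3 replicas and a 1 minute MaxAge. Name and subjects are placeholders.
nats stream add ORDERS \
  --subjects "orders.>" \
  --storage file \
  --replicas 3 \
  --max-age 1m \
  --retention limits \
  --discard old
```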
Steps or code to reproduce the issue:

We've been able to consistently reproduce the unrecoverable state by using some test code.

- Run the test code and confirm that nats stream ls is still working.
- kubectl delete pod nats-x --grace-period=0 --force
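After the force-delete, cluster and JetStream health can be checked with commands along these lines (the label selector is an assumption for this deployment, and the server report requires system-account credentials):

```sh
# Watch the replacement pod come back up (label selector is an assumption).
kubectl get pods -l app=nats -w

# Check stream and JetStream health from a client context.
nats stream report
nats server report jetstream
```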
Expected result:
JetStream continues to work, with a few errors about a single node being down.
Actual result:
JetStream is completely down, and cannot be recovered without destroying and recreating the entire super cluster.
NATS servers constantly printing these warnings:
Screenshot from our NATS Grafana dashboard: