nats-io / k8s

NATS on Kubernetes with Helm Charts
Apache License 2.0

Default values for nats enable configChecksumAnnotation and reloader. #860

Closed JohanLindvall closed 7 months ago

JohanLindvall commented 7 months ago

What version were you using?

nats:2.10.9-alpine

What environment was the server running in?

nats:2.10.9-alpine, AKS

Is this defect reproducible?

Yes. Minor config changes frequently break cluster formation.

Given the capability you are leveraging, describe your expectation?

I would expect the reloader to be disabled by default and configChecksumAnnotation to be enabled, since the latter uses the built-in Kubernetes rolling-restart functionality.

Given the expectation, what is the defect you are observing?

I believe the two restart mechanisms together cause cluster stability issues with respect to Raft leadership, and minor config changes will cause the cluster to fail to form properly.
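
For reference, this is roughly the values.yaml I would expect as the default — a minimal sketch assuming the 1.x chart layout (key paths may differ between chart versions):

```yaml
# Sketch only: key paths assume the 1.x nats chart layout.
reloader:
  enabled: false                    # no in-place config reload
podTemplate:
  configChecksumAnnotation: true    # config changes roll the StatefulSet via Kubernetes
```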

caleblloyd commented 7 months ago

What specifically breaks with both enabled?

Some things actually break when you disable the reloader, for example updating Cluster Auth.

3-node Cluster Auth change With Reloader and Checksum

  1. Reloader reloads pods 0 and 1 with the new passwords; they update their routes and start talking to each other
  2. K8s restarts pod 2; it comes up and can talk to pods 0 and 1
  3. Rolling update continues and everything works

3-node Cluster Auth change Without Reloader but With Checksum

  1. K8s restarts pod 2; it comes up, but pods 0 and 1 don't have the new Cluster Auth, so pod 2 can't form routes
  2. Pod 2's startup probe eventually fails because it can't reach the leader
  3. Crash loop backoff
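
"Cluster Auth" here means the authorization block inside the cluster config. A minimal sketch of how such a change might be made through the chart's merge passthrough (the config.cluster.merge path and the credentials below are assumed placeholders):

```yaml
config:
  cluster:
    enabled: true
    replicas: 3
    merge:
      # merged into the server's cluster {} block (assumed behaviour of the 1.x chart)
      authorization:
        user: cluster
        password: "new-cluster-password"   # placeholder; rotating this is the change discussed above
```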

JohanLindvall commented 7 months ago

I quite often get loops of

[1] 2024/01/24 07:47:58.786813 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"

logged with both the reloader and the checksum annotation enabled. After disabling the reloader, these issues appear to be gone.

Let me know if more information is needed. I can provide debug logs.

caleblloyd commented 7 months ago

What portion of the config is changing when you see those?

Also, is it just one pod logging that, or is it all of them at the same time?

JohanLindvall commented 7 months ago

It was almost exclusively gateway configuration changes that triggered this, for example enabling or disabling the gateway.
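
Roughly this kind of change, made via the chart's gateway config — a sketch only, where the merge structure, names, and URL are assumed placeholders:

```yaml
config:
  gateway:
    enabled: true          # toggling this was enough to trigger the healthcheck loops
    merge:
      name: cluster-a
      gateways:
        - name: cluster-b
          url: "nats://cluster-b.example.com:7222"
```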

caleblloyd commented 7 months ago

That sounds right. The entire Super Cluster is part of the RAFT Meta group. If the Gateway connection is severed and the Meta Leader was only accessible over the Gateway, then JetStream would lose its connection to the Meta Leader.

If the majority of the servers were on the side of the Gateway that was not severed, they would re-elect a new leader.
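
For example, with 3 servers on each side of the Gateway, the Meta group has 6 peers and needs 4 votes for quorum, so neither side can elect a Meta Leader while the Gateway is down; with a 5 + 3 split, the 5-server side still has a majority (5 of 8) and can elect a new leader.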

JohanLindvall commented 7 months ago

OK, thanks for the answer. This is not a bug but a misunderstanding on my part. Would the cluster(s) heal if the gateway becomes available again?

caleblloyd commented 7 months ago

If the gateway comes back then yes, it should reconnect. You should be able to watch what is happening by running nats server report jetstream on both sides of the Gateway, breaking the Gateway connection, and then adding it back. Here are some more docs that may be helpful:

https://docs.nats.io/running-a-nats-service/configuration/clustering/jetstream_clustering/administration#system-level