JohanLindvall closed this issue 7 months ago.
What specifically breaks with both enabled?
Some things actually break when you disable the reloader, for example updating Cluster Auth.
3-node Cluster Auth change With Reloader and Checksum
3-node Cluster Auth change Without Reloader but With Checksum
I quite often get loops of

```
[1] 2024/01/24 07:47:58.786813 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
```

logged with both the reloader and the checksum annotation enabled. After disabling the reloader, these issues appear to be gone.
Let me know if more information is needed. I can provide debug logs.
What portion of the config is changing when you see those?
Also, is it just one pod logging that, or is it all of them at the same time?
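A minimal sketch of one way to check that across pods; the namespace and label selector are assumptions for a typical Helm-deployed nats release, so adjust them to your deployment:

```sh
# Sketch: see whether the healthcheck warning comes from one pod or all of
# them. Namespace and label selector are assumptions; adjust to your release.
kubectl logs -n nats -l app.kubernetes.io/name=nats \
  --prefix --tail=500 | grep "meta leader"
```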
It was almost exclusively gateway configuration changes that triggered this, for example enabling or disabling the gateway.
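A sketch of that kind of change applied via Helm; the config.gateway.enabled value path is an assumption based on the 1.x nats chart and should be verified against the chart version in use:

```sh
# Sketch of toggling the gateway through Helm values. The release name,
# chart reference, and config.gateway.enabled path are assumptions based
# on the 1.x nats chart; verify before use.
helm upgrade nats nats/nats --reuse-values \
  --set config.gateway.enabled=false
```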
That sounds right: the entire Super Cluster is part of the Raft Meta group. If the Gateway connection is severed and the Meta Leader was only reachable over that Gateway, then JetStream would lose contact with the Meta Leader.
If the majority of the servers were on the side of the Gateway that was not severed, they would re-elect a new leader; for example, if three of five Meta group servers sit on the unsevered side, that side still has quorum and can elect a new Meta Leader.
OK, thanks for the answer. This is not a bug but a misunderstanding on my part. Would the cluster(s) heal if the gateway becomes available again?
If the gateway comes back, then yes, it should reconnect. You should be able to watch what is happening by running `nats server report jetstream` on both sides of the Gateway, breaking the Gateway connection, and then adding it back; a sketch of that check follows below. Here are some more docs that may be helpful:
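A minimal sketch of the watch described above; the context names (cluster-a, cluster-b) are hypothetical and would each point at a server in one cluster:

```sh
# Sketch: watch meta-group state from both sides of the gateway while
# severing and restoring the connection. The context names are
# hypothetical; create them first with "nats context save".
nats --context cluster-a server report jetstream
nats --context cluster-b server report jetstream
```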
What version were you using?
nats:2.10.9-alpine
What environment was the server running in?
nats:2.10.9-alpine, AKS
Is this defect reproducible?
Yes, minor config changes frequently break cluster formation.
Given the capability you are leveraging, describe your expectation?
I would expect the reloader to be disabled and `configChecksumAnnotation` to be enabled, since the latter uses the built-in Kubernetes rolling-restart functionality.
Given the expectation, what is the defect you are observing?
I believe the two overlapping restart mechanisms cause cluster stability issues with respect to Raft leadership, so minor config changes cause the cluster to fail to form properly.
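A hedged sketch of the setup described above; the reloader.enabled and podTemplate.configChecksumAnnotation value paths are based on the 1.x nats Helm chart and should be verified against the chart version in use:

```sh
# Sketch: disable the sidecar config reloader and keep the checksum
# annotation, so config changes roll pods through the normal StatefulSet
# rolling update. Release name, chart reference, and both value paths are
# assumptions based on the 1.x nats chart.
helm upgrade nats nats/nats --reuse-values \
  --set reloader.enabled=false \
  --set podTemplate.configChecksumAnnotation=true
```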