lionello opened this issue 1 year ago
I think this might have been caused by this line: https://github.com/nats-io/nats-server/blob/main/server/jetstream_api.go#L2233
Sometimes the "server remove" API msg gets ignored because it's being handled by a peer that's no longer the leader, and so the node that was stopped is still being counted as a peer (NATS is waiting for that node to come back online).
You should make sure the cluster has a meta-leader, a stream leader, and a consumer leader after the peer is stopped before you try to remove the peer.
What did `nats consumer info` show for the consumer that was not working correctly?
Unfortunately I hadn't captured the output of `nats consumer info`, but from what I recall it looked pretty healthy: 5 nodes, and one was starred. Whether all of those 5 nodes were indeed active (or were nodes that had left) I didn't check, so I have to assume that perhaps the consumer had no quorum.
The `*` meant that peer was the leader. So it had a leader, but possibly it had state that was ahead of the stream and was waiting for the stream to catch up?
This just happened again: `nats server report js` shows a healthy 5-node cluster with a leader, `/jsz` reports `"cluster_size":5`, `/routesz` reports each node has 4 connections, etc., but none of the `nats stream …` commands return any data, nor does `nats consumer …`. It appears that the stream leaders cannot be reached: those were probably on nodes that got replaced.
So while the underlying NATS raft cluster is healthy (5 healthy nodes in an up-to-date cluster), the JS streams and consumers are not.
So a few things would be helpful. All via the `nats` CLI (an updated version if possible), using the system account and the JetStream account.
As the system user:
- `nats server ls`
- `nats server report jetstream`
- `nats traffic` - this will run continuously; let it run for a bit and take a screenshot if possible.
As the JetStream user:
- `nats stream report`
I had tried `nats stream report`, but it would consistently time out. In fact, all the `nats consumer …` and `nats stream …` subcommands timed out.
My bad for not copy-pasting the `nats server report js` output, but it looked clean. Didn't know about `nats traffic`; will definitely do that next time. Unfortunately I'm confident it'll happen again :(
Please consider upgrading to 2.10.3 (or at least 2.9.23).
We do not want these types of things to happen, so we want to work with the community and our customers to resolve any issues they are having with the system.
It happened again, in my testing sandbox (where I run NATS with 2 nodes, understanding full well that that's horrible for resiliency, but I wanted a minimal setup that tests clustering):
$ nats server report js
╭────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ JetStream Summary │
├─────────────────┬─────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬─────────┤
│ Server │ Cluster │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├─────────────────┼─────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ ip-10-0-33-218* │ NATS │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 37 │ 13 │
│ ip-10-0-101-210 │ NATS │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
├─────────────────┼─────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ │ │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 37 │ 13 │
╰─────────────────┴─────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴─────────╯
╭───────────────────────────────────────────────────────────────────────╮
│ RAFT Meta Group Information │
├─────────────────┬──────────┬────────┬─────────┬────────┬────────┬─────┤
│ Name │ ID │ Leader │ Current │ Online │ Active │ Lag │
├─────────────────┼──────────┼────────┼─────────┼────────┼────────┼─────┤
│ ip-10-0-101-210 │ UHLjQZvq │ │ true │ true │ 0.41s │ 0 │
│ ip-10-0-33-218 │ nur0C9tr │ yes │ true │ true │ 0.00s │ 0 │
╰─────────────────┴──────────┴────────┴─────────┴────────┴────────┴─────╯
$ nats server ls
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Server Overview │
├─────────────────┬─────────┬──────┬─────────┬─────┬───────┬──────┬────────┬─────┬────────┬───────┬───────┬──────┬────────┬──────┤
│ Name │ Cluster │ Host │ Version │ JS │ Conns │ Subs │ Routes │ GWs │ Mem │ CPU % │ Cores │ Slow │ Uptime │ RTT │
├─────────────────┼─────────┼──────┼─────────┼─────┼───────┼──────┼────────┼─────┼────────┼───────┼───────┼──────┼────────┼──────┤
│ ip-10-0-33-218 │ NATS │ 0 │ 2.10.3 │ yes │ 2 │ 133 │ 4 │ 0 │ 17 MiB │ 1 │ 2 │ 0 │ 56m57s │ 18ms │
│ ip-10-0-101-210 │ NATS │ 0 │ 2.10.3 │ yes │ 0 │ 133 │ 4 │ 0 │ 15 MiB │ 0 │ 2 │ 0 │ 29m44s │ 18ms │
├─────────────────┼─────────┼──────┼─────────┼─────┼───────┼──────┼────────┼─────┼────────┼───────┼───────┼──────┼────────┼──────┤
│ │ 1 │ 2 │ │ 2 │ 2 │ 266 │ │ │ 32 MiB │ │ │ 0 │ │ │
╰─────────────────┴─────────┴──────┴─────────┴─────┴───────┴──────┴────────┴─────┴────────┴───────┴───────┴──────┴────────┴──────╯
╭────────────────────────────────────────────────────────────────────────────╮
│ Cluster Overview │
├─────────┬────────────┬───────────────────┬───────────────────┬─────────────┤
│ Cluster │ Node Count │ Outgoing Gateways │ Incoming Gateways │ Connections │
├─────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ NATS │ 2 │ 0 │ 0 │ 2 │
├─────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ │ 2 │ 0 │ 0 │ 2 │
╰─────────┴────────────┴───────────────────┴───────────────────┴─────────────╯
$ nats stream report
Obtaining Stream stats
╭───────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Stream Report │
├───────────┬─────────┬───────────┬───────────┬──────────┬───────┬──────┬─────────┬─────────────────┤
│ Stream │ Storage │ Placement │ Consumers │ Messages │ Bytes │ Lost │ Deleted │ Replicas │
├───────────┼─────────┼───────────┼───────────┼──────────┼───────┼──────┼─────────┼─────────────────┤
│ KV_keys │ Memory │ │ 0 │ 0 │ 0 B │ 0 │ 0 │ ip-10-0-40-130! │
│ fn-defang │ Memory │ │ 0 │ 0 │ 0 B │ 0 │ 0 │ ip-10-0-40-130! │
╰───────────┴─────────┴───────────┴───────────┴──────────┴───────┴──────┴─────────┴─────────────────╯
Those two streams both have R=1. The replica `ip-10-0-40-130` is not part of the raft cluster! It had been removed using `$JS.API.SERVER.REMOVE`. That the data of the streams is gone I kind of understand (the single replica was removed, after all), but now the streams are unusable. They cannot be removed or recreated:
$ nats stream rm KV_keys
? Really delete Stream KV_keys Yes
nats: error: could not remove Stream: stream is offline (10118)
$ nats traffic
╭─────────────────────────────────────────────────────────────────╮
│ Raft Traffic │
├──────────┬──────┬────────┬─────────────┬───────┬────────────────┤
│ Proposal │ Vote │ Append │ Remove Peer │ Reply │ Total Messages │
├──────────┼──────┼────────┼─────────────┼───────┼────────────────┤
│ 0 │ 0 │ 1 │ 0 │ 1 │ 2 │
│ 0 │ 0 │ 1 │ 0 │ 1 │ 2 │
│ 0 │ 0 │ 1 │ 0 │ 1 │ 2 │
│ 0 │ 0 │ 1 │ 0 │ 1 │ 2 │
│ 0 │ 0 │ 1 │ 0 │ 1 │ 2 │
│ 0 │ 0 │ 1 │ 0 │ 1 │ 2 │
│ 0 │ 0 │ 1 │ 0 │ 1 │ 2 │
│ 0 │ 0 │ 1 │ 0 │ 1 │ 2 │
│ 0 │ 0 │ 1 │ 0 │ 1 │ 2 │
│ 0 │ 0 │ 1 │ 0 │ 1 │ 2 │
│ 0 │ 0 │ 1 │ 0 │ 1 │ 2 │
╰──────────┴──────┴────────┴─────────────┴───────┴────────────────╯
╭─────────────────────────────────────────────────────────────────────────────────╮
│ Cluster Traffic │
├────────────┬─────────────┬───────────────┬─────────────┬───────┬────────────────┤
│ JSA Update │ Stream Info │ Consumer Info │ Stream Sync │ Reply │ Total Messages │
├────────────┼─────────────┼───────────────┼─────────────┼───────┼────────────────┤
│ 1 │ 0 │ 0 │ 0 │ 0 │ 1 │
│ 1 │ 0 │ 0 │ 0 │ 0 │ 1 │
│ 2 │ 0 │ 0 │ 0 │ 0 │ 2 │
│ 1 │ 0 │ 0 │ 0 │ 0 │ 1 │
│ 1 │ 0 │ 0 │ 0 │ 0 │ 1 │
│ 2 │ 0 │ 0 │ 0 │ 0 │ 2 │
│ 1 │ 0 │ 0 │ 0 │ 0 │ 1 │
│ 1 │ 0 │ 0 │ 0 │ 0 │ 1 │
│ 2 │ 0 │ 0 │ 0 │ 0 │ 2 │
│ 1 │ 0 │ 0 │ 0 │ 0 │ 1 │
│ 1 │ 0 │ 0 │ 0 │ 0 │ 1 │
╰────────────┴─────────────┴───────────────┴─────────────┴───────┴────────────────╯
╭───────────────────────────────────────────────────────────────────────────╮
│ General Traffic │
├──────────┬────────┬────────┬────────┬──────┬─────────────┬────────────────┤
│ Requests │ JS API │ JS ACK │ System │ Rest │ Total Bytes │ Total Messages │
├──────────┼────────┼────────┼────────┼──────┼─────────────┼────────────────┤
│ 0 │ 0 │ 0 │ 1 │ 0 │ 1.4 KiB │ 1 │
│ 0 │ 0 │ 0 │ 0 │ 0 │ 111 B │ 0 │
│ 0 │ 0 │ 0 │ 0 │ 0 │ 155 B │ 0 │
│ 0 │ 0 │ 0 │ 0 │ 0 │ 111 B │ 0 │
│ 0 │ 0 │ 0 │ 0 │ 0 │ 111 B │ 0 │
│ 0 │ 0 │ 0 │ 0 │ 0 │ 155 B │ 0 │
│ 0 │ 0 │ 0 │ 0 │ 0 │ 111 B │ 0 │
│ 0 │ 0 │ 0 │ 0 │ 0 │ 111 B │ 0 │
│ 0 │ 0 │ 0 │ 0 │ 0 │ 155 B │ 0 │
│ 0 │ 0 │ 0 │ 2 │ 0 │ 1.2 KiB │ 2 │
│ 0 │ 0 │ 0 │ 0 │ 0 │ 111 B │ 0 │
╰──────────┴────────┴────────┴────────┴──────┴─────────────┴────────────────╯
Oh, I noticed that I sent `$JS.API.SERVER.REMOVE` as a JS user, not as the system user. It does result in the peer being removed from the raft cluster (bug?), but could it perhaps explain why the JS stream still has "removed" peers as replicas?
(Edited: I had the users flipped)
By the way, this was all done using the vanilla v2.10.3 docker image.
This issue seems related, but it suggests the problem has been fixed. I'm still seeing it happen with 2.10.3 though: https://github.com/nats-io/nats-server/issues/4396
If you have a full list of steps to reproduce that would be helpful. Otherwise we would have to find some time to jump on a call and diagnose the issue.
I've been working on a repro, and it does happen, but only in 1 of 20 or so runs :( Happy to walk you through it over a video chat.
From Slack: https://natsio.slack.com/archives/CM3T6T7JQ/p1687452948911399
Defect
Make sure that these boxes are checked before submitting your issue -- thank you!
`nats-server -DV` output: https://gist.github.com/lionello/2566ffa84af8629630adde20407c5a99
Versions of `nats-server` and affected client libraries used:
OS/Container environment:
AWS ECS Fargate
Steps or code to reproduce the issue:
- Old nodes are removed using `$JS.API.SERVER.REMOVE`. New nodes join the existing cluster and streams get rebalanced.
- The consumer is created with `js.PullSubscribe("cd.>", "wq", nats.AckExplicit())` and messages are fetched with `sub.Fetch(1)`.
This setup has worked well for a long while. I can rotate all the nodes in the cluster (for example, when I upgrade the NATS base image) and all JS KV stores get rebalanced onto the new nodes.
Expected result:
I expect the consumer to continue to receive messages, even after all of the nodes get replaced by new ones.
Actual result:
Last week, all of a sudden the “cd” service was no longer receiving messages from the queue. The “fabric” service was publishing them but none were being fetched. The regular (“stan”) pub/sub was working fine, and even the JS KV buckets kept working as before. This problem also only appeared in our production environment. The same queue in our staging environment (with the exact same configuration, and the exact same changes and updates being applied to it) also kept working fine.
I was able to inspect the cluster with the nats CLI and I could see that the message timer on the stream got correctly reset each time a message was sent. However, the message count was always 0, even though the (single) “cd” subscriber never got any messages. The stream also was healthy, being balanced across all 5 nodes. Same for the consumer: it reported being on all 5 nodes. I forced a leader election, but that did not fix the issue. Restarting the “cd” service, forcing it to subscribe again, also did not help. Finally, I deleted the consumer and recreated it, and from that point on messages were moving again.
I now suspect that the way I manage the peers (i.e. doing explicit peer removal) rebalances the streams correctly but causes trouble for the durable consumers.