nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io

NATS R≥3 consumer stops working after all nodes are replaced #4271

Open lionello opened 1 year ago

lionello commented 1 year ago

From Slack: https://natsio.slack.com/archives/CM3T6T7JQ/p1687452948911399

Defect


https://gist.github.com/lionello/2566ffa84af8629630adde20407c5a99

Versions of nats-server and affected client libraries used:

OS/Container environment:

AWS ECS Fargate

Steps or code to reproduce the issue:

This setup has worked well for a long while. I can rotate all the nodes in the cluster (for example, when I upgrade the NATS base image) and all JS KV stores get rebalanced onto the new nodes.

Expected result:

I expect the consumer to continue to receive messages, even after all of the nodes get replaced by new ones.

Actual result:

Last week, all of a sudden the “cd” service was no longer receiving messages from the queue. The “fabric” service was publishing them, but none were being fetched. The regular (“stan”) pub/sub was working fine, and even the JS KV buckets kept working as before. The problem appeared only in our production environment: the same queue in our staging environment (with the exact same configuration, and the exact same changes and updates applied to it) kept working fine.

I was able to inspect the cluster with the nats CLI and could see that the message timer on the stream got correctly reset each time a message was sent. However, the message count stayed at 0 and the (single) “cd” subscriber never received any messages. The stream also looked healthy, balanced across all 5 nodes, and the consumer likewise reported being on all 5 nodes. I forced a leader election, but that did not fix the issue. Restarting the “cd” service, forcing it to subscribe again, also did not help. Finally, I deleted the consumer and recreated it, and from that point on messages were moving again.

I now suspect that the way I manage the peers (i.e. doing explicit peer removal) rebalances the streams correctly, but causes trouble for the durable consumers.
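For context, the rotation flow is roughly the following (a simplified sketch; in practice I drive the removal through the JetStream API as described further down, and the peer-remove subcommand shown here is my assumption of the CLI equivalent):

# launch the replacement node, then confirm it has joined the cluster
$ nats server ls
# stop the old node (terminate its ECS task), then explicitly remove it from the meta group
$ nats server cluster peer-remove <old-node-name>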

lionello commented 1 year ago

I think this might have been caused by this line: https://github.com/nats-io/nats-server/blob/main/server/jetstream_api.go#L2233

Sometimes the "server remove" API message gets ignored because it's being handled by a peer that is no longer the leader, and so the node that was stopped is still being counted as a peer (NATS keeps waiting for that node to come back online).

derekcollison commented 1 year ago

You should make sure the cluster has a meta leader, a stream leader, and a consumer leader after the peer is stopped, before you try to remove the peer, etc.

What did nats consumer info show for the consumer that was not working correctly?
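For example, something like this (a minimal sketch; the stream and consumer names are placeholders) confirms each leader exists before the removal:

# as the system user: the RAFT Meta Group table shows the meta leader
$ nats server report jetstream
# as the JetStream user: the Cluster Information section shows the stream and consumer leaders
$ nats stream info <stream>
$ nats consumer info <stream> <consumer>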

lionello commented 1 year ago

Unfortunately I hadn't captured the output of nats consumer info, but from what I recall it looked pretty healthy: 5 nodes, and one was starred. Whether all of those 5 nodes were actually active (or were nodes that had already left) I didn't check, so I have to assume that perhaps the consumer had no quorum.

derekcollison commented 1 year ago

The * meant that peer was the leader. So it had a leader, but it possibly had state that was ahead of the stream and so was waiting for the stream to catch up?

lionello commented 1 year ago

This just happened again: nats server report js shows a healthy 5-node cluster with a leader, /jsz reports "cluster_size":5, /routesz reports each node has 4 connections, etc., but none of the nats stream … commands return any data, nor does nats consumer …. It appears that the stream leaders cannot be reached: those were probably on nodes that got replaced.

So while the underlying NATS Raft cluster is healthy (5 healthy nodes in an up-to-date cluster), the JS streams and consumers are not.
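Since the nats stream … commands time out, the monitoring endpoint is another way to see where the streams think their replicas live (a sketch, assuming the default monitoring port 8222 and jq on the path; passing streams=true should include per-stream cluster/replica details in the output):

$ curl -s 'http://localhost:8222/jsz?streams=true' | jq .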

derekcollison commented 1 year ago

So a few things would be helpful, all via the nats CLI (an updated version if possible), using both the system account and the JetStream account.

As the system user..

nats server ls
nats server report jetstream
nats traffic - this will run continuously; let it run for a bit and take a screenshot if possible.

As the JetStream user..

nats stream report

lionello commented 1 year ago

I had tried nats stream report but it would consistently time out. In fact, all the nats consumer … and nats stream … subcommands timed out.

My bad for not copy-pasting the nats server report js output, but it looked clean. I didn't know about nats traffic; I'll definitely do that next time. Unfortunately I'm confident it'll happen again :(

derekcollison commented 1 year ago

Please consider upgrading to 2.10.3 (or at least 2.9.23).

We do not want these types of things to happen, so we want to work with the community and our customers to resolve any issues they are having with the system.

lionello commented 1 year ago

It happened again, in my testing sandbox (where I run NATS with 2 nodes, understanding full well that's horrible for resiliency but I wanted a minimal setup that tests clustering):

$ nats server report js
╭────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                           JetStream Summary                                            │
├─────────────────┬─────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬─────────┤
│ Server          │ Cluster │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├─────────────────┼─────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ ip-10-0-33-218* │ NATS    │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 37      │ 13      │
│ ip-10-0-101-210 │ NATS    │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
├─────────────────┼─────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│                 │         │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 37      │ 13      │
╰─────────────────┴─────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴─────────╯

╭───────────────────────────────────────────────────────────────────────╮
│                      RAFT Meta Group Information                      │
├─────────────────┬──────────┬────────┬─────────┬────────┬────────┬─────┤
│ Name            │ ID       │ Leader │ Current │ Online │ Active │ Lag │
├─────────────────┼──────────┼────────┼─────────┼────────┼────────┼─────┤
│ ip-10-0-101-210 │ UHLjQZvq │        │ true    │ true   │ 0.41s  │ 0   │
│ ip-10-0-33-218  │ nur0C9tr │ yes    │ true    │ true   │ 0.00s  │ 0   │
╰─────────────────┴──────────┴────────┴─────────┴────────┴────────┴─────╯

$ nats server ls
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                        Server Overview                                                         │
├─────────────────┬─────────┬──────┬─────────┬─────┬───────┬──────┬────────┬─────┬────────┬───────┬───────┬──────┬────────┬──────┤
│ Name            │ Cluster │ Host │ Version │ JS  │ Conns │ Subs │ Routes │ GWs │ Mem    │ CPU % │ Cores │ Slow │ Uptime │ RTT  │
├─────────────────┼─────────┼──────┼─────────┼─────┼───────┼──────┼────────┼─────┼────────┼───────┼───────┼──────┼────────┼──────┤
│ ip-10-0-33-218  │ NATS    │ 0    │ 2.10.3  │ yes │ 2     │ 133  │ 4      │ 0   │ 17 MiB │ 1     │ 2     │ 0    │ 56m57s │ 18ms │
│ ip-10-0-101-210 │ NATS    │ 0    │ 2.10.3  │ yes │ 0     │ 133  │ 4      │ 0   │ 15 MiB │ 0     │ 2     │ 0    │ 29m44s │ 18ms │
├─────────────────┼─────────┼──────┼─────────┼─────┼───────┼──────┼────────┼─────┼────────┼───────┼───────┼──────┼────────┼──────┤
│                 │ 1       │ 2    │         │ 2   │ 2     │ 266  │        │     │ 32 MiB │       │       │ 0    │        │      │
╰─────────────────┴─────────┴──────┴─────────┴─────┴───────┴──────┴────────┴─────┴────────┴───────┴───────┴──────┴────────┴──────╯

╭────────────────────────────────────────────────────────────────────────────╮
│                              Cluster Overview                              │
├─────────┬────────────┬───────────────────┬───────────────────┬─────────────┤
│ Cluster │ Node Count │ Outgoing Gateways │ Incoming Gateways │ Connections │
├─────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ NATS    │ 2          │ 0                 │ 0                 │ 2           │
├─────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│         │ 2          │ 0                 │ 0                 │ 2           │
╰─────────┴────────────┴───────────────────┴───────────────────┴─────────────╯

$ nats stream report
Obtaining Stream stats

╭───────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                           Stream Report                                           │
├───────────┬─────────┬───────────┬───────────┬──────────┬───────┬──────┬─────────┬─────────────────┤
│ Stream    │ Storage │ Placement │ Consumers │ Messages │ Bytes │ Lost │ Deleted │ Replicas        │
├───────────┼─────────┼───────────┼───────────┼──────────┼───────┼──────┼─────────┼─────────────────┤
│ KV_keys   │ Memory  │           │ 0         │ 0        │ 0 B   │ 0    │ 0       │ ip-10-0-40-130! │
│ fn-defang │ Memory  │           │ 0         │ 0        │ 0 B   │ 0    │ 0       │ ip-10-0-40-130! │
╰───────────┴─────────┴───────────┴───────────┴──────────┴───────┴──────┴─────────┴─────────────────╯

Those two streams both have R=1. The replica ip-10-0-40-130 is not part of the Raft cluster! It had been removed using $JS.API.SERVER.REMOVE. That the streams' data is gone I can sort of understand (the single replica was removed, after all), but now the streams are unusable. They cannot be removed or recreated:

$ nats stream rm KV_keys        
? Really delete Stream KV_keys Yes
nats: error: could not remove Stream: stream is offline (10118)
$ nats traffic
╭─────────────────────────────────────────────────────────────────╮
│                          Raft Traffic                           │
├──────────┬──────┬────────┬─────────────┬───────┬────────────────┤
│ Proposal │ Vote │ Append │ Remove Peer │ Reply │ Total Messages │
├──────────┼──────┼────────┼─────────────┼───────┼────────────────┤
│ 0        │ 0    │ 1      │ 0           │ 1     │ 2              │
│ 0        │ 0    │ 1      │ 0           │ 1     │ 2              │
│ 0        │ 0    │ 1      │ 0           │ 1     │ 2              │
│ 0        │ 0    │ 1      │ 0           │ 1     │ 2              │
│ 0        │ 0    │ 1      │ 0           │ 1     │ 2              │
│ 0        │ 0    │ 1      │ 0           │ 1     │ 2              │
│ 0        │ 0    │ 1      │ 0           │ 1     │ 2              │
│ 0        │ 0    │ 1      │ 0           │ 1     │ 2              │
│ 0        │ 0    │ 1      │ 0           │ 1     │ 2              │
│ 0        │ 0    │ 1      │ 0           │ 1     │ 2              │
│ 0        │ 0    │ 1      │ 0           │ 1     │ 2              │
╰──────────┴──────┴────────┴─────────────┴───────┴────────────────╯

╭─────────────────────────────────────────────────────────────────────────────────╮
│                                 Cluster Traffic                                 │
├────────────┬─────────────┬───────────────┬─────────────┬───────┬────────────────┤
│ JSA Update │ Stream Info │ Consumer Info │ Stream Sync │ Reply │ Total Messages │
├────────────┼─────────────┼───────────────┼─────────────┼───────┼────────────────┤
│ 1          │ 0           │ 0             │ 0           │ 0     │ 1              │
│ 1          │ 0           │ 0             │ 0           │ 0     │ 1              │
│ 2          │ 0           │ 0             │ 0           │ 0     │ 2              │
│ 1          │ 0           │ 0             │ 0           │ 0     │ 1              │
│ 1          │ 0           │ 0             │ 0           │ 0     │ 1              │
│ 2          │ 0           │ 0             │ 0           │ 0     │ 2              │
│ 1          │ 0           │ 0             │ 0           │ 0     │ 1              │
│ 1          │ 0           │ 0             │ 0           │ 0     │ 1              │
│ 2          │ 0           │ 0             │ 0           │ 0     │ 2              │
│ 1          │ 0           │ 0             │ 0           │ 0     │ 1              │
│ 1          │ 0           │ 0             │ 0           │ 0     │ 1              │
╰────────────┴─────────────┴───────────────┴─────────────┴───────┴────────────────╯

╭───────────────────────────────────────────────────────────────────────────╮
│                              General Traffic                              │
├──────────┬────────┬────────┬────────┬──────┬─────────────┬────────────────┤
│ Requests │ JS API │ JS ACK │ System │ Rest │ Total Bytes │ Total Messages │
├──────────┼────────┼────────┼────────┼──────┼─────────────┼────────────────┤
│ 0        │ 0      │ 0      │ 1      │ 0    │ 1.4 KiB     │ 1              │
│ 0        │ 0      │ 0      │ 0      │ 0    │ 111 B       │ 0              │
│ 0        │ 0      │ 0      │ 0      │ 0    │ 155 B       │ 0              │
│ 0        │ 0      │ 0      │ 0      │ 0    │ 111 B       │ 0              │
│ 0        │ 0      │ 0      │ 0      │ 0    │ 111 B       │ 0              │
│ 0        │ 0      │ 0      │ 0      │ 0    │ 155 B       │ 0              │
│ 0        │ 0      │ 0      │ 0      │ 0    │ 111 B       │ 0              │
│ 0        │ 0      │ 0      │ 0      │ 0    │ 111 B       │ 0              │
│ 0        │ 0      │ 0      │ 0      │ 0    │ 155 B       │ 0              │
│ 0        │ 0      │ 0      │ 2      │ 0    │ 1.2 KiB     │ 2              │
│ 0        │ 0      │ 0      │ 0      │ 0    │ 111 B       │ 0              │
╰──────────┴────────┴────────┴────────┴──────┴─────────────┴────────────────╯

lionello commented 1 year ago

Oh, I noticed that I send $JS.API.SERVER.REMOVE as a JS user, not as the system user. It does result in the peer being removed from the Raft cluster (bug?), but could it perhaps explain why the JS stream still lists "removed" peers as replicas?

(Edited: I had the users flipped)
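For reference, the removal is issued roughly like this (a minimal sketch; the server name is a placeholder and the payload field name "peer" is from memory, so it may not be exact):

$ nats req '$JS.API.SERVER.REMOVE' '{"peer":"<server-name>"}'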

lionello commented 1 year ago

By the way, this was all done using the vanilla v2.10.3 Docker image.

lionello commented 1 year ago

This issue seems related, but it suggests that the problem has been fixed. I'm still seeing it happen with 2.10.3, though: https://github.com/nats-io/nats-server/issues/4396

derekcollison commented 1 year ago

If you have a full list of steps to reproduce, that would be helpful. Otherwise we would have to find some time to jump on a call and diagnose the issue.

lionello commented 1 year ago

I've been working on a repro, and it does happen, but only in 1 out of 20 or so runs :( Happy to walk you through it over a video chat.