JetStream Cluster becomes inconsistent: catchup for stream stalled

aksdb commented 3 years ago

Defect

Make sure that these boxes are checked before submitting your issue -- thank you!

[ ] Included nats-server -DV output
[x] Included a [Minimal, Complete, and Verifiable example] (https://stackoverflow.com/help/mcve)

Versions of `nats-server` and affected client libraries used:

nats:2.6.2-alpine

OS/Container environment:

k8s

Context

I don't know how this problem arises, but it happens every few hours or days at max.
The production system this is captured on has a message rate of about 1/s, with a few bursts every few hours of maybe 5/s (at least according to Grafana).
The cluster consists of three nodes, and a stream with 3 replicas.
After hours/days, one of the nodes start emitting Catchup for stream '$G > corejobs' stalled
All consumers, where this node is the leader, stop getting messages.
I dumped the jetstream data directory so you can get a look at the current state.

Steps or code to reproduce the issue:

Start the cluster with the data and config provided here.
Make sure, nats-cluster-2 is not the leader. (use nats stream cluster step-down if necessary)
Send a message to the stream: nats pub core.jobs.notifications.SendPush foobar
Observe the logs of nats-cluster-2 (which now throws warnings)

Additionally: check the message count in the stream; when nats-cluster-2 is the leader, it differs from when nats-cluster-0 or nats-cluster-1 are leaders.

Expected result:

Either:

All three nodes have the same data.
The node which can't catch up is marked faulty and steps down from all leader positions

Also: JetStream should always be able to recover from inconsistent states (especially if there is still a majority of healthy nodes around).

Actual result:

The cluster thinks it's healthy.
One node doesn't receive any data anymore.

Additionally: depending on who becomes leader, the amount of messages varies (obviously, since it's no longer synching). Node0: 310 messages, Node1: 310 messages, Node2: 81 messages

derekcollison commented 3 years ago

Have you seen this behavior at all outside of k8s, meaning if you just run the servers directly on the machine?

variadico commented 3 years ago

I'm able to reproduce the "catchup stalled" message and differing message count on my Linux laptop, with the 2.6.2 Go binaries (f7c3ac5), using the provided configs and data: https://cloud.aksdb.de/s/ecg53yRQ2C8ps3n/download/jetstream_inconsistent_211021.tar.xz. I didn't use Docker or Kubernetes.

Here's how I setup the issue.

# Terminal 1
$ nats-server-2.6.2 -D -c nats0.conf

# Terminal 2
$ nats-server-2.6.2 -D -c nats1.conf

# Terminal 3
$ nats-server-2.6.2 -D -c nats2.conf

# Terminal 4 (this triggers the issue)
$ nats pub core.jobs.notifications.SendPush foobar

After publishing, this is what we get.

# Terminal 1
RAFT [EEUOgTxU - C-R3F-TZm4O8z6] Finished catching up

# Terminal 2
RAFT [FFfB0Dc4 - S-R3F-bSTo304i] Finished catching up

# Terminal 3
[WRN] Got stream sequence mismatch for '$G > corejobs'
[DBG] SYSTEM - System connection closed: Internal Client
[DBG] RAFT [VpUo9efM - S-R3F-bSTo304i] Deleted
[DBG] RAFT [VpUo9efM - S-R3F-bSTo304i] AppendEntry did not match 136 37194 with 0 0
[DBG] RAFT [VpUo9efM - S-R3F-bSTo304i] Truncating and repairing WAL
[DBG] RAFT [VpUo9efM - S-R3F-bSTo304i] Canceling catchup subscription since we are now up to date
[DBG] RAFT [VpUo9efM - S-R3F-bSTo304i] Pausing apply channel
[DBG] RAFT [VpUo9efM - S-R3F-bSTo304i] Paused, not applying 37194
[WRN] Catchup for stream '$G > corejobs' stalled

There is also this line in nats2 logs that seems suspicious. It's printed before the nats pub is executed. Maybe there was an issue with a previous operation that is causing an issue with the nats pub. 🤔

[DBG] JetStream cluster failed to delete msg 96351 from stream "corejobs" for account "": stream store EOF

This is the full log for nats2: https://gist.github.com/variadico/cf7ef670ec642d160cb6b270ff59a85e

aksdb commented 3 years ago

Have you seen this behavior at all outside of k8s, meaning if you just run the servers directly on the machine?

@derekcollison I wasn't able to reproduce it like that locally. I also haven't seen this on our staging or dev k8s clusters. On our second k8s production cluster I only saw it once right after a rolling update.

So no question, that something in our cluster is problematic. I suspect I/O spikes. But I also think that no matter how ugly the circumstances, JetStream should never linger in a state where it thinks it's healthy but actually isn't. Further more it should be able to recover from such a state. (I assume detection and repair go hand in hand here.)

@variadico Thank you very much for reproducing it and narrowing down the problematic log lines. I was a bit overwhelmed by the debug output, since it was essentially an endless tracing of the chatter between the nodes. So thanks again for doing what I couldn't.

derekcollison commented 3 years ago

I totally agree, just trying to do root cause analysis and to get a repeatable problem.

aksdb commented 3 years ago

I totally agree, just trying to do root cause analysis and to get a repeatable problem.

Two things make that cluster in question special (in comparison to the other clusters):

there's basically always someone writing to the stream (in the other clusters it's more like bursts)
the network mounted I/O storage can have unpredictable latency spikes

I tried to simulate both locally. Permanently writing to the stream is easy, but getting weird I/O behavior not so much. I tried using slowfs, but apart from temporary stream outages, I never got it into a state where it didn't recover. Which, IMO, is also how I expected it to be. That shitty conditions cause the brokers and databases to be unreliable at times is unavoidable.

Of course I would also still like to get to the point where I can reliably force JetStream into this corner where an inconsistent state occurs; but I hope the "capture" of such a state is at least better than nothing and at least helps root out an edge case where the stream is and stays "bad".

derekcollison commented 3 years ago

Agree we need to deal with any bad conditions that happen, and we will.

Seems most folks who do not run inside of Docker or K8S do not see any of these issues.

aksdb commented 3 years ago

Agree we need to deal with any bad conditions that happen, and we will.

Seems most folks who do not run inside of Docker or K8S do not see any of these issues.

Interesting. Maybe there is a relation to my other ticket #2586. I'll try next week if I can reproduce it with slow I/O and some cgroup limits (especially CPU). Maybe I/O isn't even the problem and it's just CPU being contended? We'll see - maybe :-)

derekcollison commented 3 years ago

How do you stop and upgrade servers in your setup?

Can you share the docker compose (or equivalent)?

aksdb commented 3 years ago

In our cluster I use the official Helm chart. So a rollout is a normal rolling shutdown / wait / start / wait-for-ready / continue cycle.

The state I attached to the ticket didn't involve a restart / update though.

aksdb commented 3 years ago

nats-cluster-0  [6] 2021/10/21 10:15:06.458597 [WRN] JetStream cluster consumer '$G > corejobs > notifications-SendPush-worker' has NO quorum, stalled.
nats-cluster-0  [6] 2021/10/21 10:15:06.711453 [WRN] JetStream cluster consumer '$G > corejobs > news-ResendNotification-worker' has NO quorum, stalled.    
nats-cluster-0  [6] 2021/10/21 10:15:06.839319 [WRN] JetStream cluster consumer '$G > corejobs > chat-SendMessageNotification-worker' has NO quorum, stalled.   
nats-cluster-0  [6] 2021/10/21 10:15:06.894058 [WRN] JetStream cluster consumer '$G > corejobs > test-LoadTest-worker' has NO quorum, stalled.  
nats-cluster-0  [6] 2021/10/21 10:15:09.717079 [WRN] JetStream cluster consumer '$G > corejobs > notifications-ProcessBucket-worker' has NO quorum, stalled.    
nats-cluster-0  [6] 2021/10/21 10:15:12.330567 [WRN] 172.25.78.236:41840 - rid:52 - Readloop processing time: 8.380560422s  
nats-cluster-0  [6] 2021/10/21 10:15:12.330040 [WRN] 172.25.85.191:60190 - rid:57 - Readloop processing time: 8.387265973s
nats-cluster-0  [6] 2021/10/21 10:15:12.331993 [INF] JetStream cluster new consumer leader for '$G > corejobs > chat-SendMessageNotification-worker'    
nats-cluster-2  [6] 2021/10/21 10:15:12.349176 [INF] JetStream cluster new consumer leader for '$G > corejobs > news-ResendNotification-worker' 
nats-cluster-2  [6] 2021/10/21 10:15:13.244992 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-SendPush-worker'  
nats-cluster-2  [6] 2021/10/21 10:15:13.693212 [INF] JetStream cluster new metadata leader  
nats-cluster-2  [6] 2021/10/21 10:15:14.157454 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-ProcessBucket-worker' 
nats-cluster-2  [6] 2021/10/21 10:15:14.337494 [INF] JetStream cluster new consumer leader for '$G > corejobs > test-LoadTest-worker'   
nats-cluster-1  [7] 2021/10/21 10:15:14.713538 [INF] JetStream cluster new consumer leader for '$G > corejobs > accessors-BranchDeletion-worker'
nats-cluster-2  [6] 2021/10/21 10:15:16.456698 [INF] JetStream cluster new stream leader for '$G > corejobs'    
nats-cluster-1  [7] 2021/10/21 10:16:18.214599 [WRN] JetStream cluster consumer '$G > corejobs > accessors-BranchDeletion-worker' has NO quorum, stalled.   
nats-cluster-1  [7] 2021/10/21 10:16:20.325396 [WRN] JetStream cluster consumer '$G > corejobs > notifications-SendPush-worker' has NO quorum, stalled. 
nats-cluster-1  [7] 2021/10/21 10:16:19.377413 [WRN] JetStream cluster consumer '$G > corejobs > test-LoadTest-worker' has NO quorum, stalled.  
nats-cluster-1  [7] 2021/10/21 10:16:21.187389 [WRN] 172.25.85.191:39974 - rid:40 - Readloop processing time: 6.227492094s  
nats-cluster-1  [7] 2021/10/21 10:16:21.190026 [WRN] 172.25.32.54:6222 - rid:28 - Readloop processing time: 5.854337346s    
nats-cluster-1  [7] 2021/10/21 10:16:21.198780 [INF] JetStream cluster new consumer leader for '$G > corejobs > accessors-BranchDeletion-worker'    
nats-cluster-0  [6] 2021/10/21 10:16:22.111586 [INF] JetStream cluster new consumer leader for '$G > corejobs > test-LoadTest-worker'
nats-cluster-2  [6] 2021/10/21 10:16:22.470980 [INF] JetStream cluster new consumer leader for '$G > corejobs > news-ResendNotification-worker' 
nats-cluster-1  [7] 2021/10/21 10:16:23.288197 [INF] JetStream cluster new metadata leader  
nats-cluster-0  [6] 2021/10/21 10:16:23.444740 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-SendPush-worker'  
nats-cluster-0  [6] 2021/10/21 10:16:23.463988 [INF] JetStream cluster new stream leader for '$G > corejobs'    
nats-cluster-2  [6] 2021/10/21 10:16:23.803501 [INF] JetStream cluster new consumer leader for '$G > corejobs > chat-SendMessageNotification-worker'    
nats-cluster-0  [6] 2021/10/21 10:16:24.154671 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-ProcessBucket-worker' 
nats-cluster-1  [7] 2021/10/21 10:52:20.376785 [WRN] 172.25.32.54:6222 - rid:28 - Readloop processing time: 2.810676199s    
nats-cluster-0  [6] 2021/10/21 10:52:22.092391 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-ProcessBucket-worker'
nats-cluster-0  [6] 2021/10/21 10:52:22.679432 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-SendPush-worker'  
nats-cluster-1  [7] 2021/10/21 10:52:22.993683 [INF] JetStream cluster new consumer leader for '$G > corejobs > test-LoadTest-worker'   
nats-cluster-0  [6] 2021/10/21 10:52:23.324343 [INF] JetStream cluster new stream leader for '$G > corejobs'    
nats-cluster-2  [6] 2021/10/21 12:12:10.694680 [WRN] 172.25.78.236:6222 - rid:28 - Readloop processing time: 2.078807609s   
nats-cluster-1  [7] 2021/10/21 12:27:35.994643 [WRN] JetStream cluster consumer '$G > corejobs > test-LoadTest-worker' has NO quorum, stalled.  
nats-cluster-1  [7] 2021/10/21 12:27:37.520335 [WRN] 172.25.32.54:6222 - rid:28 - Readloop processing time: 4.244688192s    
nats-cluster-1  [7] 2021/10/21 12:27:37.519815 [WRN] 172.25.85.191:39974 - rid:40 - Readloop processing time: 4.244481097s  
nats-cluster-1  [7] 2021/10/21 12:27:36.200588 [WRN] JetStream cluster consumer '$G > corejobs > accessors-BranchDeletion-worker' has NO quorum, stalled.   
nats-cluster-0  [6] 2021/10/21 12:27:37.837675 [INF] JetStream cluster new metadata leader
nats-cluster-2  [6] 2021/10/21 12:27:38.359006 [INF] JetStream cluster new consumer leader for '$G > corejobs > test-LoadTest-worker'   
nats-cluster-1  [7] 2021/10/21 12:27:38.379962 [INF] JetStream cluster new consumer leader for '$G > corejobs > accessors-BranchDeletion-worker'    
nats-cluster-2  [6] 2021/10/21 12:27:39.882060 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-ProcessBucket-worker' 
nats-cluster-2  [6] 2021/10/21 12:27:39.907110 [INF] JetStream cluster new consumer leader for '$G > corejobs > news-ResendNotification-worker' 
nats-cluster-2  [6] 2021/10/21 12:27:40.018424 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-SendPush-worker'  
nats-cluster-0  [6] 2021/10/21 12:27:42.680692 [INF] JetStream cluster new consumer leader for '$G > corejobs > chat-SendMessageNotification-worker'    
nats-cluster-2  [6] 2021/10/21 12:56:07.861834 [WRN] JetStream cluster consumer '$G > corejobs > test-LoadTest-worker' has NO quorum, stalled.
nats-cluster-1  [7] 2021/10/21 12:56:11.474594 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-SendPush-worker'  
nats-cluster-0  [6] 2021/10/21 12:56:11.476528 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-ProcessBucket-worker' 
nats-cluster-2  [6] 2021/10/21 12:56:11.471118 [INF] JetStream cluster new metadata leader  
nats-cluster-2  [6] 2021/10/21 12:56:11.470126 [WRN] 172.25.32.54:6222 - rid:29 - Readloop processing time: 6.251512553s    
nats-cluster-2  [6] 2021/10/21 12:56:11.470015 [WRN] 172.25.78.236:6222 - rid:28 - Readloop processing time: 6.247640631s   
nats-cluster-2  [6] 2021/10/21 12:56:08.018829 [WRN] JetStream cluster consumer '$G > corejobs > notifications-SendPush-worker' has NO quorum, stalled. 
nats-cluster-2  [6] 2021/10/21 12:56:07.907370 [WRN] JetStream cluster consumer '$G > corejobs > news-ResendNotification-worker' has NO quorum, stalled.    
nats-cluster-2  [6] 2021/10/21 12:56:07.883074 [WRN] JetStream cluster consumer '$G > corejobs > notifications-ProcessBucket-worker' has NO quorum, stalled.
nats-cluster-1  [7] 2021/10/21 12:56:13.725496 [INF] JetStream cluster new consumer leader for '$G > corejobs > test-LoadTest-worker'   
nats-cluster-1  [7] 2021/10/21 12:56:13.725122 [WRN] 172.25.32.54:6222 - rid:28 - Readloop processing time: 2.244656749s    
nats-cluster-2  [6] 2021/10/21 12:56:14.053563 [INF] JetStream cluster new consumer leader for '$G > corejobs > news-ResendNotification-worker' 
nats-cluster-0  [6] 2021/10/21 12:56:14.328513 [INF] JetStream cluster new metadata leader  
nats-cluster-1  [7] 2021/10/21 12:56:15.135272 [INF] JetStream cluster new consumer leader for '$G > corejobs > accessors-BranchDeletion-worker'    
nats-cluster-1  [7] 2021/10/21 12:56:15.358747 [INF] JetStream cluster new stream leader for '$G > corejobs'    
nats-cluster-0  [6] 2021/10/21 12:56:15.816220 [INF] JetStream cluster new consumer leader for '$G > corejobs > chat-SendMessageNotification-worker'    
nats-cluster-2  [6] 2021/10/21 12:56:18.911421 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-ProcessBucket-worker' 
nats-cluster-2  [6] 2021/10/21 12:56:20.402100 [WRN] Catchup for stream '$G > corejobs' stalled
nats-cluster-0  [6] 2021/10/21 12:56:23.074277 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-SendPush-worker'  
nats-cluster-2  [6] 2021/10/21 12:56:25.402974 [WRN] Catchup for stream '$G > corejobs' stalled 
nats-cluster-2  [6] 2021/10/21 12:56:30.404120 [WRN] Catchup for stream '$G > corejobs' stalled 
nats-cluster-2  [6] 2021/10/21 12:56:35.404930 [WRN] Catchup for stream '$G > corejobs' stalled 
nats-cluster-2  [6] 2021/10/21 12:56:40.405942 [WRN] Catchup for stream '$G > corejobs' stalled 
nats-cluster-2  [6] 2021/10/21 12:56:45.406325 [WRN] Catchup for stream '$G > corejobs' stalled 
nats-cluster-2  [6] 2021/10/21 12:56:50.406662 [WRN] Catchup for stream '$G > corejobs' stalled 
nats-cluster-2  [6] 2021/10/21 12:56:55.407585 [WRN] Catchup for stream '$G > corejobs' stalled

As you can see, there was a leader change (due to slowness) at 10:15 UTC. All was still good afterwards. Another leader change occurred 10:52 UTC, then another one at 12:27 and finally at 12:56. Afterwards apparently the stream starts to be async, since it endlessly warns about the failing catchup.

So I heavily suspect that a leader change is part of the problem (that would also match with the problem occuring after some rolling updates, since those will also trigger leader changes). Obviously there's no general problem with leader changes (otherwise the problem would happen all the time), so there is pretty likely some timing related component.

tobg commented 3 years ago

Hi guys. We have the same issue on our k8s cluster with DRBD pvc's running a nats cluster with 5 replicas. We also think those issues are IO spikes. The nats cluster runs without issues for days and if there is high load on our k8s cluster things on nats gets ugly.

derekcollison commented 3 years ago

I am digging into this today, will keep everyone updated.

derekcollison commented 3 years ago

So this one has lots of moving pieces. Been working on it for about 5hrs. Cleaned up some stuff so that is good. And I have another test to write and make sure the fix works, however at the end of it in the setup you sent the metadata snapshot seems incomplete meaning that the stream assignment is missing core data that is needed to re-sync any out of date replicas.

Will continue to probe and see if we can try to recover. Did you notice any serious errors from any of the servers complaining about failed writes etc?

derekcollison commented 3 years ago

ok I found the source of the bug. Will fix that but also try to make it such that the cluster can recover from it if possible.

Meaning the bug led to a bad state and then additional corruption. I fixed the source of the bug but also will make sure we can recover your corrupt state properly.

aksdb commented 3 years ago

Did you notice any serious errors from any of the servers complaining about failed writes etc?

No. The logs I posted above are the only ones from that time that the nats-server pods emitted. Also so far the storage layer of our cluster seems reliable aside from the mentioned latency spikes. We also run quite a lot of databases with much more I/O load on the same kind of storage and didn't see anything that would indicate data loss.

Before the edit you asked:

Another question as I continue to investigate and debug, did you perform stream update actions to the corejobs stream?

That's actually interesting, because our services, when starting up, always try to update the stream (and if that fails, assume it doesn't exist, and then try to add it). This is still more or less a relict from my learning-jetstream-phase when I thought I could update most of the stream settings. Practically I can't update many of them anyway, so a better pattern would likely be to check if the stream exists (by fetching it) and only if that fails creating it.

But in any case: yes, we often "update" the stream, but without effects (so the settings we "update" them with stay the same).

ok I found the source of the bug. Will fix that but also try to make it such that the cluster can recover from it if possible.

Very very cool! Thanks a lot!

I hope it's possible that NATS can "throw away" metadata of a node and rebuild from the remaining cluster members. So I guess the complicated part is for a node to discover that it actually is irreparably damaged.

derekcollison commented 3 years ago

I am happy with the results, the dataset you sent us, even with expiration of msgs disabled, could be successfully repaired. And of course the original bug is fixed as well.

Should merge today and we will cut a release.

/cc @kozlovic

derekcollison commented 3 years ago

2.6.3 has been released with this fix. Just upgrade server and the system will repair and stabilize, no other action items needed from you.

Thanks again.

canuckasaurus commented 8 months ago

We seem to be having a really similar issue to this... but running on 2.10.10.

nats-io / nats-server