Have you seen this behavior at all outside of k8s, meaning if you just run the servers directly on the machine?
I'm able to reproduce the "catchup stalled" message and differing message count on my Linux laptop, with the 2.6.2 Go binaries (f7c3ac5), using the provided configs and data: https://cloud.aksdb.de/s/ecg53yRQ2C8ps3n/download/jetstream_inconsistent_211021.tar.xz. I didn't use Docker or Kubernetes.
Here's how I set up the reproduction.
# Terminal 1
$ nats-server-2.6.2 -D -c nats0.conf
# Terminal 2
$ nats-server-2.6.2 -D -c nats1.conf
# Terminal 3
$ nats-server-2.6.2 -D -c nats2.conf
# Terminal 4 (this triggers the issue)
$ nats pub core.jobs.notifications.SendPush foobar
After publishing, this is what we get.
# Terminal 1
RAFT [EEUOgTxU - C-R3F-TZm4O8z6] Finished catching up
# Terminal 2
RAFT [FFfB0Dc4 - S-R3F-bSTo304i] Finished catching up
# Terminal 3
[WRN] Got stream sequence mismatch for '$G > corejobs'
[DBG] SYSTEM - System connection closed: Internal Client
[DBG] RAFT [VpUo9efM - S-R3F-bSTo304i] Deleted
[DBG] RAFT [VpUo9efM - S-R3F-bSTo304i] AppendEntry did not match 136 37194 with 0 0
[DBG] RAFT [VpUo9efM - S-R3F-bSTo304i] Truncating and repairing WAL
[DBG] RAFT [VpUo9efM - S-R3F-bSTo304i] Canceling catchup subscription since we are now up to date
[DBG] RAFT [VpUo9efM - S-R3F-bSTo304i] Pausing apply channel
[DBG] RAFT [VpUo9efM - S-R3F-bSTo304i] Paused, not applying 37194
[WRN] Catchup for stream '$G > corejobs' stalled
There is also this line in the nats2 logs that seems suspicious. It's printed before the nats pub is executed. Maybe there was an issue with a previous operation that is causing an issue with the nats pub. 🤔
[DBG] JetStream cluster failed to delete msg 96351 from stream "corejobs" for account "": stream store EOF
This is the full log for nats2: https://gist.github.com/variadico/cf7ef670ec642d160cb6b270ff59a85e
Have you seen this behavior at all outside of k8s, meaning if you just run the servers directly on the machine?
@derekcollison I wasn't able to reproduce it like that locally. I also haven't seen this on our staging or dev k8s clusters. On our second k8s production cluster I only saw it once right after a rolling update.
So there's no question that something in our cluster is problematic. I suspect I/O spikes. But I also think that no matter how ugly the circumstances, JetStream should never linger in a state where it thinks it's healthy but actually isn't. Furthermore, it should be able to recover from such a state. (I assume detection and repair go hand in hand here.)
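For the detection part, here's a rough sketch of what I mean by an external probe, using the nats.go client: it polls the stream's cluster info and logs any replica that is offline, not current, or lagging. The server URL, the lag threshold, and the polling interval are assumptions on my side, not something from our setup.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Assumed local server URL; point this at any node of the cluster.
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	for {
		si, err := js.StreamInfo("corejobs")
		if err != nil {
			log.Printf("stream info failed: %v", err)
		} else if si.Cluster != nil {
			for _, p := range si.Cluster.Replicas {
				// A replica that is offline, not current, or far behind the
				// leader is a candidate for the "looks healthy but isn't" state.
				// The lag threshold of 1000 is arbitrary.
				if p.Offline || !p.Current || p.Lag > 1000 {
					log.Printf("replica %s: current=%v offline=%v lag=%d active=%s",
						p.Name, p.Current, p.Offline, p.Lag, p.Active)
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```

Of course this only observes what the servers report about each other; it is not a substitute for JetStream itself detecting and repairing the bad state.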
@variadico Thank you very much for reproducing it and narrowing down the problematic log lines. I was a bit overwhelmed by the debug output, since it was essentially an endless tracing of the chatter between the nodes. So thanks again for doing what I couldn't.
I totally agree, just trying to do root cause analysis and to get a repeatable problem.
Two things make the cluster in question special (in comparison to the other clusters): there is constant write load on the stream, and the storage occasionally shows weird I/O behavior (latency spikes).
I tried to simulate both locally. Permanently writing to the stream is easy, but producing the weird I/O behavior is not. I tried using slowfs, but apart from temporary stream outages, I never got it into a state where it didn't recover. Which, IMO, is also how it should be: that shitty conditions occasionally make brokers and databases unreliable is unavoidable.
Of course I would still like to get to the point where I can reliably force JetStream into this corner where an inconsistent state occurs; but I hope the "capture" of such a state is at least better than nothing and helps root out an edge case where the stream gets, and stays, "bad".
Agree we need to deal with any bad conditions that happen, and we will.
Seems most folks who do not run inside of Docker or K8S do not see any of these issues.
Interesting. Maybe there is a relation to my other ticket #2586. Next week I'll try to reproduce it with slow I/O and some cgroup limits (especially on CPU). Maybe I/O isn't even the problem and it's just CPU contention? We'll see - maybe :-)
How do you stop and upgrade servers in your setup?
Can you share the docker compose (or equivalent)?
In our cluster I use the official Helm chart. So a rollout is a normal rolling shutdown / wait / start / wait-for-ready / continue cycle.
The state I attached to the ticket didn't involve a restart / update though.
nats-cluster-0 [6] 2021/10/21 10:15:06.458597 [WRN] JetStream cluster consumer '$G > corejobs > notifications-SendPush-worker' has NO quorum, stalled.
nats-cluster-0 [6] 2021/10/21 10:15:06.711453 [WRN] JetStream cluster consumer '$G > corejobs > news-ResendNotification-worker' has NO quorum, stalled.
nats-cluster-0 [6] 2021/10/21 10:15:06.839319 [WRN] JetStream cluster consumer '$G > corejobs > chat-SendMessageNotification-worker' has NO quorum, stalled.
nats-cluster-0 [6] 2021/10/21 10:15:06.894058 [WRN] JetStream cluster consumer '$G > corejobs > test-LoadTest-worker' has NO quorum, stalled.
nats-cluster-0 [6] 2021/10/21 10:15:09.717079 [WRN] JetStream cluster consumer '$G > corejobs > notifications-ProcessBucket-worker' has NO quorum, stalled.
nats-cluster-0 [6] 2021/10/21 10:15:12.330567 [WRN] 172.25.78.236:41840 - rid:52 - Readloop processing time: 8.380560422s
nats-cluster-0 [6] 2021/10/21 10:15:12.330040 [WRN] 172.25.85.191:60190 - rid:57 - Readloop processing time: 8.387265973s
nats-cluster-0 [6] 2021/10/21 10:15:12.331993 [INF] JetStream cluster new consumer leader for '$G > corejobs > chat-SendMessageNotification-worker'
nats-cluster-2 [6] 2021/10/21 10:15:12.349176 [INF] JetStream cluster new consumer leader for '$G > corejobs > news-ResendNotification-worker'
nats-cluster-2 [6] 2021/10/21 10:15:13.244992 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-SendPush-worker'
nats-cluster-2 [6] 2021/10/21 10:15:13.693212 [INF] JetStream cluster new metadata leader
nats-cluster-2 [6] 2021/10/21 10:15:14.157454 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-ProcessBucket-worker'
nats-cluster-2 [6] 2021/10/21 10:15:14.337494 [INF] JetStream cluster new consumer leader for '$G > corejobs > test-LoadTest-worker'
nats-cluster-1 [7] 2021/10/21 10:15:14.713538 [INF] JetStream cluster new consumer leader for '$G > corejobs > accessors-BranchDeletion-worker'
nats-cluster-2 [6] 2021/10/21 10:15:16.456698 [INF] JetStream cluster new stream leader for '$G > corejobs'
nats-cluster-1 [7] 2021/10/21 10:16:18.214599 [WRN] JetStream cluster consumer '$G > corejobs > accessors-BranchDeletion-worker' has NO quorum, stalled.
nats-cluster-1 [7] 2021/10/21 10:16:20.325396 [WRN] JetStream cluster consumer '$G > corejobs > notifications-SendPush-worker' has NO quorum, stalled.
nats-cluster-1 [7] 2021/10/21 10:16:19.377413 [WRN] JetStream cluster consumer '$G > corejobs > test-LoadTest-worker' has NO quorum, stalled.
nats-cluster-1 [7] 2021/10/21 10:16:21.187389 [WRN] 172.25.85.191:39974 - rid:40 - Readloop processing time: 6.227492094s
nats-cluster-1 [7] 2021/10/21 10:16:21.190026 [WRN] 172.25.32.54:6222 - rid:28 - Readloop processing time: 5.854337346s
nats-cluster-1 [7] 2021/10/21 10:16:21.198780 [INF] JetStream cluster new consumer leader for '$G > corejobs > accessors-BranchDeletion-worker'
nats-cluster-0 [6] 2021/10/21 10:16:22.111586 [INF] JetStream cluster new consumer leader for '$G > corejobs > test-LoadTest-worker'
nats-cluster-2 [6] 2021/10/21 10:16:22.470980 [INF] JetStream cluster new consumer leader for '$G > corejobs > news-ResendNotification-worker'
nats-cluster-1 [7] 2021/10/21 10:16:23.288197 [INF] JetStream cluster new metadata leader
nats-cluster-0 [6] 2021/10/21 10:16:23.444740 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-SendPush-worker'
nats-cluster-0 [6] 2021/10/21 10:16:23.463988 [INF] JetStream cluster new stream leader for '$G > corejobs'
nats-cluster-2 [6] 2021/10/21 10:16:23.803501 [INF] JetStream cluster new consumer leader for '$G > corejobs > chat-SendMessageNotification-worker'
nats-cluster-0 [6] 2021/10/21 10:16:24.154671 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-ProcessBucket-worker'
nats-cluster-1 [7] 2021/10/21 10:52:20.376785 [WRN] 172.25.32.54:6222 - rid:28 - Readloop processing time: 2.810676199s
nats-cluster-0 [6] 2021/10/21 10:52:22.092391 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-ProcessBucket-worker'
nats-cluster-0 [6] 2021/10/21 10:52:22.679432 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-SendPush-worker'
nats-cluster-1 [7] 2021/10/21 10:52:22.993683 [INF] JetStream cluster new consumer leader for '$G > corejobs > test-LoadTest-worker'
nats-cluster-0 [6] 2021/10/21 10:52:23.324343 [INF] JetStream cluster new stream leader for '$G > corejobs'
nats-cluster-2 [6] 2021/10/21 12:12:10.694680 [WRN] 172.25.78.236:6222 - rid:28 - Readloop processing time: 2.078807609s
nats-cluster-1 [7] 2021/10/21 12:27:35.994643 [WRN] JetStream cluster consumer '$G > corejobs > test-LoadTest-worker' has NO quorum, stalled.
nats-cluster-1 [7] 2021/10/21 12:27:37.520335 [WRN] 172.25.32.54:6222 - rid:28 - Readloop processing time: 4.244688192s
nats-cluster-1 [7] 2021/10/21 12:27:37.519815 [WRN] 172.25.85.191:39974 - rid:40 - Readloop processing time: 4.244481097s
nats-cluster-1 [7] 2021/10/21 12:27:36.200588 [WRN] JetStream cluster consumer '$G > corejobs > accessors-BranchDeletion-worker' has NO quorum, stalled.
nats-cluster-0 [6] 2021/10/21 12:27:37.837675 [INF] JetStream cluster new metadata leader
nats-cluster-2 [6] 2021/10/21 12:27:38.359006 [INF] JetStream cluster new consumer leader for '$G > corejobs > test-LoadTest-worker'
nats-cluster-1 [7] 2021/10/21 12:27:38.379962 [INF] JetStream cluster new consumer leader for '$G > corejobs > accessors-BranchDeletion-worker'
nats-cluster-2 [6] 2021/10/21 12:27:39.882060 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-ProcessBucket-worker'
nats-cluster-2 [6] 2021/10/21 12:27:39.907110 [INF] JetStream cluster new consumer leader for '$G > corejobs > news-ResendNotification-worker'
nats-cluster-2 [6] 2021/10/21 12:27:40.018424 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-SendPush-worker'
nats-cluster-0 [6] 2021/10/21 12:27:42.680692 [INF] JetStream cluster new consumer leader for '$G > corejobs > chat-SendMessageNotification-worker'
nats-cluster-2 [6] 2021/10/21 12:56:07.861834 [WRN] JetStream cluster consumer '$G > corejobs > test-LoadTest-worker' has NO quorum, stalled.
nats-cluster-1 [7] 2021/10/21 12:56:11.474594 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-SendPush-worker'
nats-cluster-0 [6] 2021/10/21 12:56:11.476528 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-ProcessBucket-worker'
nats-cluster-2 [6] 2021/10/21 12:56:11.471118 [INF] JetStream cluster new metadata leader
nats-cluster-2 [6] 2021/10/21 12:56:11.470126 [WRN] 172.25.32.54:6222 - rid:29 - Readloop processing time: 6.251512553s
nats-cluster-2 [6] 2021/10/21 12:56:11.470015 [WRN] 172.25.78.236:6222 - rid:28 - Readloop processing time: 6.247640631s
nats-cluster-2 [6] 2021/10/21 12:56:08.018829 [WRN] JetStream cluster consumer '$G > corejobs > notifications-SendPush-worker' has NO quorum, stalled.
nats-cluster-2 [6] 2021/10/21 12:56:07.907370 [WRN] JetStream cluster consumer '$G > corejobs > news-ResendNotification-worker' has NO quorum, stalled.
nats-cluster-2 [6] 2021/10/21 12:56:07.883074 [WRN] JetStream cluster consumer '$G > corejobs > notifications-ProcessBucket-worker' has NO quorum, stalled.
nats-cluster-1 [7] 2021/10/21 12:56:13.725496 [INF] JetStream cluster new consumer leader for '$G > corejobs > test-LoadTest-worker'
nats-cluster-1 [7] 2021/10/21 12:56:13.725122 [WRN] 172.25.32.54:6222 - rid:28 - Readloop processing time: 2.244656749s
nats-cluster-2 [6] 2021/10/21 12:56:14.053563 [INF] JetStream cluster new consumer leader for '$G > corejobs > news-ResendNotification-worker'
nats-cluster-0 [6] 2021/10/21 12:56:14.328513 [INF] JetStream cluster new metadata leader
nats-cluster-1 [7] 2021/10/21 12:56:15.135272 [INF] JetStream cluster new consumer leader for '$G > corejobs > accessors-BranchDeletion-worker'
nats-cluster-1 [7] 2021/10/21 12:56:15.358747 [INF] JetStream cluster new stream leader for '$G > corejobs'
nats-cluster-0 [6] 2021/10/21 12:56:15.816220 [INF] JetStream cluster new consumer leader for '$G > corejobs > chat-SendMessageNotification-worker'
nats-cluster-2 [6] 2021/10/21 12:56:18.911421 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-ProcessBucket-worker'
nats-cluster-2 [6] 2021/10/21 12:56:20.402100 [WRN] Catchup for stream '$G > corejobs' stalled
nats-cluster-0 [6] 2021/10/21 12:56:23.074277 [INF] JetStream cluster new consumer leader for '$G > corejobs > notifications-SendPush-worker'
nats-cluster-2 [6] 2021/10/21 12:56:25.402974 [WRN] Catchup for stream '$G > corejobs' stalled
nats-cluster-2 [6] 2021/10/21 12:56:30.404120 [WRN] Catchup for stream '$G > corejobs' stalled
nats-cluster-2 [6] 2021/10/21 12:56:35.404930 [WRN] Catchup for stream '$G > corejobs' stalled
nats-cluster-2 [6] 2021/10/21 12:56:40.405942 [WRN] Catchup for stream '$G > corejobs' stalled
nats-cluster-2 [6] 2021/10/21 12:56:45.406325 [WRN] Catchup for stream '$G > corejobs' stalled
nats-cluster-2 [6] 2021/10/21 12:56:50.406662 [WRN] Catchup for stream '$G > corejobs' stalled
nats-cluster-2 [6] 2021/10/21 12:56:55.407585 [WRN] Catchup for stream '$G > corejobs' stalled
As you can see, there was a leader change (due to slowness) at 10:15 UTC. All was still good afterwards. Another leader change occurred at 10:52 UTC, then another at 12:27, and finally one at 12:56. After that the stream apparently goes out of sync, since it endlessly warns about the failing catchup.
So I heavily suspect that a leader change is part of the problem (that would also match the problem occurring after some rolling updates, since those also trigger leader changes). Obviously there's no general problem with leader changes (otherwise this would happen all the time), so there is most likely some timing-related component.
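To poke at that theory, one option is to repeatedly force leader elections while the stream is under write load. A rough sketch, assuming the nats.go client, a local server URL, the standard JetStream stream-stepdown API subject, and an arbitrary loop count and pause:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	for i := 0; i < 20; i++ {
		// Ask the current stream leader to step down; same effect as
		// `nats stream cluster step-down` for the corejobs stream.
		resp, err := nc.Request("$JS.API.STREAM.LEADER.STEPDOWN.corejobs", nil, 2*time.Second)
		if err != nil {
			log.Printf("stepdown request failed: %v", err)
		} else {
			log.Printf("stepdown response: %s", resp.Data)
		}
		// Give the cluster a moment to elect a new leader before the next kick.
		time.Sleep(10 * time.Second)
	}
}
```

The same thing can be done by hand with nats stream cluster step-down; the loop just makes it easier to hit an unlucky timing window.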
Hi guys. We have the same issue on our k8s cluster with DRBD PVCs, running a NATS cluster with 5 replicas. We also suspect I/O spikes. The NATS cluster runs without issues for days, but when there is high load on our k8s cluster, things on NATS get ugly.
I am digging into this today, will keep everyone updated.
So this one has lots of moving pieces. Been working on it for about 5 hrs. Cleaned up some stuff, so that is good. I have another test to write to make sure the fix works. However, at the end of it, in the setup you sent, the metadata snapshot seems incomplete, meaning the stream assignment is missing core data that is needed to re-sync any out-of-date replicas.
Will continue to probe and see if we can try to recover. Did you notice any serious errors from any of the servers complaining about failed writes etc?
ok I found the source of the bug. Will fix that but also try to make it such that the cluster can recover from it if possible.
Meaning the bug led to a bad state and then additional corruption. I fixed the source of the bug, but will also make sure we can recover your corrupt state properly.
Did you notice any serious errors from any of the servers complaining about failed writes etc?
No. The logs I posted above are the only ones the nats-server pods emitted during that time. Also, so far the storage layer of our cluster seems reliable aside from the mentioned latency spikes. We run quite a lot of databases with much more I/O load on the same kind of storage and haven't seen anything that would indicate data loss.
Before the edit you asked:
Another question as I continue to investigate and debug, did you perform stream update actions to the corejobs stream?
That's actually interesting, because our services, when starting up, always try to update the stream (and if that fails, assume it doesn't exist and then try to add it). This is still more or less a relic from my learning-JetStream phase, when I thought I could update most of the stream settings. In practice I can't update many of them anyway, so a better pattern would likely be to check whether the stream exists (by fetching it) and only create it if that fails.
But in any case: yes, we often "update" the stream, but without effect (the settings we "update" it with stay the same).
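For reference, a minimal sketch of that check-then-create pattern with the nats.go JetStream API; the server URL, subjects, and replica count are assumptions roughly matching this issue:

```go
package main

import (
	"errors"
	"log"

	"github.com/nats-io/nats.go"
)

// ensureStream creates the stream only if it does not exist yet and leaves
// an existing stream's configuration untouched.
func ensureStream(js nats.JetStreamContext) error {
	if _, err := js.StreamInfo("corejobs"); err == nil {
		return nil // stream already exists
	} else if !errors.Is(err, nats.ErrStreamNotFound) {
		return err // transient or cluster error: don't blindly try to create
	}
	_, err := js.AddStream(&nats.StreamConfig{
		Name:     "corejobs",
		Subjects: []string{"core.jobs.>"}, // assumed subject filter
		Replicas: 3,
	})
	return err
}

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}
	if err := ensureStream(js); err != nil {
		log.Fatal(err)
	}
}
```

This way the services never issue stream update calls during normal startup.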
ok I found the source of the bug. Will fix that but also try to make it such that the cluster can recover from it if possible.
Very very cool! Thanks a lot!
I hope it's possible that NATS can "throw away" metadata of a node and rebuild from the remaining cluster members. So I guess the complicated part is for a node to discover that it actually is irreparably damaged.
I am happy with the results: the dataset you sent us, even with message expiration disabled, could be successfully repaired. And of course the original bug is fixed as well.
Should merge today and we will cut a release.
/cc @kozlovic
2.6.3 has been released with this fix. Just upgrade the server and the system will repair and stabilize; no other action items are needed from you.
Thanks again.
We seem to be having a really similar issue to this... but running on 2.10.10.
Defect
Make sure that these boxes are checked before submitting your issue -- thank you!
- Included nats-server -DV output
Versions of nats-server and affected client libraries used: nats:2.6.2-alpine
OS/Container environment: k8s
Context:
Catchup for stream '$G > corejobs' stalled
Steps or code to reproduce the issue:
1. Make sure nats-cluster-2 is not the leader (use nats stream cluster step-down if necessary).
2. nats pub core.jobs.notifications.SendPush foobar
3. Watch the logs of nats-cluster-2 (which now throws warnings).
Additionally: check the message count in the stream; when nats-cluster-2 is the leader, it differs from when nats-cluster-0 or nats-cluster-1 are leaders.
Expected result:
Either the out-of-sync node (nats-cluster-2) catches up again, or it reports itself as unhealthy.
Also: JetStream should always be able to recover from inconsistent states (especially if there is still a majority of healthy nodes around).
Actual result:
nats-cluster-2 keeps warning "Catchup for stream '$G > corejobs' stalled" and never recovers.
Additionally: depending on who becomes leader, the number of messages varies (obviously, since it's no longer syncing): Node0: 310 messages, Node1: 310 messages, Node2: 81 messages.
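For completeness, a small sketch of the publish-and-count check from the steps above, using the nats.go client; the server URL is an assumption, stream and subject names are the ones from this issue:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Publish one message, the same thing `nats pub` does in the steps above.
	if _, err := js.Publish("core.jobs.notifications.SendPush", []byte("foobar")); err != nil {
		log.Printf("publish failed: %v", err)
	}

	// Print the current leader and the stream's reported message count.
	si, err := js.StreamInfo("corejobs")
	if err != nil {
		log.Fatal(err)
	}
	leader := "unknown"
	if si.Cluster != nil {
		leader = si.Cluster.Leader
	}
	log.Printf("leader=%s msgs=%d", leader, si.State.Msgs)
}
```

Run it after moving the leader between nodes (nats stream cluster step-down) and compare the reported counts.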