thorntonmc closed this issue 2 months ago
Orphaned means the server could not find any meta assignment for the stream from the meta layer after syncing up.
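As a side note, here is a minimal sketch (assuming the Go client; the URL and the stream name "OR" are placeholders) of asking the cluster whether it still reports an assignment for the stream, which is roughly the question the orphan check answers from the server side:

package main

import (
    "errors"
    "fmt"
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    // Placeholder URL; point this at the cluster in question.
    nc, err := nats.Connect("nats://gq-nats-1:4222")
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // A "stream not found" response is a reasonable client-side proxy for
    // the stream having no assignment in the cluster's view.
    info, err := js.StreamInfo("OR")
    if errors.Is(err, nats.ErrStreamNotFound) {
        fmt.Println("cluster reports no assignment for stream OR")
        return
    } else if err != nil {
        log.Fatal(err)
    }
    fmt.Println("stream OR has an assignment")
    if info.Cluster != nil {
        fmt.Println("current leader:", info.Cluster.Leader)
    }
}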
Trying to understand this a bit more - here's the order of what happened:
If the stream doesn't exist in the meta layer after syncing up - why does the stream appear on the same node moments later?
Could you add some more information to "restarts in a bad state"?
Here's the timeline - at 13:56 UTC a new stream is created using the NATS client, bound to that node (gq-nats-1). We then notice these logs, which appear to show the node re-detecting every consumer for the stream - this happens several hundred times (a quick consumer-listing sketch follows the log line below):
[1] 2024/05/02 13:56:20.039173 [INF] JetStream cluster new consumer leader for '$G > OR > [redacted]
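To put that log volume in context, a minimal sketch (not from the original report; the URL and the stream name "OR" are placeholders) of enumerating the stream's consumers with the Go client, to compare against the number of "new consumer leader" lines:

package main

import (
    "fmt"
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    // Placeholder URL; point this at the node in question.
    nc, err := nats.Connect("nats://gq-nats-1:4222")
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // List the consumers the cluster currently reports for the stream.
    // The channel closes when the listing is complete (or on error).
    count := 0
    for name := range js.ConsumerNames("OR") {
        fmt.Println("consumer:", name)
        count++
    }
    fmt.Printf("stream OR reports %d consumers\n", count)
}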
After the "new consumer" logs stop - wee see these errors:
[1] 2024/05/02 13:56:33.208975 [INF] JetStream Shutdown
[1] 2024/05/02 13:56:33.069191 [INF] Initiating JetStream Shutdown...
[1] 2024/05/02 13:56:33.039035 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: no message found
[1] 2024/05/02 13:56:33.038791 [ERR] JetStream out of resources, will be DISABLED
[1] 2024/05/02 13:56:33.038668 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: malformed or corrupt message
[1] 2024/05/02 13:56:33.038632 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Critical write error: malformed or corrupt message
[1] 2024/05/02 13:56:31.931646 [INF] JetStream cluster new stream leader for '$G > BR'
[1] 2024/05/02 13:56:27.903391 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'
[1] 2024/05/02 13:56:27.402715 [WRN] Internal subscription on "$JS.API.STREAM.INFO.BR" took too long: 3.508628383s
[1] 2024/05/02 13:56:27.402701 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:27.402561 [INF] JetStream cluster new stream leader for '$G > BR'
[1] 2024/05/02 13:56:27.402500 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:24.593972 [INF] Transfer of consumer leader for '$G > BR > br_8ed157d3-8d9b-4fb0-b9bf-80008c3f176e-backup-manuell^b15101d1-01b8-480b-bd58-3ad0dd1d527b-backup-manuell-replication^prod8' to 'gq-nats-2'
[1] 2024/05/02 13:56:24.593812 [INF] Transfer of consumer leader for '$G > BR > br_8ed157d3-8d9b-4fb0-b9bf-80008c3f176e-backup-manuell^b15101d1-01b8-480b-bd58-3ad0dd1d527b-backup-manuell-replication^prod8' to 'gq-nats-2'
[1] 2024/05/02 13:56:24.363958 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:23.863979 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:23.364262 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:22.864354 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:22.363929 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:21.863344 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:21.363622 [INF] Scaling down '$G > LR' to [gq-nats-1]
Followed by repeated logging of the following:
[1] 2024/05/02 13:56:55.927633 [WRN] JetStream cluster stream '$G > BR' has NO quorum, stalled
and
[1] 2024/05/02 13:56:33.533104 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error apply commit for 2: raft: could not load entry from WAL
At this point - the node is unavailable, as are all the streams located on it - which prompts the restart of the cluster using kubectl rollout restart.
It says it ran out of resources and shut down JetStream. We should address that first.
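For reference, a minimal sketch (Go client; the URL is a placeholder) of checking JetStream usage against the account limits, which is one way to watch for the out-of-resources condition before the server disables JetStream. Note this reflects the configured JetStream limits rather than raw disk capacity, so a full volume can still bite even when these numbers look fine:

package main

import (
    "fmt"
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://gq-nats-1:4222") // placeholder URL
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Account-level JetStream usage vs. configured limits.
    // A limit of -1 means unlimited.
    ai, err := js.AccountInfo()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("memory:  %d used / %d max bytes\n", ai.Memory, ai.Limits.MaxMemory)
    fmt.Printf("storage: %d used / %d max bytes\n", ai.Store, ai.Limits.MaxStore)
    fmt.Printf("streams: %d, consumers: %d\n", ai.Streams, ai.Consumers)
}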
@derekcollison I've just run into a similar occurrence on NATS 2.9.20 with a cluster of 3, and my stream of only 1 replica getting wiped. Not sure if it's the same issue as OP though, let me know if you'd rather I create a separate issue.
tl;dr: Stream "AAAAA" was recovered but then wiped in the same process by nats-server after a restart (the underlying JetStream volume was 100% full).
Cluster: 3 nodes; the affected node is nats-1.
Streams: "AAAAA", 1 replica, located on nats-1.
Sequence of events:
- nats-1 had an outage (the JetStream volume was full).
- The nats-1 server pod was restarted (before checking anything, obviously restart typically fixes everything).
- nats-1 was restarted again, which is when the "AAAAA" stream was recovered then deleted (nats-1 is online with the cluster now); Restored 19,792,763 messages for stream '$G > AAAAA' appears in the current process logs.
Warning logs excerpt:
[52] 2024/08/19 02:26:52.911978 [WRN] Detected orphaned stream '$G > AAAAA', will cleanup
[52] 2024/08/19 02:26:53.012803 [WRN] Waiting for routing to be established...
[52] 2024/08/19 02:26:53.231558 [WRN] Detected orphaned stream '$G > BBBBB', will cleanup
[52] 2024/08/19 02:26:53.244121 [WRN] RAFT [yrzKKRBu - _meta_] Falling behind in health check, commit 3128454 != applied 0
[52] 2024/08/19 02:26:53.244132 [WRN] JetStream is not current with the meta leader
Fixed via #5767
Observed behavior
NATS recovered messages from a stream, but then deleted messages afterwards
Expected behavior
NATS should not delete the "OR" stream - as it and its consumers were recovered
Server and client version
2.9.15
Host environment
Running as a Kubernetes statefulset.
Steps to reproduce
The "OR" stream was unavailable at the time of restart. The OR stream runs on a single node - referred to here as gq-nats-1
A series of issues with NATS began after we created a new stream that was tag-located to gq-nats-1:
This then drove us to restart the service.
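For completeness, a minimal sketch of how a stream can be tag-located to a node with the Go client; the tag value, subjects, and stream name are assumptions, not the exact values used in this setup:

package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://gq-nats-1:4222") // placeholder URL
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Single-replica stream pinned to a server via a placement tag.
    // The tag value is an assumption; it must match a tag configured on
    // the target server (server_tags in its configuration).
    _, err = js.AddStream(&nats.StreamConfig{
        Name:     "NEWSTREAM", // placeholder name
        Subjects: []string{"newstream.>"},
        Replicas: 1,
        Storage:  nats.FileStorage,
        Placement: &nats.Placement{
            Tags: []string{"node:gq-nats-1"},
        },
    })
    if err != nil {
        log.Fatal(err)
    }
}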