thorntonmc opened 2 months ago
Orphaned means the server could not find a meta assignment for the stream in the meta layer after syncing up.
Trying to understand this a bit more - here's the order of what happened:
If the stream doesn't exist in the meta layer after syncing up, why does the stream appear on the same node moments later?
Could you add some more information to "restarts in a bad state"?
Here's the timeline - at 13:56 UTC a new stream is created using the NATS client, bound to that node (gq-nats-1). We then notice these logs, in which the node appears to re-detect every consumer for the stream - this happens several hundred times:
[1] 2024/05/02 13:56:20.039173 [INF] JetStream cluster new consumer leader for '$G > OR > [redacted]
After the "new consumer" logs stop - we see these errors:
[1] 2024/05/02 13:56:21.363622 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:21.863344 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:22.363929 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:22.864354 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:23.364262 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:23.863979 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:24.363958 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:24.593812 [INF] Transfer of consumer leader for '$G > BR > br_8ed157d3-8d9b-4fb0-b9bf-80008c3f176e-backup-manuell^b15101d1-01b8-480b-bd58-3ad0dd1d527b-backup-manuell-replication^prod8' to 'gq-nats-2'
[1] 2024/05/02 13:56:24.593972 [INF] Transfer of consumer leader for '$G > BR > br_8ed157d3-8d9b-4fb0-b9bf-80008c3f176e-backup-manuell^b15101d1-01b8-480b-bd58-3ad0dd1d527b-backup-manuell-replication^prod8' to 'gq-nats-2'
[1] 2024/05/02 13:56:27.402500 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:27.402561 [INF] JetStream cluster new stream leader for '$G > BR'
[1] 2024/05/02 13:56:27.402701 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:27.402715 [WRN] Internal subscription on "$JS.API.STREAM.INFO.BR" took too long: 3.508628383s
[1] 2024/05/02 13:56:27.903391 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'
[1] 2024/05/02 13:56:31.931646 [INF] JetStream cluster new stream leader for '$G > BR'
[1] 2024/05/02 13:56:33.038632 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Critical write error: malformed or corrupt message
[1] 2024/05/02 13:56:33.038668 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: malformed or corrupt message
[1] 2024/05/02 13:56:33.038791 [ERR] JetStream out of resources, will be DISABLED
[1] 2024/05/02 13:56:33.039035 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: no message found
[1] 2024/05/02 13:56:33.069191 [INF] Initiating JetStream Shutdown...
[1] 2024/05/02 13:56:33.208975 [INF] JetStream Shutdown
Followed by repeated logging of the following:
[1] 2024/05/02 13:56:55.927633 [WRN] JetStream cluster stream '$G > BR' has NO quorum, stalled
and
[1] 2024/05/02 13:56:33.533104 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error apply commit for 2: raft: could not load entry from WAL
At this point the node is unavailable, as are all the streams located on it - which prompted the restart of the cluster using kubectl rollout restart.
It says it ran out of resources and shut down JetStream. We should address that first.
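For reference, the server-level JetStream storage limits are set in the `jetstream` block of the server config; if the store limits (or the underlying volume) are exhausted, the server disables JetStream with exactly this "out of resources" error. A sketch - the path and sizes below are placeholders, not values from this cluster:

```
jetstream {
  # must point at a volume with enough headroom for all stream assets
  store_dir: "/data/jetstream"
  max_memory_store: 1GB
  max_file_store: 10GB
}
```

Checking actual disk usage under `store_dir` on gq-nats-1 against these limits would confirm whether this was a genuine capacity problem or a symptom of the corrupt-WAL writes above.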
Observed behavior
NATS recovered messages from a stream, but then deleted messages afterwards
Expected behavior
NATS should not delete the "OR" stream - as it and its consumers were recovered
Server and client version
2.9.15
Host environment
Running as a Kubernetes statefulset.
Steps to reproduce
The "OR" stream was unavailable at the time of the restart. The OR stream runs on a single node, referred to here as gq-nats-1.
A series of issues with NATS began after we created a new stream that was tag-located to gq-nats-1:
This then drove us to restart the service.
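For context on "tag-located": a JetStream stream is created by publishing a JSON config to the `$JS.API.STREAM.CREATE.<name>` API subject, and the `placement.tags` field pins the stream to servers carrying matching tags. A minimal sketch of that payload - the stream name and tag mirror this report, while the subject filter is illustrative:

```python
import json

def stream_create_request(name, tags, subjects):
    """Build the subject and JSON payload for the JetStream
    stream-create API. `placement.tags` restricts which servers
    may host the stream's assets."""
    config = {
        "name": name,
        "subjects": subjects,
        "retention": "limits",
        "storage": "file",
        "num_replicas": 1,            # R1, as in this incident
        "placement": {"tags": tags},  # tag-located placement
    }
    return "$JS.API.STREAM.CREATE." + name, json.dumps(config)

subject, payload = stream_create_request("OR", ["gq-nats-1"], ["or.>"])
print(subject)
```

With R1 placement like this, the single tagged server is a hard dependency: if gq-nats-1 loses JetStream, the stream and all its consumers go with it, which matches the observed outage.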