nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

NATS Deleting Recovered Stream as Orphaned #5382

Open thorntonmc opened 2 months ago

thorntonmc commented 2 months ago

Observed behavior

NATS recovered messages for a stream, but then deleted the stream as orphaned afterwards:

[1] 2024/05/02 14:25:12.795323 [INF]   Max Storage:     1000.00 GB
[1] 2024/05/02 14:25:12.795329 [INF]   Store Directory: "/nats/jetstream"
[1] 2024/05/02 14:25:12.795333 [INF] -------------------------------------------
[1] 2024/05/02 14:25:12.796142 [INF]   Starting restore for stream '$G > BR'
[1] 2024/05/02 14:25:12.822254 [INF]   Restored 771 messages for stream '$G > BR'
[1] 2024/05/02 14:25:12.822472 [INF]   Starting restore for stream '$G > LR'
[1] 2024/05/02 14:25:12.822902 [INF]   Restored 0 messages for stream '$G > LR'
[1] 2024/05/02 14:25:12.823062 [INF]   Starting restore for stream '$G > OEN'
[1] 2024/05/02 14:25:12.823418 [INF]   Restored 0 messages for stream '$G > OEN'
[1] 2024/05/02 14:25:12.823531 [INF]   Starting restore for stream '$G > OR'
[1] 2024/05/02 14:25:29.868233 [INF]   Restored 447,917,984 messages for stream '$G > OR'
[1] 2024/05/02 14:25:29.868300 [INF]   Recovering 3 consumers for stream - '$G > OEN'
[1] 2024/05/02 14:25:29.870547 [INF]   Recovering 852 consumers for stream - '$G > OR'
[1] 2024/05/02 14:25:30.201230 [INF] Starting JetStream cluster
[1] 2024/05/02 14:25:30.201246 [INF] Creating JetStream metadata controller
[1] 2024/05/02 14:25:30.201507 [INF] JetStream cluster bootstrapping
[1] 2024/05/02 14:25:30.201980 [INF] Listening for client connections on 0.0.0.0:4222
[1] 2024/05/02 14:25:30.202065 [WRN] Detected orphaned stream '$G > BR', will cleanup
[1] 2024/05/02 14:25:30.202342 [INF] Server is ready
[1] 2024/05/02 14:25:30.202457 [INF] Cluster name is gq-nats
[1] 2024/05/02 14:25:30.202537 [INF] Listening for route connections on 0.0.0.0:6222
[1] 2024/05/02 14:25:30.208864 [ERR] Error trying to connect to route (attempt 1): lookup for host "gq-nats-0.gq-nats.generic-queue.svc.cluster.local": lookup gq-nats-0.gq-nats.generic-queue.svc.cluster.local on 10.96.0.10:53: no such host
[1] 2024/05/02 14:25:30.239648 [WRN] Detected orphaned stream '$G > LR', will cleanup
[1] 2024/05/02 14:25:30.240497 [WRN] Detected orphaned stream '$G > OEN', will cleanup
[1] 2024/05/02 14:25:30.243571 [WRN] Detected orphaned stream '$G > OR', will cleanup

Expected behavior

NATS should not delete the "OR" stream, as it and its consumers were recovered:

[1] 2024/05/02 14:25:29.868233 [INF]   Restored 447,917,984 messages for stream '$G > OR'
[1] 2024/05/02 14:25:29.870547 [INF]   Recovering 852 consumers for stream - '$G > OR'

Server and client version

2.9.15

Host environment

Running as a Kubernetes statefulset.

Steps to reproduce

The "OR" stream was unavailable at the time of restart. The OR stream runs on a single node - referred to here as gq-nats-1

A series of issues with NATS began after we created a new stream that was tag-located to gq-nats-1:

2024-05-02 09:56:55 
[1] 2024/05/02 13:56:55.927633 [WRN] JetStream cluster stream '$G > BR' has NO quorum, stalled

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.538942 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.208975 [INF] JetStream Shutdown

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.069191 [INF] Initiating JetStream Shutdown...

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.039035 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: no message found

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.038791 [ERR] JetStream out of resources, will be DISABLED

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.038668 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: malformed or corrupt message

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.038632 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Critical write error: malformed or corrupt message

2024-05-02 09:56:31 
[1] 2024/05/02 13:56:31.931646 [INF] JetStream cluster new stream leader for '$G > BR'

2024-05-02 09:56:27 
[1] 2024/05/02 13:56:27.903391 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'

This then drove us to restart the service.

derekcollison commented 2 months ago

Orphaned means the server could not find any meta assignment for it from the meta layer after syncing up.

thorntonmc commented 2 months ago

Orphaned means the server could not find any meta assignment for it from the meta layer after syncing up.

Trying to understand this a bit more - here's the order of what happened:

  1. NATS restarts in a bad state
  2. The node in question comes up, sees messages and consumers from a stream called "OR", recovers them
  3. Doesn't see that stream in the meta layer, deletes the stream
  4. Later the same stream appears with 0 messages, on that same node.

If the stream doesn't exist in the meta layer after syncing up - why does the stream appear on the same node moments later?
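If it helps, this is roughly how we check where the stream is assigned after the restart; a minimal Go sketch using the stream name from the logs (the connection URL is an assumption):

// Sketch: ask the meta layer where the stream currently lives and how many
// messages it reports. The Cluster field carries leader and replica placement.
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://gq-nats.generic-queue.svc.cluster.local:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	info, err := js.StreamInfo("OR")
	if err != nil {
		// nats.ErrStreamNotFound here would mean no assignment is known.
		log.Fatal(err)
	}

	// info.Cluster is populated when JetStream is clustered.
	fmt.Printf("messages=%d leader=%s\n", info.State.Msgs, info.Cluster.Leader)
	for _, p := range info.Cluster.Replicas {
		fmt.Printf("replica=%s current=%v lag=%d\n", p.Name, p.Current, p.Lag)
	}
}

The same placement is visible from the nats CLI with nats stream info OR.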

derekcollison commented 2 months ago

Could you add some more information about "restarts in a bad state"?

thorntonmc commented 2 months ago

Could you add some more information about "restarts in a bad state"?

Here's the timeline: at 13:56 UTC a new stream is created using the NATS client, bound to that node (gq-nats-1). We then notice these logs, which appear to show the node re-detecting every consumer for the stream; this happens several hundred times:

[1] 2024/05/02 13:56:20.039173 [INF] JetStream cluster new consumer leader for '$G > OR > [redacted]

After the "new consumer" logs stop, we see these errors:

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.208975 [INF] JetStream Shutdown

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.069191 [INF] Initiating JetStream Shutdown...

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.039035 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: no message found

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.038791 [ERR] JetStream out of resources, will be DISABLED

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.038668 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: malformed or corrupt message

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.038632 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Critical write error: malformed or corrupt message

2024-05-02 09:56:31 
[1] 2024/05/02 13:56:31.931646 [INF] JetStream cluster new stream leader for '$G > BR'

2024-05-02 09:56:27 
[1] 2024/05/02 13:56:27.903391 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'

2024-05-02 09:56:27 
[1] 2024/05/02 13:56:27.402715 [WRN] Internal subscription on "$JS.API.STREAM.INFO.BR" took too long: 3.508628383s

2024-05-02 09:56:27 
[1] 2024/05/02 13:56:27.402701 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:27 
[1] 2024/05/02 13:56:27.402561 [INF] JetStream cluster new stream leader for '$G > BR'

2024-05-02 09:56:27 
[1] 2024/05/02 13:56:27.402500 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:24 
[1] 2024/05/02 13:56:24.593972 [INF] Transfer of consumer leader for '$G > BR > br_8ed157d3-8d9b-4fb0-b9bf-80008c3f176e-backup-manuell^b15101d1-01b8-480b-bd58-3ad0dd1d527b-backup-manuell-replication^prod8' to 'gq-nats-2'

2024-05-02 09:56:24 
[1] 2024/05/02 13:56:24.593812 [INF] Transfer of consumer leader for '$G > BR > br_8ed157d3-8d9b-4fb0-b9bf-80008c3f176e-backup-manuell^b15101d1-01b8-480b-bd58-3ad0dd1d527b-backup-manuell-replication^prod8' to 'gq-nats-2'

2024-05-02 09:56:24 
[1] 2024/05/02 13:56:24.363958 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:23 
[1] 2024/05/02 13:56:23.863979 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:23 
[1] 2024/05/02 13:56:23.364262 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:22 
[1] 2024/05/02 13:56:22.864354 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:22 
[1] 2024/05/02 13:56:22.363929 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:21 
[1] 2024/05/02 13:56:21.863344 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:21 
[1] 2024/05/02 13:56:21.363622 [INF] Scaling down '$G > LR' to [gq-nats-1]

Followed by repeated logging of the following:

[1] 2024/05/02 13:56:55.927633 [WRN] JetStream cluster stream '$G > BR' has NO quorum, stalled

and

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.533104 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error apply commit for 2: raft: could not load entry from WAL

At this point the node is unavailable, as are all the streams located on it, which prompted the restart of the cluster using kubectl rollout restart.

derekcollison commented 2 months ago

It says it ran out of resources and shut down JetStream. We should address that first.
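One quick way to see how close the account is to its JetStream limits is to pull account info; a minimal Go sketch (connection URL assumed):

// Sketch: compare current JetStream usage with the configured account limits,
// which is where an "out of resources" condition would surface.
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://gq-nats.generic-queue.svc.cluster.local:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	ai, err := js.AccountInfo()
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("store used:  %d bytes (limit %d)\n", ai.Store, ai.Limits.MaxStore)
	fmt.Printf("memory used: %d bytes (limit %d)\n", ai.Memory, ai.Limits.MaxMemory)
	fmt.Printf("streams:     %d (limit %d)\n", ai.Streams, ai.Limits.MaxStreams)
	fmt.Printf("consumers:   %d (limit %d)\n", ai.Consumers, ai.Limits.MaxConsumers)
}

The nats account info CLI command reports the same numbers.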