nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0
15.95k stars 1.41k forks source link

NATS Deleting Recovered Stream as Orphaned #5382

Closed thorntonmc closed 2 months ago

thorntonmc commented 6 months ago

Observed behavior

NATS recovered messages from a stream, but then deleted messages afterwards

[1] 2024/05/02 14:25:12.795323 [INF]   Max Storage:     1000.00 GB
[1] 2024/05/02 14:25:12.795329 [INF]   Store Directory: "/nats/jetstream"
[1] 2024/05/02 14:25:12.795333 [INF] -------------------------------------------
[1] 2024/05/02 14:25:12.796142 [INF]   Starting restore for stream '$G > BR'
[1] 2024/05/02 14:25:12.822254 [INF]   Restored 771 messages for stream '$G > BR'
[1] 2024/05/02 14:25:12.822472 [INF]   Starting restore for stream '$G > LR'
[1] 2024/05/02 14:25:12.822902 [INF]   Restored 0 messages for stream '$G > LR'
[1] 2024/05/02 14:25:12.823062 [INF]   Starting restore for stream '$G > OEN'
[1] 2024/05/02 14:25:12.823418 [INF]   Restored 0 messages for stream '$G > OEN'
[1] 2024/05/02 14:25:12.823531 [INF]   Starting restore for stream '$G > OR'
[1] 2024/05/02 14:25:29.868233 [INF]   Restored 447,917,984 messages for stream '$G > OR'
[1] 2024/05/02 14:25:29.868300 [INF]   Recovering 3 consumers for stream - '$G > OEN'
[1] 2024/05/02 14:25:29.870547 [INF]   Recovering 852 consumers for stream - '$G > OR'
[1] 2024/05/02 14:25:30.201230 [INF] Starting JetStream cluster
[1] 2024/05/02 14:25:30.201246 [INF] Creating JetStream metadata controller
[1] 2024/05/02 14:25:30.201507 [INF] JetStream cluster bootstrapping
[1] 2024/05/02 14:25:30.201980 [INF] Listening for client connections on 0.0.0.0:4222
[1] 2024/05/02 14:25:30.202065 [WRN] Detected orphaned stream '$G > BR', will cleanup
[1] 2024/05/02 14:25:30.202342 [INF] Server is ready
[1] 2024/05/02 14:25:30.202457 [INF] Cluster name is gq-nats
[1] 2024/05/02 14:25:30.202537 [INF] Listening for route connections on 0.0.0.0:6222
[1] 2024/05/02 14:25:30.208864 [ERR] Error trying to connect to route (attempt 1): lookup for host "gq-nats-0.gq-nats.generic-queue.svc.cluster.local": lookup gq-nats-0.gq-nats.generic-queue.svc.cluster.local on 10.96.0.10:53: no such host
[1] 2024/05/02 14:25:30.239648 [WRN] Detected orphaned stream '$G > LR', will cleanup
[1] 2024/05/02 14:25:30.240497 [WRN] Detected orphaned stream '$G > OEN', will cleanup
[1] 2024/05/02 14:25:30.243571 [WRN] Detected orphaned stream '$G > OR', will cleanup

Expected behavior

NATS should not delete the "OR" stream - as it and its consumers were recovered

[1] 2024/05/02 14:25:29.868233 [INF]   Restored 447,917,984 messages for stream '$G > OR'
[1] 2024/05/02 14:25:29.870547 [INF]   Recovering 852 consumers for stream - '$G > OR'

Server and client version

2.9.15

Host environment

Running as a Kubernetes statefulset.

Steps to reproduce

The "OR" stream was unavailable at the time of restart. The OR stream runs on a single node - referred to here as gq-nats-1

A series of issues with NATS began after we created a new stream that was tag-located to gq-nats-1:

2024-05-02 09:56:55 
[1] 2024/05/02 13:56:55.927633 [WRN] JetStream cluster stream '$G > BR' has NO quorum, stalled

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.538942 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.208975 [INF] JetStream Shutdown

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.069191 [INF] Initiating JetStream Shutdown...

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.039035 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: no message found

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.038791 [ERR] JetStream out of resources, will be DISABLED

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.038668 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: malformed or corrupt message

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.038632 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Critical write error: malformed or corrupt message

2024-05-02 09:56:31 
[1] 2024/05/02 13:56:31.931646 [INF] JetStream cluster new stream leader for '$G > BR'

2024-05-02 09:56:27 
[1] 2024/05/02 13:56:27.903391 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'

This then drove us to restart the service.

derekcollison commented 6 months ago

Orphaned means the server could not find any meta assignment from the meta layer after synching with up.

thorntonmc commented 6 months ago

Orphaned means the server could not find any meta assignment from the meta layer after synching with up.

Trying to understand this a bit more - here's the order of what happened:

  1. NATS restarts in a bad state
  2. The node in question comes up, sees messages and consumers from a stream called "OR", recovers them
  3. Doesn't see that stream in the meta layer, deletes the stream
  4. Later the same stream appears with 0 messages, on that same node.

If the stream doesn't exist in the meta layer after syncing up - why does the stream appear on the same node moments later?

derekcollison commented 6 months ago

Could you add some more information to restarts in a bad state?

thorntonmc commented 6 months ago

Could you add some more information to restarts in a bad state?

Here's the timeline - at 13:56UTC a new stream is created using the NATS client, bound to that node (gq-nats-1). We then notice these logs with what appears to be the node re-detecting every consumer for the stream - this happens several hundreds of times:

[1] 2024/05/02 13:56:20.039173 [INF] JetStream cluster new consumer leader for '$G > OR > [redacted]

After the "new consumer" logs stop - wee see these errors:

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.208975 [INF] JetStream Shutdown

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.069191 [INF] Initiating JetStream Shutdown...

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.039035 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: no message found

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.038791 [ERR] JetStream out of resources, will be DISABLED

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.038668 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: malformed or corrupt message

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.038632 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Critical write error: malformed or corrupt message

2024-05-02 09:56:31 
[1] 2024/05/02 13:56:31.931646 [INF] JetStream cluster new stream leader for '$G > BR'

2024-05-02 09:56:27 
[1] 2024/05/02 13:56:27.903391 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'

2024-05-02 09:56:27 
[1] 2024/05/02 13:56:27.402715 [WRN] Internal subscription on "$JS.API.STREAM.INFO.BR" took too long: 3.508628383s

2024-05-02 09:56:27 
[1] 2024/05/02 13:56:27.402701 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:27 
[1] 2024/05/02 13:56:27.402561 [INF] JetStream cluster new stream leader for '$G > BR'

2024-05-02 09:56:27 
[1] 2024/05/02 13:56:27.402500 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:24 
[1] 2024/05/02 13:56:24.593972 [INF] Transfer of consumer leader for '$G > BR > br_8ed157d3-8d9b-4fb0-b9bf-80008c3f176e-backup-manuell^b15101d1-01b8-480b-bd58-3ad0dd1d527b-backup-manuell-replication^prod8' to 'gq-nats-2'

2024-05-02 09:56:24 
[1] 2024/05/02 13:56:24.593812 [INF] Transfer of consumer leader for '$G > BR > br_8ed157d3-8d9b-4fb0-b9bf-80008c3f176e-backup-manuell^b15101d1-01b8-480b-bd58-3ad0dd1d527b-backup-manuell-replication^prod8' to 'gq-nats-2'

2024-05-02 09:56:24 
[1] 2024/05/02 13:56:24.363958 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:23 
[1] 2024/05/02 13:56:23.863979 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:23 
[1] 2024/05/02 13:56:23.364262 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:22 
[1] 2024/05/02 13:56:22.864354 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:22 
[1] 2024/05/02 13:56:22.363929 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:21 
[1] 2024/05/02 13:56:21.863344 [INF] Scaling down '$G > LR' to [gq-nats-1]

2024-05-02 09:56:21 
[1] 2024/05/02 13:56:21.363622 [INF] Scaling down '$G > LR' to [gq-nats-1]

Followed by repeated logging of the following:

[1] 2024/05/02 13:56:55.927633 [WRN] JetStream cluster stream '$G > BR' has NO quorum, stalled

and

2024-05-02 09:56:33 
[1] 2024/05/02 13:56:33.533104 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error apply commit for 2: raft: could not load entry from WAL

At this point - the node is unavailable, as are all the streams located on it - which prompts the restart of the cluster using kubectl rollout restart.

derekcollison commented 6 months ago

It says it ran out of resources and shutdown JetStream. We should address that first.

VeryStrongFingers commented 2 months ago

@derekcollison I've just run into a similar occurrence on NATS 2.9.20 with a cluster of 3, and my stream of only 1 replica getting wiped. Not sure if it's the same issue as OP though, let me know if you'd rather I create a separate issue.

tl;dr: Stream "AAAAA" was recovered but then wiped in the same process by nats-server after a restart (due to underlying jetstream volume was 100% full)

Cluster:

Streams:

Sequence of events:

  1. Jetstream volume (disk) hits 100% used on nats-1 server
  2. Attempted to restart nats-1 pod (before checking anything, obviously restart typically fixes everything)
    • failed. nats-1 was still on a startup crash/loop with logs below
nats-1 logs on crash/bootloop ``` ❯ k logs nats-1 --container nats [547] 2024/08/19 02:04:44.222427 [INF] Starting nats-server [547] 2024/08/19 02:04:44.222464 [INF] Version: 2.9.20 [547] 2024/08/19 02:04:44.222467 [INF] Git: [97dd7cb] [547] 2024/08/19 02:04:44.222469 [INF] Cluster: nats [547] 2024/08/19 02:04:44.222471 [INF] Name: nats-1 [547] 2024/08/19 02:04:44.222474 [INF] Node: yrzKKRBu [547] 2024/08/19 02:04:44.222477 [INF] ID: ~ [547] 2024/08/19 02:04:44.222483 [INF] Using configuration file: /etc/nats-config/nats.conf [547] 2024/08/19 02:04:44.223363 [INF] Starting http monitor on 0.0.0.0:8222 [547] 2024/08/19 02:04:44.223397 [INF] Starting JetStream [547] 2024/08/19 02:04:44.223506 [INF] _ ___ _____ ___ _____ ___ ___ _ __ __ [547] 2024/08/19 02:04:44.223508 [INF] _ | | __|_ _/ __|_ _| _ \ __| /_\ | \/ | [547] 2024/08/19 02:04:44.223510 [INF] | || | _| | | \__ \ | | | / _| / _ \| |\/| | [547] 2024/08/19 02:04:44.223511 [INF] \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_| |_| [547] 2024/08/19 02:04:44.223512 [INF] [547] 2024/08/19 02:04:44.223513 [INF] https://docs.nats.io/jetstream [547] 2024/08/19 02:04:44.223514 [INF] [547] 2024/08/19 02:04:44.223515 [INF] ---------------- JETSTREAM ---------------- [547] 2024/08/19 02:04:44.223517 [INF] Max Memory: 0 B [547] 2024/08/19 02:04:44.223519 [INF] Max Storage: 10.00 GB [547] 2024/08/19 02:04:44.223521 [INF] Store Directory: "/data/jetstream" [547] 2024/08/19 02:04:44.223522 [INF] ------------------------------------------- [547] 2024/08/19 02:04:44.223828 [INF] Starting restore for stream '$G > AAAAA' [547] 2024/08/19 02:04:45.007605 [INF] Restored 19,792,763 messages for stream '$G > AAAAA' [547] 2024/08/19 02:04:45.007761 [INF] Starting restore for stream '$G > BBBBB' [547] 2024/08/19 02:04:45.008076 [INF] Restored 7,386 messages for stream '$G > BBBBB' [547] 2024/08/19 02:04:45.008110 [INF] Recovering 8 consumers for stream - '$G > AAAAA' [547] 2024/08/19 02:04:45.008955 [INF] Starting JetStream cluster [547] 2024/08/19 02:04:45.008964 [INF] Creating JetStream metadata controller [547] 2024/08/19 02:04:45.009279 [INF] JetStream cluster recovering state [547] 2024/08/19 02:04:45.016651 [WRN] Could not start metadata controller: could not create snapshots directory - mkdir /data/jetstream/$SYS/_js_/_meta_/snapshots: no space left on device [547] 2024/08/19 02:04:45.016663 [FTL] Can't start JetStream: could not create snapshots directory - mkdir /data/jetstream/$SYS/_js_/_meta_/snapshots: no space left on device ```
  1. Volume was resized
  2. nats-1 was restarted again, which is when the "AAAAA" stream was recovered then deleted (nats-1 is online with the cluster now)
nats-1 logs on restart (stream recover+wipe) ``` ❯ k logs nats-1 --container nats [52] 2024/08/19 02:26:52.121966 [INF] Starting nats-server [52] 2024/08/19 02:26:52.122022 [INF] Version: 2.9.20 [52] 2024/08/19 02:26:52.122038 [INF] Git: [97dd7cb] [52] 2024/08/19 02:26:52.122051 [INF] Cluster: nats [52] 2024/08/19 02:26:52.122064 [INF] Name: nats-1 [52] 2024/08/19 02:26:52.122084 [INF] Node: yrzKKRBu [52] 2024/08/19 02:26:52.122098 [INF] ID: ~ [52] 2024/08/19 02:26:52.122330 [INF] Using configuration file: /etc/nats-config/nats.conf [52] 2024/08/19 02:26:52.123241 [INF] Starting http monitor on 0.0.0.0:8222 [52] 2024/08/19 02:26:52.123273 [INF] Starting JetStream [52] 2024/08/19 02:26:52.123360 [INF] _ ___ _____ ___ _____ ___ ___ _ __ __ [52] 2024/08/19 02:26:52.123362 [INF] _ | | __|_ _/ __|_ _| _ \ __| /_\ | \/ | [52] 2024/08/19 02:26:52.123363 [INF] | || | _| | | \__ \ | | | / _| / _ \| |\/| | [52] 2024/08/19 02:26:52.123365 [INF] \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_| |_| [52] 2024/08/19 02:26:52.123366 [INF] [52] 2024/08/19 02:26:52.123367 [INF] https://docs.nats.io/jetstream [52] 2024/08/19 02:26:52.123368 [INF] [52] 2024/08/19 02:26:52.123369 [INF] ---------------- JETSTREAM ---------------- [52] 2024/08/19 02:26:52.123371 [INF] Max Memory: 0 B [52] 2024/08/19 02:26:52.123373 [INF] Max Storage: 50.00 GB [52] 2024/08/19 02:26:52.123374 [INF] Store Directory: "/data/jetstream" [52] 2024/08/19 02:26:52.123376 [INF] ------------------------------------------- [52] 2024/08/19 02:26:52.123667 [INF] Starting restore for stream '$G > AAAAA' [52] 2024/08/19 02:26:52.909718 [INF] Restored 19,792,763 messages for stream '$G > AAAAA' [52] 2024/08/19 02:26:52.909834 [INF] Starting restore for stream '$G > BBBBB' [52] 2024/08/19 02:26:52.910189 [INF] Restored 7,386 messages for stream '$G > BBBBB' [52] 2024/08/19 02:26:52.910323 [INF] Recovering 8 consumers for stream - '$G > AAAAA' [52] 2024/08/19 02:26:52.911199 [INF] Starting JetStream cluster [52] 2024/08/19 02:26:52.911208 [INF] Creating JetStream metadata controller [52] 2024/08/19 02:26:52.911540 [INF] JetStream cluster recovering state [52] 2024/08/19 02:26:52.911954 [INF] Listening for client connections on 0.0.0.0:4222 [52] 2024/08/19 02:26:52.911978 [WRN] Detected orphaned stream '$G > AAAAA', will cleanup [52] 2024/08/19 02:26:52.912125 [INF] Server is ready [52] 2024/08/19 02:26:52.912145 [INF] Cluster name is nats [52] 2024/08/19 02:26:52.912176 [INF] Listening for route connections on 0.0.0.0:6222 [52] 2024/08/19 02:26:52.916458 [INF] 10.2.4.154:6222 - rid:28 - Route connection created [52] 2024/08/19 02:26:52.916895 [INF] 10.2.3.147:6222 - rid:29 - Route connection created [52] 2024/08/19 02:26:52.934425 [INF] 10.2.3.147:60874 - rid:30 - Route connection created [52] 2024/08/19 02:26:52.935030 [INF] 10.2.3.147:6222 - rid:29 - Router connection closed: Duplicate Route [52] 2024/08/19 02:26:52.936331 [INF] 10.2.3.147:60874 - rid:30 - Router connection closed: Client Closed [52] 2024/08/19 02:26:52.972944 [INF] 10.2.3.147:60902 - rid:31 - Route connection created [52] 2024/08/19 02:26:52.973135 [INF] 10.2.3.147:60890 - rid:32 - Route connection created [52] 2024/08/19 02:26:52.973422 [INF] 10.2.3.147:60902 - rid:31 - Router connection closed: Duplicate Route [52] 2024/08/19 02:26:53.012803 [WRN] Waiting for routing to be established... [52] 2024/08/19 02:26:53.018662 [INF] 10.2.3.147:60904 - rid:33 - Route connection created [52] 2024/08/19 02:26:53.019046 [INF] 10.2.3.147:60904 - rid:33 - Router connection closed: Duplicate Route [52] 2024/08/19 02:26:53.075148 [INF] 10.2.4.154:48354 - rid:34 - Route connection created [52] 2024/08/19 02:26:53.075391 [INF] 10.2.4.154:48354 - rid:34 - Router connection closed: Duplicate Route [52] 2024/08/19 02:26:53.231558 [WRN] Detected orphaned stream '$G > BBBBB', will cleanup [52] 2024/08/19 02:26:53.232347 [INF] JetStream cluster new metadata leader: nats-2/nats [52] 2024/08/19 02:26:53.235063 [INF] JetStream cluster new stream leader for '$G > AAAAA' [52] 2024/08/19 02:26:53.236201 [INF] JetStream cluster new consumer leader for '$G > AAAAA > logger-vsfds' [52] 2024/08/19 02:26:53.244121 [WRN] RAFT [yrzKKRBu - _meta_] Falling behind in health check, commit 3128454 != applied 0 [52] 2024/08/19 02:26:53.244132 [WRN] JetStream is not current with the meta leader [52] 2024/08/19 02:26:53.246372 [INF] JetStream cluster new consumer leader for '$G > BBBBB > PMGQkC39SrLs0pvkd6zCDB_1' [52] 2024/08/19 02:26:53.249819 [INF] JetStream cluster new consumer leader for '$G > AAAAA > wW0rW6BmVAXZTthq2HA5dc_3' [52] 2024/08/19 02:26:53.754579 [INF] 10.2.3.147:60908 - rid:102 - Route connection created [52] 2024/08/19 02:26:53.754732 [INF] 10.2.3.147:60908 - rid:102 - Router connection closed: Duplicate Route [52] 2024/08/19 02:26:53.965672 [INF] 10.2.3.147:6222 - rid:103 - Route connection created [52] 2024/08/19 02:26:53.966182 [INF] 10.2.3.147:6222 - rid:103 - Router connection closed: Duplicate Route [52] 2024/08/19 02:26:57.951420 [INF] JetStream cluster new consumer leader for '$G > BBBBB > EO8RDR3Ob3zTwDuxVizAiI_1' ```
  1. "AAAAA" was effectively restarted, back to sequence 0, no data on disk despite the Restored 19,792,763 messages for stream '$G > AAAAA' on the curent process logs

Warning logs excerpt:

[52] 2024/08/19 02:26:52.911978 [WRN] Detected orphaned stream '$G > AAAAA', will cleanup
[52] 2024/08/19 02:26:53.012803 [WRN] Waiting for routing to be established...
[52] 2024/08/19 02:26:53.231558 [WRN] Detected orphaned stream '$G > BBBBB', will cleanup
[52] 2024/08/19 02:26:53.244121 [WRN] RAFT [yrzKKRBu - _meta_] Falling behind in health check, commit 3128454 != applied 0
[52] 2024/08/19 02:26:53.244132 [WRN] JetStream is not current with the meta leader
wallyqs commented 2 months ago

Fixed via #5767