nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

`nats stream cluster peer-remove` puts R1 stream in non-recoverable state #4396

Closed jzhn closed 1 year ago

jzhn commented 1 year ago

Defect

Make sure that these boxes are checked before submitting your issue -- thank you!

Versions of nats-server and affected client libraries used:

$ nats-server --version
nats-server: v2.9.21

$ nats --version
0.0.35

OS/Container environment:

macOS 13.5 (22G74)

Steps or code to reproduce the issue:

  1. Set up a simple 3-cluster super cluster, each cluster with 1 server. Use the steps from here: https://natsbyexample.com/examples/topologies/supercluster-jetstream/cli
    
    $ nats --context east-sys server report jetstream
    ╭───────────────────────────────────────────────────────────────────────────────────────────────╮
    │                                       JetStream Summary                                       │
    ├────────┬─────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬─────────┤
    │ Server │ Cluster │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
    ├────────┼─────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
    │ n2     │ central │ 1       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 3       │ 0       │
    │ n1*    │ east    │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 22      │ 1       │
    │ n3     │ west    │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │
    ├────────┼─────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
    │        │         │ 1       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 25      │ 1       │
    ╰────────┴─────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴─────────╯

    ╭────────────────────────────────────────────────────────────╮
    │                RAFT Meta Group Information                 │
    ├──────┬──────────┬────────┬─────────┬────────┬────────┬─────┤
    │ Name │ ID       │ Leader │ Current │ Online │ Active │ Lag │
    ├──────┼──────────┼────────┼─────────┼────────┼────────┼─────┤
    │ n1   │ fjFyEjc1 │ yes    │ true    │ true   │ 0.00s  │ 0   │
    │ n2   │ 44jzkV9D │        │ true    │ true   │ 0.44s  │ 0   │
    │ n3   │ BXScrY9i │        │ true    │ true   │ 0.44s  │ 0   │
    ╰──────┴──────────┴────────┴─────────┴────────┴────────┴─────╯

2. Create a simple R1 stream

    $ nats --context east stream add \
        --subjects test \
        --storage file \
        --replicas 1 \
        --retention limits \
        --discard old \
        --max-age 1m \
        --max-msgs=100 \
        --max-msgs-per-subject=-1 \
        --max-msg-size=-1 \
        --max-bytes=-1 \
        --dupe-window=1m \
        --no-allow-rollup \
        --no-deny-delete \
        --no-deny-purge \
        test

3. Verify that the stream is created and has landed on one of the clusters

    $ nats --context east stream report
    Obtaining Stream stats

    ╭─────────────────────────────────────────────────────────────────────────────────────────╮
    │                                      Stream Report                                      │
    ├────────┬─────────┬───────────┬───────────┬──────────┬───────┬──────┬─────────┬──────────┤
    │ Stream │ Storage │ Placement │ Consumers │ Messages │ Bytes │ Lost │ Deleted │ Replicas │
    ├────────┼─────────┼───────────┼───────────┼──────────┼───────┼──────┼─────────┼──────────┤
    │ test   │ File    │           │ 0         │ 0        │ 0 B   │ 0    │ 0       │ n2*      │
    ╰────────┴─────────┴───────────┴───────────┴──────────┴───────┴──────┴─────────┴──────────╯

4. Use the `peer-remove` command on the newly created stream

    $ nats --context east stream cluster peer-remove test
    ? Select a Peer n2
    11:33:19 Removing peer "n2"
    nats: error: peer remap failed (10075)


#### Expected result:

The `peer-remove` command either
- fails with an error message stating that the stream cannot be relocated to another server in the same cluster (since all clusters in this super cluster are single-node), or
- succeeds and relocates the stream to another cluster.

#### Actual result:

- The `peer-remove` command fails and leaves the stream in an intermediate state.
- The stream does not have any replicas

    $ nats --context east stream report
    Obtaining Stream stats

    ╭─────────────────────────────────────────────────────────────────────────────────────────╮
    │                                      Stream Report                                      │
    ├────────┬─────────┬───────────┬───────────┬──────────┬───────┬──────┬─────────┬──────────┤
    │ Stream │ Storage │ Placement │ Consumers │ Messages │ Bytes │ Lost │ Deleted │ Replicas │
    ├────────┼─────────┼───────────┼───────────┼──────────┼───────┼──────┼─────────┼──────────┤
    │ test   │ File    │           │ 0         │ 0        │ 0 B   │ 0    │ 0       │          │
    ╰────────┴─────────┴───────────┴───────────┴──────────┴───────┴──────┴─────────┴──────────╯

- Any command to manage or inspect the stream returns an error. __There's no way to unblock the stream, or to remove it from the cluster.__

    $ nats --context east stream edit test
    nats: error: could not request Stream test configuration: stream is offline (10118)

    $ nats --context east stream rm test
    ? Really delete Stream test Yes
    nats: error: could not remove Stream: stream is offline (10118)

    $ nats --context east stream info test
    nats: error: could not request Stream info: stream is offline (10118)


- It is impossible to create another stream that subscribes to the same subject(s). So when this issue happens, the cluster is left in a really bad state in which JetStream cannot subscribe to certain subjects.

derekcollison commented 1 year ago

Left the error code the same as we will pull this into 2.9.22. Can look at expanding the error description possibly in 2.10.

And if you want to move a stream, you can place it in any cluster or provide placement tags that it will use to select new peers and possibly a new cluster.
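
As a rough sketch of that suggestion, a stream that is still online could be moved by editing its placement. This assumes the nats CLI's `--cluster` and `--tag` placement options on `stream edit`, reuses the `east` context and `test` stream from the report, and picks `west` as the target cluster; the tag `region:west` is a made-up value. The `run` helper and the `DRY_RUN` variable are illustrative only and not part of the CLI: by default the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch only: print the commands unless DRY_RUN=0 is set explicitly.
DRY_RUN="${DRY_RUN:-1}"

# run is a hypothetical helper, not part of the nats CLI.
# It echoes the command in dry-run mode and executes it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Option 1: place the stream in a named cluster ("west" is one of the
# clusters in the super cluster from the report).
run nats --context east stream edit test --cluster west

# Option 2: let the server select peers by placement tag. "region:west" is
# a made-up tag; servers would need a matching server_tags entry.
run nats --context east stream edit test --tag region:west
```

Running the script with `DRY_RUN=0` executes the commands for real. Note that this only applies to a healthy stream: once a stream reports `stream is offline (10118)`, `stream edit` fails as shown in the report.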