nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Streams seem to disappear when clusters are connected via gateway #4816

Open pcsegal opened 9 months ago

pcsegal commented 9 months ago

Observed behavior

I'm trying to set up a gateway connection between two clusters, A and B, in order to migrate streams and KV buckets from cluster A to cluster B and then decommission cluster A.

To create the gateway connections, I'm using an arbiter cluster with a single node.

I have added a reproducible example below, including NATS configuration files and a script, test.sh, that runs the example and reproduces the issue I'm seeing.

The steps in test.sh do the following:

  1. Start cluster A and cluster B (each defines a gateway block with a listening port, but they are not connected to each other; they are independent clusters).
  2. Create a KV bucket in cluster A with some data.
  3. Start the arbiter. This connects cluster A and cluster B into a supercluster.
  4. Check that the KV bucket is visible from both cluster A and cluster B.
  5. Edit the KV bucket's stream to move it to cluster B.
  6. For each node in cluster A, shut it down and run nats server raft peer-remove to decommission it.
  7. Verify that the KV bucket is in cluster B.

There is a problem that I see in step 3 after repeatedly running test.sh a few times. After the arbiter is up, the KV bucket sometimes disappears and is no longer visible in either cluster. This happens randomly when re-running test.sh. Note that I used a KV bucket here just as an example; the same thing happens when I test with plain streams.

Here is an example of what the output looks like when this happens; this is printed after starting up the arbiter and waiting for a few seconds:

KV buckets in cluster-a:
No Key-Value buckets found
KV buckets in cluster-b:
No Key-Value buckets found
nats: error: could not pick a Stream to operate on: unknown stream "KV_example"
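
A quick way to spot-check whether the gateway links themselves came up, as opposed to the stream assets being lost, is the gateway monitoring endpoint plus the JetStream report through the system account. This is only a diagnostic sketch, reusing the monitoring ports and the admin/pwd system user from the configs below:

# Gateway state as seen by a cluster-a node (monitoring port 8222)
curl -s http://localhost:8222/gatewayz

# JetStream asset report through the system account
nats -s localhost:4222 server report jetstream --user admin --password pwd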

Expected behavior

I expect that, after the arbiter is up and the gateway connections are established, any stream that previously existed in cluster A or B continues to exist there.

Server and client version

NATS server version: 2.10.5. NATS client version: 0.1.1

Host environment

Ubuntu 20.04.6.

Steps to reproduce

cluster-a-common.conf

accounts: {
    $SYS: {
        users: [
            {user: admin, password: pwd}
        ]
    },
}

cluster-a-0.conf

port: 4222

http: 8222
server_name: cluster-a-0

jetstream {
    max_mem: 1Gi
    store_dir: /tmp/nats-gateway-issue/cluster-a-0
}
lame_duck_grace_period: 10s
lame_duck_duration: 30s

cluster {
    listen: 0.0.0.0:6222
    name: cluster-a
    routes = [
        nats://127.0.0.1:6222,
        nats://127.0.0.1:6223,
        nats://127.0.0.1:6224,
    ]
    connect_retries: 120
}

gateway {
  name: cluster-a
  port: 7222
}

include cluster-a-common.conf

cluster-a-1.conf

port: 4223

http: 8223
server_name: cluster-a-1

jetstream {
    max_mem: 1Gi
    store_dir: /tmp/nats-gateway-issue/cluster-a-1
}
lame_duck_grace_period: 10s
lame_duck_duration: 30s

cluster {
    listen: 0.0.0.0:6223
    name: cluster-a
    routes = [
        nats://127.0.0.1:6222,
        nats://127.0.0.1:6223,
        nats://127.0.0.1:6224,
    ]
    connect_retries: 120
}

gateway {
  name: cluster-a
  port: 7223
}

include cluster-a-common.conf

cluster-a-2.conf

port: 4224

http: 8224
server_name: cluster-a-2

jetstream {
    max_mem: 1Gi
    store_dir: /tmp/nats-gateway-issue/cluster-a-2
}
lame_duck_grace_period: 10s
lame_duck_duration: 30s

cluster {
    listen: 0.0.0.0:6224
    name: cluster-a
    routes = [
        nats://127.0.0.1:6222,
        nats://127.0.0.1:6223,
        nats://127.0.0.1:6224,
    ]
    connect_retries: 120
}

gateway {
  name: cluster-a
  port: 7224
}

include cluster-a-common.conf

cluster-b-common.conf

accounts {
    $SYS: {
        users: [
            {user: admin, password: pwd}
        ]
    },
}

cluster-b-0.conf

port: 4225

http: 8225
server_name: cluster-b-0

jetstream {
    max_mem: 1Gi
    store_dir: /tmp/nats-gateway-issue/cluster-b-0
}
lame_duck_grace_period: 10s
lame_duck_duration: 30s

cluster {
    listen: 0.0.0.0:6225
    name: cluster-b
    routes = [
        nats://127.0.0.1:6225,
        nats://127.0.0.1:6226,
        nats://127.0.0.1:6227,
    ]
    connect_retries: 120
}

gateway {
  name: cluster-b
  port: 7225
}

include cluster-b-common.conf

cluster-b-1.conf

port: 4226

http: 8226
server_name: cluster-b-1

jetstream {
    max_mem: 1Gi
    store_dir: /tmp/nats-gateway-issue/cluster-b-1
}
lame_duck_grace_period: 10s
lame_duck_duration: 30s

cluster {
    listen: 0.0.0.0:6226
    name: cluster-b
    routes = [
        nats://127.0.0.1:6225,
        nats://127.0.0.1:6226,
        nats://127.0.0.1:6227,
    ]
    connect_retries: 120
}

gateway {
  name: cluster-b
  port: 7226
}

include cluster-b-common.conf

cluster-b-2.conf

port: 4227

http: 8227
server_name: cluster-b-2

jetstream {
    max_mem: 1Gi
    store_dir: /tmp/nats-gateway-issue/cluster-b-2
}
lame_duck_grace_period: 10s
lame_duck_duration: 30s

cluster {
    listen: 0.0.0.0:6227
    name: cluster-b
    routes = [
        nats://127.0.0.1:6225,
        nats://127.0.0.1:6226,
        nats://127.0.0.1:6227,
    ]
    connect_retries: 120
}

gateway {
  name: cluster-b
  port: 7227
}

include cluster-b-common.conf

arbiter.conf

server_name: arbiter
port: 4228
http_port: 8228

jetstream: {
  enabled: true
}

cluster {
    listen: 0.0.0.0:6228
    name: arbiter
    routes = [
        nats://127.0.0.1:6228,
    ]
    connect_retries: 120
}

gateway {
  name: arbiter
  port: 7228
  gateways: [
    {name: cluster-a, urls: [
      nats://localhost:7222,
      nats://localhost:7223,
      nats://localhost:7224,
    ]},
    {name: cluster-b, urls: [
      nats://localhost:7225,
      nats://localhost:7226,
      nats://localhost:7227,
    ]},
  ]
}

test.sh

#!/bin/bash

set -euo pipefail

# Clean up: make sure there are no NATS servers left over from a previous run of this script.

pkill -f -9 'nats.*cluster-a' || true
pkill -f -9 'nats.*cluster-b' || true
pkill -f -9 'nats.*arbiter' || true

# Clean up data directories used in this example to make sure we start from scratch.

rm -rf /tmp/nats-gateway-issue || true

mkdir -p /tmp/nats-gateway-issue

# Start cluster-a.

nats-server -c cluster-a-0.conf > /tmp/nats-gateway-issue/cluster-a-0.log 2>&1 &
nats-server -c cluster-a-1.conf > /tmp/nats-gateway-issue/cluster-a-1.log 2>&1 &
nats-server -c cluster-a-2.conf > /tmp/nats-gateway-issue/cluster-a-2.log 2>&1 &

# Start cluster-b.

nats-server -c cluster-b-0.conf > /tmp/nats-gateway-issue/cluster-b-0.log 2>&1 &
nats-server -c cluster-b-1.conf > /tmp/nats-gateway-issue/cluster-b-1.log 2>&1 &
nats-server -c cluster-b-2.conf > /tmp/nats-gateway-issue/cluster-b-2.log 2>&1 &

sleep 2

# Wait for cluster-a and cluster-b to be ready.

curl --fail --silent --retry 5 --retry-delay 1 http://localhost:8222/healthz > /dev/null
curl --fail --silent --retry 5 --retry-delay 1 http://localhost:8223/healthz > /dev/null
curl --fail --silent --retry 5 --retry-delay 1 http://localhost:8224/healthz > /dev/null
curl --fail --silent --retry 5 --retry-delay 1 http://localhost:8225/healthz > /dev/null
curl --fail --silent --retry 5 --retry-delay 1 http://localhost:8226/healthz > /dev/null
curl --fail --silent --retry 5 --retry-delay 1 http://localhost:8227/healthz > /dev/null

sleep 3

# Add a KV bucket with some data to cluster-a.

nats -s localhost:4222 kv add example

nats -s localhost:4222 kv put example key1 value1
nats -s localhost:4222 kv put example key2 value2
nats -s localhost:4222 kv put example key3 value3

# Start arbiter.

nats-server -c arbiter.conf > /tmp/nats-gateway-issue/arbiter.log 2>&1 &
sleep 1
curl --fail --silent --retry 5 --retry-delay 1 http://localhost:8228/healthz > /dev/null

# Arbiter up. Now cluster-a and cluster-b plus arbiter should form a supercluster.

sleep 3

# Both clusters should see the KV bucket.

echo "KV buckets in cluster-a:"
nats -s localhost:4222 kv ls #--user a --password a
echo "KV buckets in cluster-b:"
nats -s localhost:4225 kv ls #--user a --password a

# Migrate KV bucket's stream to cluster-b.

nats -s localhost:4222 stream edit KV_example --cluster cluster-b --force

sleep 5

# Kill cluster-a.

pkill -f "cluster-a-0"
sleep 1
nats -s localhost:4225 server raft peer-remove cluster-a-0 --user admin --password pwd

pkill -f "cluster-a-1"
sleep 1
nats -s localhost:4225 server raft peer-remove cluster-a-1 --user admin --password pwd

pkill -f "cluster-a-2"
sleep 1
nats -s localhost:4225 server raft peer-remove cluster-a-2 --user admin --password pwd

# KV bucket should be in cluster-b.

nats -s localhost:4225 kv ls

pcsegal commented 9 months ago

Had forgotten to include the configuration files in the reproducible example. Edited to include them.

derekcollison commented 8 months ago

What role is the arbiter playing? Why not connect cluster A directly to cluster B via GWs? Or extend cluster A with new nodes representing the new servers, have them tagged, and use tags to move the assets?
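
For reference, a rough sketch of that tag-based alternative (the tag name and commands here are illustrative, not taken from this thread): the replacement servers would join cluster-a with something like

server_tags: ["new-hw"]

in their server config, and once they are up the stream could be re-placed by placement tag, e.g.

nats -s localhost:4222 stream edit KV_example --tag new-hw --force

assuming natscli's placement --tag flag.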

pcsegal commented 6 months ago

Sorry for the delay.

True, this is not the simplest way to do a data migration. In fact, I ended up using a simpler strategy.

But, in any case, I posted this issue because it seems that establishing a gateway connection shouldn't cause the observed behavior.

The role of the arbiter (not necessarily just in this specific migration scenario) would be for JetStream fault tolerance: if one cluster goes down, you still have 2 other clusters, and so JetStream is still available. If I understand correctly, this fault tolerance would not exist if we only connected 2 clusters via gateway.

pcsegal commented 3 months ago

Hi, I'm wondering if you had time to investigate this further.

ripienaar commented 3 months ago

Does the latest version still exhibit this behaviour? There are also some RC releases ready for the next version.

pcsegal commented 3 months ago

Yes, I tested it again with 2.10.17-RC6 and it still seems to exhibit the behavior of sometimes not showing any KV anymore after starting up the arbiter node.

ripienaar commented 3 months ago

This config, where the arbiter node is the only one with the cluster shape defined, is probably not a supported configuration. JetStream wants a mostly static setup and wants to know the shape of things; you should list the full cluster everywhere rather than have this one arbiter that configures up the cluster. It's just not a supported approach, I think.

It might work now and then in some cases, but it's just not going to survive any outages or situations where the arbiter isn't around, and the arbiter becomes a quite nasty SPOF.

Better to use the software as designed.
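
For illustration, listing the full gateway map on every node (rather than only on the arbiter) could look roughly like this on a cluster-a node, mirroring the gateways list already in arbiter.conf; this is a sketch, not a tested config, and the cluster-b nodes would carry the same list under their own name and port:

gateway {
  name: cluster-a
  port: 7222
  gateways: [
    {name: cluster-a, urls: [nats://localhost:7222, nats://localhost:7223, nats://localhost:7224]},
    {name: cluster-b, urls: [nats://localhost:7225, nats://localhost:7226, nats://localhost:7227]},
  ]
}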

arkh-consensys commented 2 months ago

I have observed a similar behaviour by only adding a system account to our JetStream k8s Helm configuration, and we lost the stream completely.

Adding the following config to the Helm charts

jetstream:
  max_memory_store: << 1GB >>
accounts: {
  SYS: {
    users: [
      { user: admin, password: << $ADMIN_PASSWORD >> }
    ]
  },
}
system_account: SYS

Caused our stream to be removed!

I assume it is not safe to add an account (or a system account) to a running server. Previously, our JetStream server had no accounts enabled.

wallyqs commented 2 months ago

@arkh-consensys it had a system account, but it was named $SYS, which is the default one, and then here it was renamed to SYS.
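
For comparison, a minimal sketch of the same block that keeps the default system account name rather than introducing a new one (placeholders as in the comment above; untested against that Helm setup):

accounts: {
  $SYS: {
    users: [
      { user: admin, password: << $ADMIN_PASSWORD >> }
    ]
  }
}
system_account: $SYS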