csuriano23 closed this issue 7 months ago.
Just start a new server and add it to the cluster; it will work.
But if you decide to remove it, you need to make sure you tell the system to remove the peer and that it is not coming back.
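For example, once a node is gone for good, removing it could look something like this with the NATS CLI (the server name n1m3 and the admin credentials here are just placeholders; the command needs a system-account user):

nats server raft peer-remove n1m3 --user admin --password admin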
Thank you @derekcollison
I think I have reached the goal, setting up one cluster as:
version: "3.5"
services:
nats:
image: nats
ports:
- "4222:4222"
- "8222:8222"
command: "--server_name n1m1 --js --store_dir /data --cluster_name NATS --cluster nats://0.0.0.0:6222 --http_port 8222 --routes=nats://nats:6222"
networks: [ "nats" ]
nats-1:
image: nats
command: "--server_name n1m2 --js --store_dir /data --cluster_name NATS --cluster nats://0.0.0.0:6222 --routes=nats://nats:6222"
networks: [ "nats" ]
depends_on: [ "nats" ]
nats-2:
image: nats
command: "--server_name n1m3 --js --store_dir /data --cluster_name NATS --cluster nats://0.0.0.0:6222 --routes=nats://nats:6222"
networks: [ "nats" ]
depends_on: [ "nats" ]
networks:
nats:
name: nats
and attaching another as:
version: "3.5"
services:
nats-ancor:
image: nats
ports:
- "4223:4222"
command: "--server_name n2m1 --js --store_dir /data --cluster_name NATS --cluster nats://0.0.0.0:6222 --routes=nats://nats:6222,nats://nats-ancor:6222"
networks: [ "nats" ]
nats-ancor2:
image: nats
command: "--server_name n2m2 --js --store_dir /data --cluster_name NATS --cluster nats://0.0.0.0:6222 --routes=nats://nats:6222,nats://nats-ancor:6222"
networks: [ "nats" ]
networks:
nats:
name: nats
Everything seems fine, and messages published after tearing up the second compose (localhost:4222) are correctly received by subscribers attached to the other instance (localhost:4223).
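(Something along these lines can be used to check that cross-propagation with the NATS CLI; the subject name test is just an example:

nats sub -s nats://localhost:4223 test
nats pub -s nats://localhost:4222 test "hello"

and the message published on 4222 shows up on the 4223 subscriber.)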
The only issue I find is that sometimes, after tearing up the second compose, the leader keeps continuously switching back and forth:
2022-11-25 09:08:58 [1] 2022/11/25 08:08:58.650248 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:08:59 [1] 2022/11/25 08:08:59.214923 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:00 [1] 2022/11/25 08:09:00.645630 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:01 [1] 2022/11/25 08:09:01.214003 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:01 [1] 2022/11/25 08:09:01.645159 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:02 [1] 2022/11/25 08:09:02.213404 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:02 [1] 2022/11/25 08:09:02.649775 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:03 [1] 2022/11/25 08:09:03.215610 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:03 [1] 2022/11/25 08:09:03.645244 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:04 [1] 2022/11/25 08:09:04.211452 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:04 [1] 2022/11/25 08:09:04.645996 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:05 [1] 2022/11/25 08:09:05.212703 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:05 [1] 2022/11/25 08:09:05.645641 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:06 [1] 2022/11/25 08:09:06.211454 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:06 [1] 2022/11/25 08:09:06.646649 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:07 [1] 2022/11/25 08:09:07.215309 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:07 [1] 2022/11/25 08:09:07.648165 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:08 [1] 2022/11/25 08:09:08.212681 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:08 [1] 2022/11/25 08:09:08.648338 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:09 [1] 2022/11/25 08:09:09.211939 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:09 [1] 2022/11/25 08:09:09.645624 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:10 [1] 2022/11/25 08:09:10.212488 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:10 [1] 2022/11/25 08:09:10.648109 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:11 [1] 2022/11/25 08:09:11.215453 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:11 [1] 2022/11/25 08:09:11.646004 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:12 [1] 2022/11/25 08:09:12.222187 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:12 [1] 2022/11/25 08:09:12.645080 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:13 [1] 2022/11/25 08:09:13.213355 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:13 [1] 2022/11/25 08:09:13.647581 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:14 [1] 2022/11/25 08:09:14.211603 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:14 [1] 2022/11/25 08:09:14.646100 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:15 [1] 2022/11/25 08:09:15.213722 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:15 [1] 2022/11/25 08:09:15.646205 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:16 [1] 2022/11/25 08:09:16.214133 [INF] JetStream cluster new metadata leader: n2m2/NATS
2022-11-25 09:09:16 [1] 2022/11/25 08:09:16.647944 [INF] JetStream cluster new metadata leader: n1m1/NATS
2022-11-25 09:09:17 [1] 2022/11/25 08:09:17.212649 [INF] JetStream cluster new metadata leader: n2m2/NATS
Maybe it is related to the caveat you described
That behavior shows something is not working correctly.
What do you mean by "tearing up"?
What does nats server report jetstream show?
By "tearing up" I simply mean docker compose up -d --build
When running nats server report jetstream --server=...
I receive:
nats: error: server request failed, ensure the account used has system privileges and appropriate permissions
both on localhost and on a nats-box started on the same nats network (docker run --rm -it --network=nats natsio/nats-box:latest).
You will need a system user for many of the NATS cli commands. Looping in @wallyqs.
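A minimal sketch of that is an accounts block with a $SYS user in each server config, e.g.:

accounts {
  $SYS { users = [ { user: "admin", pass: "admin" } ] }
}

and then passing --user admin --password admin to the nats CLI commands (the admin/admin credentials are just an example).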
I managed to reproduce the issue with a system user configured.
These are the files:
./v0/docker-compose.yml
version: "3.8"
services:
n1m1:
image: nats:2.9.3
restart: unless-stopped
ports:
- "4222:4222"
- "8222:8222"
volumes:
- ./config:/config
command: "--config /config/n1m1.conf"
networks:
- backbone
n1m2:
image: nats:2.9.3
restart: unless-stopped
command: "--config /config/n1m2.conf"
volumes:
- ./config:/config
networks:
- backbone
depends_on:
- n1m1
n1m3:
image: nats
restart: unless-stopped
command: "--config /config/n1m3.conf"
volumes:
- ./config:/config
networks:
- backbone
depends_on:
- n1m1
nats-cli:
image: natsio/nats-box:0.13.2
restart: unless-stopped
tty: true
networks:
- backbone
networks:
backbone:
name: backbone
./v1/docker-compose.yml
version: "3.8"
services:
n2m1:
image: nats:2.9.3
restart: unless-stopped
ports:
- "4223:4222"
volumes:
- ./config:/config
command: "--config /config/n2m1.conf"
networks:
- backbone
n2m2:
image: nats:2.9.3
restart: unless-stopped
volumes:
- ./config:/config
command: "--config /config/n2m2.conf"
networks:
- backbone
networks:
backbone:
name: backbone
./v0/config/n1m1.conf
server_name=n1m1
listen=4222
http_port=8222
accounts {
  $SYS { users = [ { user: "admin", pass: "admin" } ] }
}
jetstream {
  store_dir=/data
}
cluster {
  name: NATS
  listen: 0.0.0.0:6222
  routes: [
    nats-route://n1m2:6222
    nats-route://n1m3:6222
  ]
}
./v0/config/n1m2.conf
server_name=n1m2
listen=4222
accounts {
  $SYS { users = [ { user: "admin", pass: "admin" } ] }
}
jetstream {
  store_dir=/data
}
cluster {
  name: NATS
  listen: 0.0.0.0:6222
  routes: [
    nats-route://n1m1:6222
    nats-route://n1m3:6222
  ]
}
./v0/config/n1m3.conf
server_name=n1m3
listen=4222
accounts {
  $SYS { users = [ { user: "admin", pass: "admin" } ] }
}
jetstream {
  store_dir=/data
}
cluster {
  name: NATS
  listen: 0.0.0.0:6222
  routes: [
    nats-route://n1m1:6222
    nats-route://n1m2:6222
  ]
}
./v1/config/n2m1.conf
server_name=n2m1
listen=4222
accounts {
  $SYS { users = [ { user: "admin", pass: "admin" } ] }
}
jetstream {
  store_dir=/data
}
cluster {
  name: NATS
  listen: 0.0.0.0:6222
  routes: [
    nats-route://n2m2:6222
    nats-route://n1m1:6222
  ]
}
./v1/config/n2m2.conf
server_name=n2m2
listen=4222
accounts {
  $SYS { users = [ { user: "admin", pass: "admin" } ] }
}
jetstream {
  store_dir=/data
}
cluster {
  name: NATS
  listen: 0.0.0.0:6222
  routes: [
    nats-route://n2m1:6222
    nats-route://n1m1:6222
  ]
}
Also the reporting behavior seems to be periodic, switching between:
~ # nats server report jetstream --server=n1m1 --user=admin --password=admin
╭───────────────────────────────────────────────────────────────────────────────────────────────╮
│ JetStream Summary │
├────────┬─────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬─────────┤
│ Server │ Cluster │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├────────┼─────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ n1m1 │ NATS │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
│ n2m1* │ NATS │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
│ n2m2 │ NATS │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
│ n1m3 │ NATS │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
│ n1m2* │ NATS │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
├────────┼─────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ │ │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
╰────────┴─────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴─────────╯
╭─────────────────────────────────────────────────╮
│ RAFT Meta Group Information │
├──────┬────────┬─────────┬────────┬────────┬─────┤
│ Name │ Leader │ Current │ Online │ Active │ Lag │
├──────┼────────┼─────────┼────────┼────────┼─────┤
│ n1m1 │ │ true │ true │ 0.68s │ 0 │
│ n1m2 │ yes │ true │ true │ 0.00s │ 0 │
│ n1m3 │ │ true │ true │ 0.68s │ 0 │
│ n2m1 │ │ false │ true │ 0.68s │ 9 │
│ n2m2 │ │ true │ true │ 0.68s │ 0 │
╰──────┴────────┴─────────┴────────┴────────┴─────╯
and
~ # nats server report jetstream --server=n1m1 --user=admin --password=admin
╭───────────────────────────────────────────────────────────────────────────────────────────────╮
│ JetStream Summary │
├────────┬─────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬─────────┤
│ Server │ Cluster │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├────────┼─────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ n1m1 │ NATS │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
│ n2m2 │ NATS │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
│ n1m2* │ NATS │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
│ n1m3 │ NATS │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
│ n2m1* │ NATS │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
├────────┼─────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┤
│ │ │ 0 │ 0 │ 0 │ 0 B │ 0 B │ 0 B │ 0 │ 0 │
╰────────┴─────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴─────────╯
╭─────────────────────────────────────────────────╮
│ RAFT Meta Group Information │
├──────┬────────┬─────────┬────────┬────────┬─────┤
│ Name │ Leader │ Current │ Online │ Active │ Lag │
├──────┼────────┼─────────┼────────┼────────┼─────┤
│ n1m1 │ │ true │ true │ 0.14s │ 0 │
│ n1m2 │ │ false │ true │ 0.14s │ 11 │
│ n1m3 │ │ true │ true │ 0.14s │ 0 │
│ n2m1 │ yes │ true │ true │ 0.00s │ 0 │
│ n2m2 │ │ true │ true │ 0.14s │ 0 │
╰──────┴────────┴─────────┴────────┴────────┴─────╯
This strange behavior is more likely to happen when doing docker compose up
first on ./v1/docker-compose.yml and then shortly after on ./v0/docker-compose.yml.
Edit: sorry, I had mixed up the order; it is fixed now.
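So the sequence that seems most likely to trigger it is roughly:

cd v1 && docker compose up -d
cd ../v0 && docker compose up -d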
Thanks, will take a look. It seems the system thinks it has 2 meta leaders, which is not desired of course.
@csuriano23 You mentioned that you have it fixed now, what ended up being the problem?
Sorry, I meant I've fixed the comment; I had specified the wrong startup order to reproduce the issue.
Gotcha! I'll look into this today as well and see if I can find the issue
Any luck with this issue?
I think I'm getting a similar one when using nats-server in a capacity=3 auto scaling group on AWS. If an instance goes away, the two remaining instances fail with "JetStream cluster no metadata leader" in a loop whenever a replacement instance manages to connect (with routes to the 2 remaining servers).
I don't think it's appropriate to autoscale a database tbh - and in the context of JetStream it basically is a database - so why do you want to auto scale it? It seems inherently incompatible with the very concept.
@ripienaar The AWS "autoscaling" terminology here should be understood as "auto instance replacement". If a node fails for any reason (e.g. underlying hardware failure), I would like the ASG to automatically launch a new node (that connects to the cluster as a replacement for the lost one).
I have changed "server_name" to use the availability zone instead of the instance id - and it seems the newly launched instance (with the same name as the failed instance) is able to catch up and serve queries correctly.
Is that not a supported use case?
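For context, roughly what I mean by that in the instance startup script (a sketch only; it assumes plain IMDSv1 metadata access, and the nats- prefix and route hosts are placeholders):

#!/bin/sh
# Derive a stable server_name from the availability zone, so a replacement
# instance launched in the same AZ rejoins the cluster under the same identity.
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
exec nats-server --server_name "nats-${AZ}" --js --store_dir /data \
  --cluster_name NATS --cluster nats://0.0.0.0:6222 \
  --routes=nats://node-a:6222,nats://node-b:6222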
You can replace nodes as long as the overall up-node count maintains a quorum and as long as new nodes coming in have the same server name. Then it will work. What you can't generally do is dynamically change the number of nodes or (easily) replace nodes with ones that have new server names configured.
@ripienaar Thank you very much, that explains it; my earlier problem was indeed due to using a new unique server_name for the replacement node instead of reusing the same server name.
Just to double-check: if the replacement node with the same name starts with a new disk (storage of the failed instance lost), is that a supported use case as well, and will the new node restore the data as and if needed from the other nodes?
That's correct. It can take a good while, and during that time the node is essentially not yet ready and not 100% available, so if you do this as a rolling maintenance you need to be very careful.
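One way to keep an eye on that catch-up (assuming a system user like in the configs above) is the same report used earlier, watching the Current/Lag columns for the replacement node until it is current again:

nats server report jetstream --user admin --password admin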
Hi there, from what I read in the docs, it seems there is no way to configure a NATS JetStream cluster and then dynamically add a node to the existing cluster.
My need is specifically for a local environment, so that I can tear up each microservice's docker stack atomically for unit testing and then cluster all the instances together for integration testing, without having to rewrite the stack.
Any idea on how to achieve that?