You are following the guide for NATS Server clustering (that is, having several NATS servers form a cluster). But you are using NATS Streaming, and the streaming servers are not configured to run in clustered mode, as far as I can tell from the logs. They are started as standalone, so when the second server starts, it sees that there is another standalone server with the same cluster ID and fails. This is expected.
For NATS Streaming, check this: https://docs.nats.io/nats-streaming-concepts/clustering and this https://docs.nats.io/nats-streaming-concepts/clustering/configuration if you want to run NATS Streaming as a cluster.
If not, let me know what you are trying to do. Also, showing your current configuration files or how you started the servers from the command line would help.
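For reference, a minimal sketch of what a clustered setup from those docs could look like; the node IDs, ports, and paths below are illustrative assumptions, not taken from this issue:

```
# Hypothetical node-1 config combining NATS clustering and Streaming clustering.
port: 4222
cluster {
  port: 6222
  routes = [
    "nats-route://127.0.0.1:6223"
    "nats-route://127.0.0.1:6224"
  ]
}
streaming {
  id: test-cluster
  store: file
  dir: "./node1/datadir"
  cluster {
    node_id: "node1"
    # Peers are the other nodes' node_id values; with explicit peers there is
    # no need for a bootstrap node.
    peers: ["node2", "node3"]
    log_path: "./node1/raftlog"
  }
}
```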
@kozlovic,
The manuals you mentioned guided me toward success. I can bootstrap the 1st server, and the next servers then successfully connect to it. But I noticed that if the 1st server fails, the remaining two followers become non-operational.
I launch the 1st server using the following configuration:

```
...
streaming {
  id: cluster
  store: file
  dir: "./datadir/nats_1"
  cluster {
    bootstrap: true
    raft_logging: true
    log_path: "./datadir/logs"
  }
  store_limits: {
    max_age: "336h"
    max_msgs: 0
    max_bytes: 0
    max_subs: 0
  }
  hb_interval: "5s"
  hb_timeout: "2s"
  hb_fail_count: 2
}
...
```
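A sketch of how such a combined config might be loaded; the flag and relative path here are assumptions (check nats-streaming-server --help for the exact option):

```
# Start node 1 with its combined NATS + Streaming configuration file.
# -sc is assumed here; the plain -c flag may also accept a combined file.
./nats-streaming-server -sc nats_1/nats.conf
```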
Is it still possible to bootstrap the NATS Streaming cluster without knowing its exact size at the beginning, and to have the cluster remain operational if the 1st server fails?
How do you start the other two? Please show the complete configuration files.
At the beginning, the two follower nodes each have one route to the leader (bootstrap: true) node.
Yes, NATS will gossip and form a full mesh, so nodes 2 and 3 should both be connected to node 1 and to each other. I hope you are giving it enough time before stopping node 1. Otherwise, you may want to explicitly configure routes to each other, as sketched below.
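A sketch of what explicit routes could look like in each node's NATS cluster block, so the mesh does not depend on node 1 for discovery; the ports are assumptions:

```
# Hypothetical cluster block listing every peer explicitly.
# A route pointing at the node itself is detected and ignored by the server,
# so the same list can be reused on all three nodes.
cluster {
  port: 6222
  routes = [
    "nats-route://127.0.0.1:6222"
    "nats-route://127.0.0.1:6223"
    "nats-route://127.0.0.1:6224"
  ]
}
```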
But I believe that your issue is that you are using the same store location for the 3 nodes. In clustering mode, each node must have its own storage (both dir and log_path). Make them unique on each node (see the sketch below) and let me know if that solves your issue.
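For illustration, the per-node difference might look like this, following the pattern of the node-1 config above; the exact paths are assumptions:

```
# Node 2's streaming block: same cluster id, but its own dir and log_path.
streaming {
  id: cluster
  store: file
  dir: "./datadir/nats_2"
  cluster {
    raft_logging: true
    log_path: "./datadir/logs_2"
  }
}
```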
The reconsidered file structure still gives the same result; the full tree is attached. The structure is the following:
```
user@userpc:~/dev/nats/nats_seed$ tree
.
├── nats_1
│   ├── datadir
│   │   ├── clients.dat
│   │   ├── cluster
│   │   │   ├── raft.log
│   │   │   └── snapshots
│   │   └── server.dat
│   └── nats.conf
├── nats_2
│   ├── datadir
│   │   ├── clients.dat
│   │   ├── cluster
│   │   │   ├── raft.log
│   │   │   └── snapshots
│   │   └── server.dat
│   └── nats.conf
├── nats_3
│   ├── datadir
│   │   ├── clients.dat
│   │   ├── cluster
│   │   │   ├── raft.log
│   │   │   └── snapshots
│   │   └── server.dat
│   └── nats.conf
└── nats-streaming-server
```
The 1st server is stopped with a simple CTRL+C. Before interrupting the 1st server, it can be seen that each node has 2 routes (for instance at localhost:8222). The remaining 2 followers cannot reach quorum:
```
[15735] 2021/03/01 22:29:45.443361 [ERR] Error trying to connect to route (attempt 11): dial tcp 127.0.0.1:6222: connect: connection refused
[15735] 2021/03/01 22:29:46.444179 [ERR] Error trying to connect to route (attempt 12): dial tcp 127.0.0.1:6222: connect: connection refused
[15735] 2021/03/01 22:29:47.444954 [ERR] Error trying to connect to route (attempt 13): dial tcp 127.0.0.1:6222: connect: connection refused
[15735] 2021/03/01 22:29:48.445871 [ERR] Error trying to connect to route (attempt 14): dial tcp 127.0.0.1:6222: connect: connection refused
[15735] 2021/03/01 22:29:49.446192 [ERR] Error trying to connect to route (attempt 15): dial tcp 127.0.0.1:6222: connect: connection refused
[15735] 2021/03/01 22:29:50.446998 [ERR] Error trying to connect to route (attempt 16): dial tcp 127.0.0.1:6222: connect: connection refused
[15735] 2021/03/01 22:29:51.447867 [ERR] Error trying to connect to route (attempt 17): dial tcp 127.0.0.1:6222: connect: connection refused
```
Have you cleared the original datadir directories before restructuring? And how do you know it is not reaching quorum? Check http://localhost:8223/serverz (and the same for 8224) and check the "role" field.
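For example, assuming the monitoring ports are 8223/8224 as above; on NATS Streaming releases that expose the streaming monitoring endpoints, the role is reported under /streaming/serverz:

```
# The "role" field should report "Leader" on one node and "Follower" on the others.
curl -s http://localhost:8223/streaming/serverz
curl -s http://localhost:8224/streaming/serverz
```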
I have been using your config files and I am able to make this work (after clearing the content of datadir). I also verified that nodes 2 and 3 have a route to each other after stopping node 1.
If you try to send messages, make sure you are pointing to either all 3 servers, or at least to server 2 or 3 once server 1 has been stopped.
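For example, NATS clients accept a comma-separated list of server URLs, so a connection string covering all three nodes (client ports assumed) lets a client fail over when node 1 is down:

```
nats://127.0.0.1:4222,nats://127.0.0.1:4223,nats://127.0.0.1:4224
```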
Indeed, the clients are able to restore their connection to the cluster even when node_1 is down. I was distracted by the error messages: I was looking for something like "cluster/quorum restored" or "node_2 or node_3 promoted to Leader". The logs contain only error messages, not what I was expecting to see.
Anyway, from the NATS Streaming clients' perspective I see that the cluster is able to restore its operational status, which is what I need. We will go into production with three basic VMs and need to be able to increase the number of nodes dynamically.
Mr. @kozlovic, thank you so much for the help!
We enjoy NATS very much!
Dear NATS Guys,
I'm trying to follow the guide https://docs.nats.io/nats-server/configuration/clustering
Env:
While trying to launch the 2nd instance of NATS, I observe the error:

```
[FTL] STREAM: Failed to start: discovered another streaming server with cluster ID "test-cluster"
```
Please see the attached log files.