Failed bootstrapping of seeded cluster

sskserk commented 3 years ago

Dear Nats Guys,

I'm trying to follow the guide https://docs.nats.io/nats-server/configuration/clustering

Env:

20.04.1-Ubuntu
nats-streaming-server version 0.20.0, nats-server: v2.1.9

While trying to launch the 2nd instance of NATS, I observe the error: [FTL] STREAM: Failed to start: discovered another streaming server with cluster ID "test-cluster"

Please see the attached log files.

sskserk commented 3 years ago

srv1.log srv2.log

kozlovic commented 3 years ago

You are following the guide for NATS Server clustering (that is, having several NATS servers form a cluster). But you are using NATS Streaming and those are not configured to run in clustered mode, as far as I can tell from the logs. They are started as standalone, so when the second server starts, it sees that there is another standalone with the same "cluster ID" name and fails. This is expected.

For NATS Streaming, check this: https://docs.nats.io/nats-streaming-concepts/clustering and this https://docs.nats.io/nats-streaming-concepts/clustering/configuration if you want to run NATS Streaming as a cluster.

If not, let me know what you are trying to do. Also, showing your current configuration files or how you started the servers from the command line would help.

sskserk commented 3 years ago

@kozlovic,

The manuals you mentioned guided me to success. I can bootstrap the 1st server, and then the next servers successfully connect to it. But I noticed that if the 1st server fails, the remaining two followers become non-operational.

I launch the 1st server using the configuration:

    ...
    streaming {
      id: cluster
      store: file
      dir: "./datadir/nats_1"
      cluster {
        bootstrap: true
        raft_logging: true
        log_path: "./datadir/logs"
      }
      store_limits: {
        max_age: "336h"
        max_msgs: 0
        max_bytes: 0
        max_subs: 0
      }
      hb_interval: "5s"
      hb_timeout: "2s"
      hb_fail_count: 2
    }
    ...

Is it still possible to bootstrap the NATS Streaming cluster without knowing its exact size at the beginning, and have the cluster remain operational if the 1st server fails?

kozlovic commented 3 years ago

How do you start the 2 others? Show the complete configuration files please.

sskserk commented 3 years ago

cluster.tar.gz

At the beginning, the two follower nodes each have 1 route to the leader (bootstrap: true) node.

kozlovic commented 3 years ago

Yes, NATS will gossip and form a full mesh, so nodes 2 and 3 should both be connected to node 1 and to each other. I hope you are giving them enough time before stopping node 1. Otherwise, you may want to explicitly configure routes from each node to the others.

But I believe that your issue is that you are using the same store location for the 3 nodes. In clustering mode, each node must have its own storage (both dir and log_path). Make them unique on each node and let me know if that solves your issue.
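
To illustrate the point about unique storage, here is a minimal Python sketch that flags any location used by more than one node. It is only a sanity check under stated assumptions: the node names and paths are hypothetical, it does not parse real NATS configuration files, and it compares paths as written, so it is only meaningful when all nodes are started from the same working directory.

```python
import os
from collections import defaultdict


def find_shared_paths(nodes):
    """Return {path: [node names]} for every path used by more than one node.

    `nodes` maps a node name to its (dir, log_path) pair. All names and
    paths passed in are hypothetical examples supplied by the caller.
    """
    users = defaultdict(list)
    for name, paths in nodes.items():
        # Normalize and de-duplicate so a node that lists the same location
        # twice is only counted once.
        for path in {os.path.normpath(p) for p in paths}:
            users[path].append(name)
    # Keep only the locations that are claimed by two or more nodes.
    return {path: sorted(names) for path, names in users.items() if len(names) > 1}
```

An empty result means no two nodes share a dir or log_path; any entry in the result names the nodes that collide on that location.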

sskserk commented 3 years ago

The restructured file layout still gives the same result. The full tree is attached.

The structure is the following.

user@userpc:~/dev/nats/nats_seed$ tree
.
├── nats_1
│   ├── datadir
│   │   ├── clients.dat
│   │   ├── cluster
│   │   │   ├── raft.log
│   │   │   └── snapshots
│   │   └── server.dat
│   └── nats.conf
├── nats_2
│   ├── datadir
│   │   ├── clients.dat
│   │   ├── cluster
│   │   │   ├── raft.log
│   │   │   └── snapshots
│   │   └── server.dat
│   └── nats.conf
├── nats_3
│   ├── datadir
│   │   ├── clients.dat
│   │   ├── cluster
│   │   │   ├── raft.log
│   │   │   └── snapshots
│   │   └── server.dat
│   └── nats.conf
└── nats-streaming-server

nats.tar.gz

The 1st server is stopped with a simple CTRL+C. Before interrupting the 1st server, each node can be seen to have 2 routes (for instance at localhost:8222). The remaining 2 followers cannot reach quorum:

[15735] 2021/03/01 22:29:45.443361 [ERR] Error trying to connect to route (attempt 11): dial tcp 127.0.0.1:6222: connect: connection refused
[15735] 2021/03/01 22:29:46.444179 [ERR] Error trying to connect to route (attempt 12): dial tcp 127.0.0.1:6222: connect: connection refused
[15735] 2021/03/01 22:29:47.444954 [ERR] Error trying to connect to route (attempt 13): dial tcp 127.0.0.1:6222: connect: connection refused
[15735] 2021/03/01 22:29:48.445871 [ERR] Error trying to connect to route (attempt 14): dial tcp 127.0.0.1:6222: connect: connection refused
[15735] 2021/03/01 22:29:49.446192 [ERR] Error trying to connect to route (attempt 15): dial tcp 127.0.0.1:6222: connect: connection refused
[15735] 2021/03/01 22:29:50.446998 [ERR] Error trying to connect to route (attempt 16): dial tcp 127.0.0.1:6222: connect: connection refused
[15735] 2021/03/01 22:29:51.447867 [ERR] Error trying to connect to route (attempt 17): dial tcp 127.0.0.1:6222: connect: connection refused

kozlovic commented 3 years ago

> The remaining 2 followers cannot reach quorum:

Have you cleared the original datadir directories before restructuring? And how do you know it is not reaching quorum? Check http://localhost:8223/serverz (and the same for 8224) and look at the "role" field.
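
For anyone who prefers to script that check, below is a minimal Python sketch that reads the "role" field from a monitoring response. It makes several assumptions: the endpoint returns a JSON object, the URL is passed in by the caller (the path quoted in this thread is used as-is in the usage note; I believe NATS Streaming serves its own endpoints under a /streaming/ prefix, such as /streaming/serverz, so adjust it if needed), and the function names are hypothetical.

```python
import json
import urllib.request


def extract_role(payload):
    """Return the "role" value from a monitoring response body, or None.

    `payload` is the raw JSON text (str or bytes) returned by the server.
    """
    data = json.loads(payload)
    return data.get("role")


def fetch_role(url, timeout=2.0):
    """GET a monitoring URL and return its "role" field (None if absent)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_role(resp.read())
```

Calling `fetch_role("http://localhost:8223/serverz")` and then the same for port 8224 after stopping node 1 should show whether one of the two remaining nodes has taken over as leader. A connection error from `fetch_role` simply means the monitoring port is not reachable.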

I have been using your config files and I am able to have this work (after clearing the content of datadir). I also verified that node 2 and 3 have a route to each other after stopping node 1.

If you try to send messages, make sure you are pointing to either all 3 servers or at least server 2 or 3 when server 1 has been stopped.
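
NATS clients generally accept a comma-separated list of server URLs, which is one way to point a client at all 3 servers. Here is a minimal Python sketch that builds such a list. The helper name is hypothetical, and the client ports in the usage note (4222 to 4224) are an assumption, since the client ports are not shown in this thread.

```python
def build_server_urls(hosts):
    """Join (host, port) pairs into one comma-separated NATS URL string."""
    return ",".join(f"nats://{host}:{port}" for host, port in hosts)
```

For example, `build_server_urls([("localhost", 4222), ("localhost", 4223), ("localhost", 4224)])` yields a single string that can be passed as the server URL option of the client library in use, so the client can fall back to server 2 or 3 when server 1 is down.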

sskserk commented 3 years ago

Indeed, the clients are able to restore their connection to the cluster even if node_1 is down. I was distracted by the error messages. I was looking for something like "cluster/quorum restored" or "node_2 or node_3 is promoted to Leader". The logs contain only error messages, not what I was expecting to see.

Anyway, from the NATS Streaming clients' perspective, I see that the cluster is able to restore its operational status. This is what I need. We will go into production with 3 basic VMs, and we need to be able to increase the number of nodes dynamically.

Mr @kozlovic thank you so much for the help!

We enjoy the Nats very much!

nats-io / nats-streaming-server

Failed bootstrapping of seeded cluster #1165