Closed by hprovenza 4 years ago
How would this list of peers be passed to the library? Yes, if you know them beforehand, I think it makes sense to add that support to the library.
I'll start with a config file as I believe I will know them ahead of time.
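A rough sketch of what reading the peers from a config file might look like, assuming a standard Java properties file. The key name `rqlite.peers` and the class `PeerConfig` are hypothetical and are not part of rqlite-java.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper, not part of rqlite-java: reads a peer list from a properties source. */
final class PeerConfig {
    /** Assumed key name; the value is a comma-separated list of host:port entries. */
    static final String PEERS_KEY = "rqlite.peers";

    private PeerConfig() {}

    static List<String> loadPeers(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        List<String> peers = new ArrayList<>();
        // A missing key yields an empty list rather than an error.
        for (String entry : props.getProperty(PEERS_KEY, "").split(",")) {
            String peer = entry.trim();
            if (!peer.isEmpty()) {
                peers.add(peer);
            }
        }
        return peers;
    }
}
```

A file containing the line `rqlite.peers = alice.foo.com:4001, bob.foo.com:8002, carl.foo.com:6003` would produce a three-element list in that order.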
That's fine, go ahead. Please generate a PR as needed.
I'm currently running into an error that I don't understand, similar to this one that you weighed in on. In order to ensure a consistent test environment for failover, I'm trying to start the rqlite nodes from within the test.
// Start the first node (node.2): HTTP on 4007, Raft on 4008.
ProcessBuilder pb1 = new ProcessBuilder("./rqlited", "-node-id", "node.2",
        "-http-addr", "localhost:4007", "-raft-addr", "localhost:4008", "/home/hprovenz/node.2");
pb1.inheritIO(); // forward rqlited's output to the test's console
pb1.directory(new File("/home/hprovenz/rqlite-v5.4.0-linux-amd64"));
Process node1 = pb1.start();
// Start the second node (node.3): HTTP on 4003, Raft on 4004, joining the first node.
ProcessBuilder pb2 = new ProcessBuilder("./rqlited", "-node-id", "node.3",
        "-http-addr", "localhost:4003", "-raft-addr", "localhost:4004", "-join", "localhost:4007",
        "/home/hprovenz/node.3");
pb2.inheritIO();
pb2.directory(new File("/home/hprovenz/rqlite-v5.4.0-linux-amd64"));
Process node2 = pb2.start();
The servers seem to be starting correctly, judging from the output, but when they try to connect to one another I get the following message from node.2:
2020/09/02 09:49:44 [DEBUG] raft-net: 127.0.0.1:4008 accepted connection from: 127.0.0.1:46440
2020/09/02 09:49:44 [ERR] raft-net: Failed to decode incoming command: unknown rpc type 80
Do you have any insight into why this might happen when the servers are started from Java, but not when I start them from the command line directly, or how to work around it?
I think you're telling one node's Raft communication system to talk to the HTTP port of another node. Check your command line parameters. -join should take the HTTP address of a node, not the Raft address of that node.
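One way to make that mix-up harder in test code is to keep each node's two addresses together and always derive the -join value from the target's HTTP address. The following is only a sketch: `RqliteNode` is a hypothetical helper class, and it uses only the flags already shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical test helper: holds one node's settings and builds its rqlited command line. */
final class RqliteNode {
    final String id;
    final String httpAddr;
    final String raftAddr;
    final String dataDir;

    RqliteNode(String id, String httpAddr, String raftAddr, String dataDir) {
        this.id = id;
        this.httpAddr = httpAddr;
        this.raftAddr = raftAddr;
        this.dataDir = dataDir;
    }

    /**
     * Builds the argument list for this node.
     * Pass null as joinTarget for the first node of a cluster.
     */
    List<String> command(RqliteNode joinTarget) {
        List<String> cmd = new ArrayList<>();
        cmd.add("./rqlited");
        cmd.add("-node-id");
        cmd.add(id);
        cmd.add("-http-addr");
        cmd.add(httpAddr);
        cmd.add("-raft-addr");
        cmd.add(raftAddr);
        if (joinTarget != null) {
            // -join always gets the target's HTTP address, never its Raft address.
            cmd.add("-join");
            cmd.add(joinTarget.httpAddr);
        }
        cmd.add(dataDir);
        return cmd;
    }
}
```

The resulting list can be handed to `new ProcessBuilder(list)` in place of the hand-written argument strings.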
I see. I must have gotten them switched, but I did try it both ways. When I join the HTTP port I get:
[cluster-join] 2020/09/02 11:36:42 failed to join cluster at [localhost:4007]: Post "http://localhost:4007/join": dial tcp [::1]:4007: connect: connection refused, sleeping 5s before retry
2020-09-02T11:36:44.332-0400 [WARN] raft: no known peers, aborting election
Well, there is nothing wrong with rqlite itself. It's extensively tested so you must be launching it incorrectly.
I suggest you create the cluster by hand, and compare the options passed to rqlite when launched manually with what your code is doing. There will be a difference.
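To make that comparison easier, the test could log exactly what it is about to execute before calling start(). A minimal sketch, assuming a hypothetical helper named `LaunchDebug`; it relies only on the standard `ProcessBuilder.command()` and `ProcessBuilder.directory()` accessors.

```java
import java.io.File;

/** Hypothetical debugging helper: renders a ProcessBuilder as a shell-like command line. */
final class LaunchDebug {
    private LaunchDebug() {}

    static String describe(ProcessBuilder pb) {
        File dir = pb.directory();
        // A null directory means the child inherits the JVM's working directory.
        String cwd = (dir != null) ? dir.getPath() : System.getProperty("user.dir");
        return "cd " + cwd + " && " + String.join(" ", pb.command());
    }
}
```

Printing `LaunchDebug.describe(pb1)` and `LaunchDebug.describe(pb2)` gives two lines that can be set side by side with the commands typed by hand. Arguments containing spaces are not quoted, so the output is meant for reading rather than for pasting into a shell.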
I'm looking at rqlite-java with failover capabilities in mind, as was discussed here.
To summarize, say our nodes are listening at alice.foo.com:4001, bob.foo.com:8002, and carl.foo.com:6003, and alice is the leader. Right now we'd create our rqlite-java connection to alice. If the node on alice is temporarily or permanently disabled, a new node will take over as leader, but I tested rqlite-java locally and I don't think this client handles that situation.
In the Google group discussion you suggested a domain name service, but that isn't going to work for my purposes. Instead, I'd like to propose following the model of gorqlite, which automatically generates a list of peers and tries those peers if the leader is lost. We can follow 301 redirects too, which I see is implemented for rqlite-js.
Thoughts?
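For concreteness, here is a rough sketch of the retry logic I have in mind: try each known peer in order, skip any that cannot be reached, and follow a 301 to the node it points at. This is not how gorqlite or rqlite-js are actually written, and none of it exists in rqlite-java. `FailoverClient`, `Transport`, `Response`, and the redirect limit are all hypothetical; the transport is abstracted so the logic can be exercised without a network.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical sketch of peer failover with redirect following. */
final class FailoverClient {
    /** Minimal response model: HTTP status, Location header (if any), and the URL that answered. */
    static final class Response {
        final int status;
        final String location; // null unless the node redirected us
        final String servedBy;

        Response(int status, String location, String servedBy) {
            this.status = status;
            this.location = location;
            this.servedBy = servedBy;
        }
    }

    /** Stand-in for the real HTTP call; throws IOException when the node is unreachable. */
    interface Transport {
        Response send(String baseUrl) throws IOException;
    }

    /** Assumed bound so a redirect loop cannot spin forever. */
    private static final int MAX_REDIRECTS = 3;

    private FailoverClient() {}

    static Response execute(List<String> peers, Transport transport) throws IOException {
        IOException last = null;
        for (String peer : peers) {
            try {
                Response r = transport.send(peer);
                // Follow 301s (for example, a follower pointing at the leader), up to the bound.
                for (int hops = 0; r.status == 301 && r.location != null && hops < MAX_REDIRECTS; hops++) {
                    r = transport.send(r.location);
                }
                return r;
            } catch (IOException e) {
                last = e; // peer (or its redirect target) unreachable: try the next peer
            }
        }
        throw new IOException("no reachable peer in " + peers, last);
    }
}
```

With the three nodes above, if alice is down and bob answers with a 301 pointing at carl, the request ends up served by carl. If every peer fails, the caller gets a single IOException wrapping the last failure.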