Failover capabilities - Githubissues

hprovenza commented 4 years ago

I'm looking at rqlite-java in the light of failover capabilities as was discussed here.

To summarize, say our nodes are listening at alice.foo.com:4001, bob.foo.com:8002, and carl.foo.com:6003, and alice is the leader. Right now we'd create our rqlite-java connection to alice. If the node on alice is temporarily or permanently disabled, a new node will take over as leader, but I tested rqlite-java locally and I don't think this client handles that situation.

In the Google group discussion you suggested a domain name service, but that isn't going to work for my purposes. Instead, I'd like to propose following the model of gorqlite, which automatically generates a list of peers and tries those peers if the leader is lost. We can follow 301 redirects too, which I see is implemented for rqlite-js.
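A minimal sketch of the "try the other peers" idea, assuming the peer list is already known. `PeerFailover`, `Request`, and `withFailover` are hypothetical names for illustration and are not part of rqlite-java or gorqlite. The sketch only covers trying peers in order; it does not follow redirects.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper: try each known peer in order until one answers.
final class PeerFailover {

    // A request that is sent to one peer (host:port) and may fail with an I/O error.
    @FunctionalInterface
    interface Request<T> {
        T send(String peer) throws IOException;
    }

    // Returns the result from the first peer that responds without an IOException.
    // If every peer fails, throws an IOException whose cause is the last failure.
    static <T> T withFailover(List<String> peers, Request<T> request) throws IOException {
        IOException last = null;
        for (String peer : peers) {
            try {
                return request.send(peer);
            } catch (IOException e) {
                last = e; // remember the failure and move on to the next peer
            }
        }
        throw new IOException("no reachable peer among " + peers, last);
    }
}
```

With the three nodes above, a call such as `PeerFailover.withFailover(peers, peer -> execute(peer, sql))` would fall through to bob if alice is down, where `execute` stands in for whatever the client does to send a statement to a single node.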

Thoughts?

otoolep commented 4 years ago

How would this list of peers be passed to the library? Yes, if you know them beforehand, I think it makes sense to add that support to the library.

hprovenza commented 4 years ago

I'll start with a config file as I believe I will know them ahead of time.
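One way a config file could carry the peer list, sketched with `java.util.Properties`. The `peers` key, the comma-separated format, and the `PeerConfig` class are assumptions made for this example; the thread does not specify a file format.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical loader for a properties file containing a line such as:
//   peers=alice.foo.com:4001, bob.foo.com:8002, carl.foo.com:6003
final class PeerConfig {

    // Reads the (assumed) "peers" key and splits it on commas,
    // trimming whitespace and skipping empty entries.
    static List<String> parsePeers(Reader in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        String raw = props.getProperty("peers", "");
        List<String> peers = new ArrayList<>();
        for (String part : raw.split(",")) {
            String peer = part.trim();
            if (!peer.isEmpty()) {
                peers.add(peer);
            }
        }
        return peers;
    }
}
```

Taking a `Reader` rather than a file path keeps the parsing testable without touching the filesystem; a caller could pass a `FileReader` for a real file.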

otoolep commented 4 years ago

That's fine, go ahead. Please generate a PR as needed.

hprovenza commented 4 years ago

I'm currently running into an error that I don't understand, similar to this one that you weighed in on. In order to ensure a consistent test environment for failover, I'm trying to start the rqlite nodes from within the test.

        // First node (node.2): HTTP API on localhost:4007, Raft on localhost:4008.
        ProcessBuilder pb1 = new ProcessBuilder("./rqlited", "-node-id", "node.2",
                "-http-addr", "localhost:4007", "-raft-addr", "localhost:4008", "/home/hprovenz/node.2");
        pb1.inheritIO(); // show the node's log output in the test's console
        pb1.directory(new File("/home/hprovenz/rqlite-v5.4.0-linux-amd64"));
        Process node1 = pb1.start();
        // Second node (node.3): joins the cluster through the first node's HTTP address.
        ProcessBuilder pb2 = new ProcessBuilder("./rqlited", "-node-id", "node.3",
                "-http-addr", "localhost:4003", "-raft-addr", "localhost:4004", "-join", "localhost:4007",
                "/home/hprovenz/node.3");
        pb2.inheritIO();
        pb2.directory(new File("/home/hprovenz/rqlite-v5.4.0-linux-amd64"));
        Process node2 = pb2.start();

The servers seem to be starting correctly, judging from the output, but when they try to connect to one another I get the following message from node.2:

2020/09/02 09:49:44 [DEBUG] raft-net: 127.0.0.1:4008 accepted connection from: 127.0.0.1:46440
2020/09/02 09:49:44 [ERR] raft-net: Failed to decode incoming command: unknown rpc type 80

Do you have any insight into why this might happen when the servers are started from Java but not when I start them directly from the command line, or how to work around it?

otoolep commented 4 years ago

I think you're telling one node's Raft communication system to talk to the HTTP port of another node. Check your command line parameters. -join should take the HTTP address of a node, not the Raft address of that node.
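A possible reading of the `unknown rpc type 80` message, offered as an aside rather than a confirmed diagnosis: 80 is the ASCII code for `P`, the first byte of an HTTP `POST` request line. This assumes the Raft transport treats the first byte of an incoming connection as the RPC type, which would fit an HTTP join request arriving on the Raft port. `RpcTypeByte` is a hypothetical class name used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: shows the numeric value of the first byte of an HTTP request line.
final class RpcTypeByte {

    // Returns the first byte of the given text, as an unsigned value.
    static int firstByte(String requestLine) {
        return requestLine.getBytes(StandardCharsets.US_ASCII)[0] & 0xFF;
    }

    public static void main(String[] args) {
        // The join request in the logs is a POST to /join.
        System.out.println(firstByte("POST /join HTTP/1.1")); // prints 80
    }
}
```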

hprovenza commented 4 years ago

I see - I must have gotten them switched, but I did try it both ways. When I join via the HTTP port I get:

[cluster-join] 2020/09/02 11:36:42 failed to join cluster at [localhost:4007]: Post "http://localhost:4007/join": dial tcp [::1]:4007: connect: connection refused, sleeping 5s before retry
2020-09-02T11:36:44.332-0400 [WARN]  raft: no known peers, aborting election

otoolep commented 4 years ago

Well, there is nothing wrong with rqlite itself. It's extensively tested so you must be launching it incorrectly.

I suggest you create the cluster by hand, and compare the options passed to rqlite when launched manually with what your code is doing. There will be a difference.
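To make that comparison easier, the test could print the exact command each `ProcessBuilder` is about to run and diff it against the hand-typed one. This is a small sketch; `CommandLine.render` is a hypothetical helper, while `ProcessBuilder.command()` is the standard JDK accessor for the argument list.

```java
// Hypothetical helper: renders a ProcessBuilder's arguments as one line
// so they can be compared with a manually typed command.
final class CommandLine {

    // Joins the program name and its arguments with single spaces.
    // Arguments containing spaces are not quoted, so this is for reading, not re-execution.
    static String render(ProcessBuilder pb) {
        return String.join(" ", pb.command());
    }
}
```

Calling `System.out.println(CommandLine.render(pb2))` just before `pb2.start()` would show the flags and their order as the child process receives them.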

rqlite / rqlite-java

Failover capabilities #9