royaltm / node-zmq-raft

An opinionated Raft implementation powered by ØMQ

Trying to understand how to dynamically add a new peer #21

Closed psvensson closed 1 year ago

psvensson commented 1 year ago

Hello, I am again trying to find a good way to make a new raft process add itself as a new peer to an existing one. I have made sure to only use external (non-localhost) addresses to avoid confusion.

The examples are quite clear on how to do this using the stand-alone zr-config utility, but it still eludes me how to implement this properly in my own code.

I have made a small cli program which I use to start the instances with, where I can configure the needed parameters for the raft builder. The output of the first instance starting is this:

```
zmq-raft:builder building raft server +0ms
zmq-raft:builder ensuring directory data.path: ".raft/data" +0ms
zmq-raft:builder initializing persistence +1ms
zmq-raft:builder initializing log +0ms
zmq-raft:filelog opening log: ".raft/data/log" +0ms
zmq-raft:builder initializing state machine +1ms
zmq-raft:file-persistence file ".raft/data/raft.pers" opened for reading: 20 +0ms
zmq-raft:snapshotfile reading ".raft/data/snap" +0ms
zmq-raft:file-persistence finished reading: 27 bytes +1ms
zmq-raft:file-persistence file ".raft/data/raft.pers" opened with flags: a +1ms
zmq-raft:file-persistence ready: {"currentTerm":1,"votedFor":"id2","peers":[{"id":"id1","url":"tcp://192.168.86.210:8047"}],"peersUpdateRequest":null,"peersIndex":null} +0ms
zmq-raft:snapshotfile read ".raft/data/snap" logIndex: 0, logTerm: 0, dataSize: 0 +1ms
zmq-raft:snapshotfile ready ".raft/data/snap" +0ms
zmq-raft:indexfile opening existing: ".raft/data/log/00000/00/00/00000000000001.rlog" +0ms
zmq-raft:indexfile opened: 00000000000001 +0ms
zmq-raft:filelog reading request ids from index files down to: 1 +4ms
zmq-raft:filelog read: 0 requests in 0.0 seconds +0ms
zmq-raft:filelog first index: 1, last index: 0, last term: 0 +0ms
zmq-raft:builder initializing raft peer: "id1" +4ms
zmq-raft:builder prevent spiral elections: true +0ms
zmq-raft:leader becoming -+LEADER+- index: 0 term: 2 +0ms
zmq-raft:builder raft-state: Symbol(Leader) term: 2 +55ms
zmq-raft:service ready at: tcp://192.168.86.210:8047 +0ms
zmq-raft:service term: 2 votedFor: "id2" +0ms
zmq-raft:service peers: 0 peersUpdateRequest: null peersIndex: null +0ms
zmq-raft:service commit: 0 applied: 0 +0ms
```
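For reference, the settings implied by the log above can be collected into a plain options object. The property names here (`id`, `data.path`, `router.bind`, `peers`) are my assumptions reconstructed from the debug output, not a verified builder schema:

```javascript
// Sketch of the first instance's configuration, reconstructed from the
// debug output above. Property names are assumptions based on the log
// messages; check the zmq-raft builder docs before relying on them.
const instance1Options = {
  id: 'id1',                                     // raft peer id seen in the log
  data: { path: '.raft/data' },                  // persistence directory
  router: { bind: 'tcp://192.168.86.210:8047' }, // service address
  peers: [
    { id: 'id1', url: 'tcp://192.168.86.210:8047' }
  ]
};

// This matches the "peers" value in the file-persistence "ready" line.
console.log(JSON.stringify(instance1Options.peers));
// → [{"id":"id1","url":"tcp://192.168.86.210:8047"}]
```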

I then tried to start the new instance using a new id and port (and data path), using an array of peers which only included the first instance, but the debug output showed no connection attempt.

I then added (for the second instance) its own id and url to the peer array, and this is then the output from the second instance:

```
zmq-raft:builder building raft server +0ms
zmq-raft:builder ensuring directory data.path: ".raft/data2" +0ms
zmq-raft:builder initializing persistence +1ms
zmq-raft:builder initializing log +0ms
zmq-raft:filelog opening log: ".raft/data2/log" +0ms
zmq-raft:builder initializing state machine +0ms
zmq-raft:file-persistence no such file: ".raft/data2/raft.pers", creating new state +0ms
zmq-raft:file-persistence file ".raft/data2/raft.pers" opened with flags: ax +0ms
zmq-raft:file-persistence ready: {"currentTerm":0,"votedFor":null,"peers":[{"id":"id1","url":"tcp://192.168.86.210:8047"},{"id":"id2","url":"tcp://192.168.86.210:8088"}],"peersUpdateRequest":null,"peersIndex":null} +0ms
zmq-raft:snapshotfile reading ".raft/data2/snap" +0ms
zmq-raft:snapshotfile creating ".raft/data2/snap" logIndex: 0, logTerm: 0, dataSize: 0 +0ms
zmq-raft:snapshotfile ready ".raft/data2/snap" +78ms
zmq-raft:indexfile creating: 00000000000001 with capacity: 16383 +0ms
zmq-raft:filelog reading request ids from index files down to: 1 +170ms
zmq-raft:filelog read: 0 requests in 0.0 seconds +0ms
zmq-raft:filelog first index: 1, last index: 0, last term: 0 +1ms
zmq-raft:builder initializing raft peer: "id2" +171ms
zmq-raft:builder prevent spiral elections: true +0ms
zmq-raft:builder raft-state: Symbol(Follower) term: 0 +6ms
zmq-raft:service ready at: tcp://192.168.86.210:8088 +0ms
zmq-raft:service term: 0 votedFor: null +0ms
zmq-raft:service peers: 1 peersUpdateRequest: null peersIndex: null +0ms
zmq-raft:service commit: 0 applied: 0 +0ms
raft peers: Map(1) { 'id1' => 'tcp://192.168.86.210:8047' }
zmq-raft:leader election timeout term: 1 +0ms
zmq-raft:builder raft-state: Symbol(Candidate) term: 1 +525ms
zmq-raft:rpc-socket rpc.connect: tcp://192.168.86.210:8047 +0ms
```

Here we see a last line that it did not output on the first (not shown) attempt: it does try to connect to the first instance. Success. But this is then the output in the window of the first instance:

```
zmq-raft:service request vote rpc: no such peer: id2 +31s
```

It was mentioned before that this should be possible (and is indeed possible using the zr-config example) but I seem to be missing something.

psvensson commented 1 year ago

Hmm. After a while on the 'id2' service, I do see some activity which feels like it is doing the right thing:

```
zmq-raft:leader election timeout term: 1 +0ms
zmq-raft:builder raft-state: Symbol(Candidate) term: 1 +570ms
zmq-raft:rpc-socket rpc.connect: tcp://192.168.86.210:8047 +0ms
zmq-raft:rpc-socket rpc.request queue full: tcp://192.168.86.210:8047 +11m
zmq-raft:rpc-socket rpc.request drain: tcp://192.168.86.210:8047 +59s
zmq-raft:rpc-socket rpc.request queue full: tcp://192.168.86.210:8047 +3m
zmq-raft:rpc-socket rpc.request drain: tcp://192.168.86.210:8047 +1m
zmq-raft:rpc-socket rpc.request queue full: tcp://192.168.86.210:8047 +2m
zmq-raft:rpc-socket rpc.request drain: tcp://192.168.86.210:8047 +873ms
zmq-raft:rpc-socket rpc.request queue full: tcp://192.168.86.210:8047 +15s
zmq-raft:rpc-socket rpc.request drain: tcp://192.168.86.210:8047 +1m
zmq-raft:rpc-socket rpc.request queue full: tcp://192.168.86.210:8047 +2m
zmq-raft:rpc-socket rpc.request drain: tcp://192.168.86.210:8047 +1s
```

But I had expected some kind of response in the 'id1' instance's DEBUG log that a new peer had joined, instead of the quite opposite-feeling 'no such peer: id2'.

However, it might be that the logs are normal and I'm just misinterpreting them.

psvensson commented 1 year ago

Continuing my investigation, I have now tried the following tactics to understand where my configuration fails:

  1. Tried to use localhost/127.0.0.1 instead of an external address, but that yielded (predictably) the same result.
  2. Used the 'tcp://*:8047' format for the router url instead of the full IP address (I had earlier re-used the address both for building the peer list and in the router configuration), but that did not help either (admittedly low probability, but who knows).
  3. Wrote three shell scripts which each started an instance with the full peer list of the other instances, in case a full quorum was needed (as it is in the README example) before adding a new peer. This did not work either.

I have pushed the latest WIP with the shell scripts (which need to be flagged executable and run from the root, like './src/start1.sh'); these might make the debug output easier to read, since it will then be in respectable colour.

https://github.com/psvensson/raft-runner

I fully understand that this is an open-source project and can only be maintained when time allows. So no panic on my part; I just happened to be awake at the moment.

psvensson commented 1 year ago

I have now understood why I have been having so many problems.

In the README it is mentioned that there is an example state machine that can be used. As I just wanted to get the cluster up and running first and implement my own later, I focused on that. I got the impression that the state machine was separate from the raft cluster but could react to logs as needed (and would need to implement the optional log compaction at some stage, of course).

In the first part of the README, the builder example does not include 'pub' properties for the sample state machine, which is why I got the feeling it might not be needed.

Now that I have added a broadcast part to my config (and added the 'pub' property to the peer list), I am able to get a three-instance cluster up and running using my own code, exactly as is done with the zmq-raft.js examples. Phew.
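For anyone landing here with the same problem, the peer list that finally worked for me looks roughly like this. The exact shape of the 'pub' entry is my assumption drawn from the README's zmq-raft.js example (not a verified schema), the third peer 'id3' and all the pub port numbers are hypothetical placeholders from my own setup:

```javascript
// Three-peer list with a separate 'pub' url per peer for the Broadcast
// State Machine. The pub ports and the third peer are illustrative
// placeholders, not values taken from the zmq-raft documentation.
const peers = [
  { id: 'id1', url: 'tcp://192.168.86.210:8047', pub: 'tcp://192.168.86.210:8048' },
  { id: 'id2', url: 'tcp://192.168.86.210:8088', pub: 'tcp://192.168.86.210:8089' },
  { id: 'id3', url: 'tcp://192.168.86.210:8128', pub: 'tcp://192.168.86.210:8129' }
];

// In my setup, peers could only join once every entry carried a pub url.
console.log(peers.every(p => typeof p.pub === 'string')); // → true
```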

My last remaining question is then whether it should be possible to start a cluster without a broadcast state machine, or whether this is an intrinsic part of making things work.

royaltm commented 1 year ago

Broadcast State Machine is just an implementation of the state machine API protocol and can be replaced with another implementation.

The idea behind the BSM was to make the RAFT server's state machine implementation opaque. This way there is no need to understand what the data in the log actually mean to the raft server peers. Instead, the BSM treats the received log updates as opaque blobs and redistributes them as they come.

BSM allows any number of clients to connect to such a cluster and get state changes rapidly distributed among them.
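The redistribution idea can be illustrated with a toy fan-out. This is only a conceptual sketch, not zmq-raft's actual BSM code: entries arrive as opaque buffers and are forwarded to every subscriber unmodified, with no attempt to decode them.

```javascript
// Toy illustration of the broadcast idea: log entries are opaque blobs
// and the broadcaster forwards them verbatim, without interpreting them.
// This is NOT the real BroadcastStateMachine, just the concept.
class ToyBroadcaster {
  constructor() { this.subscribers = []; }
  subscribe(fn) { this.subscribers.push(fn); }
  // applyEntries: what a state machine sees; here we only redistribute.
  applyEntries(entries) {
    for (const blob of entries) {
      for (const fn of this.subscribers) fn(blob); // blob is never decoded
    }
  }
}

const bsm = new ToyBroadcaster();
const received = [];
bsm.subscribe(blob => received.push(blob.toString('hex')));
bsm.applyEntries([Buffer.from([0xde, 0xad]), Buffer.from([0xbe, 0xef])]);
console.log(received); // → [ 'dead', 'beef' ]
```

The point of this shape is that the broadcaster works for any payload format, which is what lets the real BSM serve clusters whose log contents it knows nothing about.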

I'm pretty sure I covered this topic here: https://github.com/royaltm/node-zmq-raft#use-cases

BSM is essential if you want to have a small number of raft peers (responsible for consensus) and utilize a larger number of zmq-raft clients which aren't RAFT peers. BSM distributes the state changes to those clients.

A real-life use case that ignited this project was a 5- (and later 7-) peer RAFT cluster on dedicated servers and 1000 web services that were merely RAFT clients, providing real-time distributed feedback to users' browser applications. Such a setup prompted the usage of BSM; in fact, BSM was born because of it.

An alternative real-life use case was a constant number of RAFT peers (on a number of installed touch-panel devices) running in-process with another Node.js application, providing consensus on the system state among the devices. In this instance the BSM was replaced with a custom state machine implementation (for a DATALOG update stream).

psvensson commented 1 year ago

Thank you for the detail and examples. And it was a good description of the two use cases; I have read that part quite a bit over the last week.

The way I understood it, then, was that one needs to have a state machine, or instead make use of the provided BSM (for use with remote clients).

What I did was implement my own, very small state machine, and therefore I did not add the 'pub' property to the peer list. However, no peers could join the first node unless I added the 'pub' property with a dedicated port for the BSM to use.

That is what I'm confused about at the moment. It does work, and my own custom state machine gets state properly and everything, but I don't understand why the BSM needs to be involved at all.

However, I now have a good place to work from, and will start figuring out how to dynamically add nodes next.

royaltm commented 1 year ago

Hi, I'm closing this issue, because it's not really an issue.

Feel free to ask questions and post updates in the Discussions.