hitchyjs / scull

Raft Consensus for Node.js, backed by LevelDB
MIT License

Put event on peers #5

Open kevinkleine opened 6 years ago

kevinkleine commented 6 years ago

When I put something in the db (memdown), a "put" event is emitted on that peer (A). On the other peers (B and C), the data seems to be updated silently. How can I get a notification on peer B that the data has changed, following a put on peer A?

kevinkleine commented 6 years ago

Got it:

const changes = somePeer.createReadStream({
    live: true
})

changes.on('data', function(data) {
    console.log('change', data)
})

soletan commented 6 years ago

perfect! ;)

kevinkleine commented 6 years ago

Is there a specific event I should wait for before creating the stream and/or subscribing to data on it? My tests succeed only intermittently: half the time they work perfectly, the other half the handler receives no updates.

soletan commented 6 years ago

Not that I know of ... but that's mostly because I've recently been busy revising libskiff into scull and haven't found time to investigate this part of scull. Maybe check scull's life-cycle events, such as ~leader~ elected, which marks the moment a leader has been chosen and the cluster has thus become operational.

One of the major revisions we've been working on is switching most of the code from Node.js-style callbacks to promises, so timing becomes a lot easier to control. Those revisions haven't been released yet because of ongoing issues with the cluster's resilience tests. Regrettably, it might take quite some more time to fix those tests.
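
A minimal sketch of the suggestion above, waiting for a life-cycle event before subscribing to changes. It assumes only that the node behaves like a Node.js EventEmitter and emits elected, as mentioned in this thread; waitForEvent and fakeNode are hypothetical names used for illustration and are not part of scull.

```javascript
const { EventEmitter } = require('node:events');

// Resolve with the event's arguments once `emitter` fires `eventName`,
// or reject if that does not happen within `timeoutMs` milliseconds.
function waitForEvent(emitter, eventName, timeoutMs = 5000) {
    return new Promise((resolve, reject) => {
        const timer = setTimeout(() => {
            emitter.removeListener(eventName, onEvent);
            reject(new Error(`timed out waiting for "${eventName}"`));
        }, timeoutMs);

        function onEvent(...args) {
            clearTimeout(timer);
            resolve(args);
        }

        emitter.once(eventName, onEvent);
    });
}

// Stand-in for a scull node (assumption: a real node emits "elected").
const fakeNode = new EventEmitter();
setTimeout(() => fakeNode.emit('elected'), 10);

waitForEvent(fakeNode, 'elected', 1000).then(() => {
    // With a real node, this is where one might create the live stream,
    // e.g. somePeer.createReadStream({ live: true }).
    console.log('cluster reported a leader, safe to subscribe');
});
```

Note that the listener has to be attached before the event fires. If the node has already elected a leader by the time waitForEvent is called, this sketch would time out rather than resolve.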

kevinkleine commented 6 years ago

After adding a wait for the "elected" event and doing some more testing, it now seems that put commands on the leader get replicated to the followers, but not the other way around. Any idea what I could be doing wrong? Are there additional events or different methods on the follower that I should be using?

soletan commented 6 years ago

I think there is a basic misunderstanding of the Raft consensus protocol as it is used by scull. Replication always works from leader to followers. Any follower that wants to put something into the database needs to ask the current leader to do so. scull is designed to handle this for you transparently: when you put on a follower node, the request is forwarded to the current leader. The leader then adjusts the state of the cluster, waiting for consensus before actually changing the selected key in the database. Asking for consensus is equivalent to replication here.
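
To illustrate the flow described above, here is a toy in-memory model. It is not scull's implementation and involves no networking; ToyNode, ToyCluster and the reachable set are hypothetical names invented for this sketch. It only shows that a put issued on a follower is routed to the leader, and that the leader writes the key only after a majority of nodes acknowledges.

```javascript
// Toy model of leader-based replication (illustration only, not scull code).
class ToyNode {
    constructor(id) {
        this.id = id;
        this.db = new Map();
        this.cluster = null;
    }

    // A put on any node ends up at the current leader.
    put(key, value) {
        const leader = this.cluster.leader;
        if (leader !== this) {
            return leader.put(key, value); // follower forwards the request
        }
        return this.cluster.replicate(key, value);
    }
}

class ToyCluster {
    constructor(nodes) {
        this.nodes = nodes;
        this.leader = nodes[0];            // assumption: first node is leader
        this.reachable = new Set(nodes);   // nodes currently able to acknowledge
        for (const node of nodes) node.cluster = this;
    }

    // Apply the write only if a majority of all nodes acknowledges it.
    replicate(key, value) {
        const acks = this.nodes.filter(node => this.reachable.has(node));
        const majority = Math.floor(this.nodes.length / 2) + 1;
        if (acks.length < majority) {
            throw new Error(`no consensus: ${acks.length} of ${majority} required acks`);
        }
        for (const node of acks) node.db.set(key, value);
        return acks.length;
    }
}

const [a, b, c] = ['A', 'B', 'C'].map(id => new ToyNode(id));
const cluster = new ToyCluster([a, b, c]);

b.put('greeting', 'hello'); // issued on follower B, handled by leader A
console.log(a.db.get('greeting'), c.db.get('greeting'));
```

In this model the follower never writes to its own database directly; its copy changes only as a side effect of the leader's replication step.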

kevinkleine commented 6 years ago

On the follower I have tried a number of different things:

await ( new Promise( r => shell.levelup.put( key, values, r ) ) )

which does not trigger updates on the leader

await ( new Promise( r => shell.peers().then( peers => shell.waitFor( peers ).then( () => {
    console.log( '"waitFor" finished' )
    r()
} ) ) ) )
await ( new Promise( r => shell.levelup.put( key, values, r ) ) )

which always times out and never gets to the 'put'

shell.peers().then( peers => shell.waitFor( peers ) )
await ( new Promise( r => shell.levelup.put( key, values, r ) ) )

This works once: the leader 'sees' the update. The second update never happens, as the first waitFor seems to time out:

Error: timeout RPC to /ip4/127.0.0.1/tcp/7891, action = Command
Error: timed out waiting for consensus

In my test I have two nodes/shells running side-by-side. No options have been set in the constructor other than the peer address and the db (memdown). Any ideas how I can get this to work?
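
One detail of the snippets above that may be worth checking: passing the promise's resolve function directly as a Node-style callback means an error handed to the callback becomes the resolved value, so the await never throws. The sketch below contrasts that with a wrapper that rejects on error. fakeDb is a hypothetical stand-in with a put( key, value, callback ) signature like the one used above; it is not the scull or levelup API itself.

```javascript
// Hypothetical callback-style store used only for this demonstration.
const fakeDb = {
    data: new Map(),
    put(key, value, callback) {
        if (key === '') {
            setImmediate(callback, new Error('key cannot be empty'));
            return;
        }
        this.data.set(key, value);
        setImmediate(callback, null);
    },
};

// Pattern from the snippets above: the callback IS resolve, so an error
// argument silently becomes the resolved value.
const putSwallowing = (db, key, value) =>
    new Promise(resolve => db.put(key, value, resolve));

// Alternative: reject when the callback receives an error.
const putStrict = (db, key, value) =>
    new Promise((resolve, reject) =>
        db.put(key, value, err => (err ? reject(err) : resolve())));

putSwallowing(fakeDb, '', 1)
    .then(result => console.log('resolved with:', result && result.message));

putStrict(fakeDb, '', 1)
    .catch(err => console.log('rejected with:', err.message));
```

With the second form, a failed put surfaces as an exception at the await, which may make it easier to tell a write that was never replicated apart from one that failed outright.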

soletan commented 6 years ago

Could you provide the code that sets up this two-node cluster? It would help to see a single test script that starts the network and runs the test you are performing. I'm hoping to spot an issue with your setup ;) ... but I fear there isn't one and that this comes down to problems probably present in the original libskiff as well.

I tried to reproduce this case using the resilience tests of the current development version (found in branch develop here on GitHub). Those tests set up a cluster that runs for several minutes while repeatedly picking a random node and asking it to put or get a randomly chosen LevelDB key. The tests normally run against a 3-node cluster, but I don't think having just 2 nodes is an issue here. Nonetheless, I just ran such a resilience test on a 2-node cluster and it had no issues reading/writing more than 19,000 times.
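
As background for comparing 2-node and 3-node clusters: Raft commits an entry once a majority of the cluster has acknowledged it. The short sketch below computes the majority size and the number of unreachable nodes a cluster of a given size can tolerate; majority and tolerated are hypothetical helper names, not scull functions.

```javascript
// Smallest number of nodes that forms a majority in a cluster of n nodes.
const majority = n => Math.floor(n / 2) + 1;

// Number of nodes that may be unreachable while a majority still exists.
const tolerated = n => n - majority(n);

for (const n of [2, 3, 5]) {
    console.log(`${n} nodes: majority ${majority(n)}, tolerates ${tolerated(n)} down`);
}
```

In a 2-node cluster the majority is both nodes, so consensus can only be reached while each node can reach the other.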

However, all of this applies to the current development version, which differs considerably from the latest release. What keeps me from releasing the improvements in the development version is that those resilience tests occasionally fail when run in a more chaotic way (by randomly killing and reviving cluster nodes during the test). That doesn't mean the resilience tests were any more stable in either the original libskiff or the current release of libscull, though. So I'm undecided on whether to release this development version, and on how to do so while documenting those problems as known issues to be fixed soon. I'd like to see resilience tests prove an implementation production-ready for running clusters before releasing anything here.