Continuous Inter Node Block Synchronization Flow - proposal to change the condition for out-of-sync nodes

avilanthe1 commented 6 years ago

The out-of-sync condition “no blocks are committed for some time” is very naive, and I expect it to rarely correspond to a node that is actually out of sync.

Let’s examine the cases that lead to an out of sync node:

The node joined an already running network.
The node had to restart its program.
Full connectivity loss.
Network partition.
The node is being censored by all the other nodes.

(Any bug in the code can also cause a node to go out of sync but, in general, the different bugs fall into the cases above.)

Now, “no blocks are committed for some time” does capture all of the above, but it also captures the case where there is a liveness issue in the protocol. A liveness issue is the most probable situation when no blocks are committed for quite some time. In a liveness issue, all the nodes are synced but can't progress for some reason. In this case, if an out-of-sync error starts to pop up in one node, shortly all of the other nodes will raise it as well, and the wrong error handlers will be activated, diverting the nodes from the real problem (i.e., the nodes will handle an out-of-sync error rather than the liveness issue).

Looking at the cases that cause a node to be out of sync, it seems there are easy solutions for each. The first two can be easily detected, and so can a full connectivity loss. In the case of a network partition, one of two things may happen:

A full network stop: no node will be out of sync. In that case it is just a matter of time until the network is fully restored, or some manual handling is required.
The other partition continues committing blocks: here, once the network is restored the node will be out of sync, and it will be able to see other nodes gossiping “future” messages that don’t fit its own “present” state.

The last case is where other nodes are purposely censoring the out-of-sync node. Note that this can be regarded as the same as a full connectivity loss, but it is harder to detect. Anyhow, as long as the censored node doesn’t see any “future” message, it can’t really count itself as out of sync, since there is little (if anything) it can do about it (asking the other nodes to help in syncing is useless in that case).

Summing up the above, my proposal is to change the condition under which a node changes its status to out-of-sync. The condition should be that the node has encountered a valid and verified “future message”.

One proposal for such a “future message” can be: a block at a height of at least (cur_height+2) that is properly signed and approved to be committed (by the consensus algorithm) by known public keys (not committee members, since the committee composition is unknown to the out-of-sync node).
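To make the condition above concrete, here is a minimal sketch of the check a node could run on an incoming block proof. Everything named here is hypothetical and not taken from the spec: the `BlockProof` shape, the `quorum` threshold (the number of valid signatures that counts as "approved to be committed"), and the `verify` callback that stands in for real signature verification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass(frozen=True)
class BlockProof:
    """Hypothetical shape of a gossiped, already-committed block proof."""
    height: int
    block_hash: bytes
    signatures: Dict[str, bytes]  # signer public key -> signature over block_hash


def is_future_message(
    proof: BlockProof,
    cur_height: int,
    known_keys: Set[str],
    quorum: int,
    verify: Callable[[str, bytes, bytes], bool],
) -> bool:
    """Return True if the proof should flip the node to out-of-sync.

    Assumptions: `quorum` is how many valid signatures count as approval,
    and `verify(key, message, signature)` is supplied by the caller.
    """
    # Only blocks at least two heights ahead count as "future".
    if proof.height < cur_height + 2:
        return False
    # Count signatures from known public keys only; committee membership
    # at that height cannot be checked by an out-of-sync node.
    valid = sum(
        1
        for key, sig in proof.signatures.items()
        if key in known_keys and verify(key, proof.block_hash, sig)
    )
    return valid >= quorum
```

Note that a block at exactly cur_height+1 never triggers the check, so a node that is simply waiting for the next block is not treated as out of sync.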

Note that observing such a message captures all of the cases discussed above that cause a node to be out of sync. There are more messages that could be considered “future messages”, but for simplicity (especially in v1) I suggest using only this kind of message as the out-of-sync condition (I’m not sure adding more messages will optimize the protocol much, assuming block latency is low).

If you are not convinced, I want to give an example of why I think the current condition is bad. Consider a synced node (soon to be out of sync) and a committee that has recently reached agreement on a block but hasn’t yet propagated that block in the network. Now, this synced node has some bug in its operating system that causes it to miss all the messages propagated by that committee to the network. Assuming our node is the only one that hasn’t received the last round's messages, it is the only one that becomes out of sync. Our out-of-sync node's timer is ticking from the last committed block it has seen, and meanwhile the network progresses, with more and more blocks being proposed, committed and propagated to the network (but missed by our out-of-sync node).

The question is: when should the out-of-sync node's timer raise the alarm so the node can start its syncing process?

One proposal might be: tune the timer to the maximal time it takes a single committee to reach consensus. It is a very reasonable proposal, since any time less than that might imply the committee is still in the process of consensus, and any time more than that means the committee has finished while the committed block has not yet arrived. The problem is that this time can be rather long (relative to the time it takes a committee in the optimistic case), and I argue that a lot of blocks might be committed before the timer's alarm goes off. For instance, in PBFT, committee timers (to replace an unresponsive leader) grow exponentially every time a view change happens.
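A rough back-of-the-envelope sketch of this argument, assuming the view-change timeout doubles on every view change (a common choice, but an assumption here) and using made-up numbers for the base timeout and the optimistic block time:

```python
def view_timeout(base_timeout: float, view: int) -> float:
    """Leader timeout in the given view, assuming it doubles per view change."""
    return base_timeout * (2 ** view)


def blocks_missed(timer: float, optimistic_block_time: float) -> int:
    """Blocks the rest of the network can commit while the timer is running."""
    return int(timer // optimistic_block_time)


# Hypothetical numbers: 4 s base timeout, 2 s per block in the optimistic case.
# If the timer must cover a committee that went through 5 view changes,
# it has to be at least 4 * 2**5 = 128 seconds long.
worst_case_timer = view_timeout(4.0, 5)
print(worst_case_timer)                      # 128.0
print(blocks_missed(worst_case_timer, 2.0))  # 64
```

With these illustrative numbers the out-of-sync node would already be 64 blocks behind before the timer-based condition fires, whereas a future-message condition could fire as soon as a verified block two heights ahead is observed.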

To conclude, the more of the blockchain nodes need to sync, the greater the burden out-of-sync nodes place on the network (more network overhead), and in any time interval more nodes are expected to be unable to participate in the consensus (slower protocol, the f-parameter should be increased).

talkol commented 6 years ago

We originally had the future message; the problem was that a future message can be faked by a Byzantine committee. We can't know that this committee isn't the real one, because we can only verify the random seed with the entire chain.

avilanthe1 commented 6 years ago

Yes, you are right.

The proposal is not about the acceptance of future blocks; it is about the condition a node should use to determine whether it is in sync or not.

Of course, there might be some Byzantine committee producing future blocks, pushing our node into a syncing status. But when the node queries some other node for a sync, it can never be fooled about the next block, since the committee of the next block is well known to our node.

In case there is some committee that keeps proposing future blocks just to trigger the syncing mechanism of all the other nodes, the worst that happens is a small performance problem, but the protocol is still live and safe. Moreover, this problem is mitigated quite easily:

Technical solution - Nodes can ban a certain committee composition on future blocks (only blocks with height > (cur_height + 1)) so their syncing mechanism won't turn on.
Incentive solution - Reputation should, anyway, diminish when there is proof of not complying with the protocol's instructions (e.g., signing a block when not part of its committee at that height).
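
The technical solution could be sketched roughly as follows. The class and method names are hypothetical, and a "committee composition" is modelled simply as the set of signer public keys on the future block, which is an assumption rather than something the spec defines.

```python
from typing import FrozenSet, Iterable, Set


class FutureBlockFilter:
    """Hypothetical per-node ban list for committee compositions."""

    def __init__(self) -> None:
        self._banned: Set[FrozenSet[str]] = set()

    def ban(self, signers: Iterable[str]) -> None:
        """Ban a composition that was caught producing bogus future blocks."""
        self._banned.add(frozenset(signers))

    def should_trigger_sync(
        self, height: int, cur_height: int, signers: Iterable[str]
    ) -> bool:
        """Decide whether a verified block may turn the syncing mechanism on."""
        # The ban only applies to future blocks (height > cur_height + 1);
        # the next block is checked against the committee the node knows.
        if height <= cur_height + 1:
            return False
        return frozenset(signers) not in self._banned
```

A node would call `ban(...)` after a sync attempt shows that the signers of a future block were not a legitimate committee, so repeated fake blocks from the same set stop triggering syncs.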

orbs-network / orbs-spec

Continuous Inter Node Block Synchronization Flow - proposal to change the condition for out-of-sync nodes #19