Block Producer Redundancy and Automatic Failover

A block producer responsible for a critical network function should be able to run at least two live, independent block-producing stacks (software and hardware). Both stacks produce blocks for the same registered producer account and sign with the same key, so from the network's viewpoint they are identical in identity and function.

The obvious benefit is that a failure in any one stack, at any time, should be completely transparent to the network as a whole: blocks continue to be produced and no service degradation is observed.

Side benefits include being able to take a stack offline for maintenance or upgrades without affecting service-level commitments. The process for doing this now, as I see it, is to hold a bag of sand in one hand and sweat nervously while quickly swapping it for the gold statue in the other.

So can this be done now? I tested it with two nodeos instances on the Telos Testnet running Leap v5.0.2. The answer is... sort of.

The connection routing was as follows: one node connected to two external P2P peers, and the other node connected only to the first node's P2P endpoint.

With this arrangement, block production appeared to be unaffected, in the sense that the producer table showed no increase in missed blocks and the block production counter kept increasing.

However, there was significant churn in the logs, with many "Unlinkable block" exceptions, and the second node was "booted" from the P2P connection for a short while. The nodes would then link and sync up again, and the process would repeat in the next production round.

Certainly less than ideal, but I don't see why this can't be tweaked and cleaned up to allow for (and reject/select) duplicate blocks cleanly.

Update: because the second producing node was connected only to the first node's P2P endpoint, I wasn't sure whether the noise came from the node's P2P incoming block filter logic, and was therefore limited to just that node, or whether the duplicate blocks would get out to the rest of the network.

Well, I received feedback from a third party confirming that it was observed as a "double producing" node, so that's that.
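
The question of whether the duplicates stay local or escape can be framed as reachability over the test topology. The following is only a rough model, not a description of how nodeos actually relays blocks: the node names are made up, links are treated as bidirectional, and a node listed in `non_relaying` is assumed to accept a block but never forward it.

```python
from collections import deque

# Hypothetical names for the test topology described above.
LINKS = {
    ("node_a", "peer_1"),   # first producing node -> external peer
    ("node_a", "peer_2"),   # first producing node -> external peer
    ("node_b", "node_a"),   # second producing node -> first node only
}

def reachable(origin, links, non_relaying=frozenset()):
    """Return the nodes that a block produced at `origin` can reach.

    Nodes in `non_relaying` receive the block but do not pass it on
    (a stand-in for an incoming block filter that drops duplicates).
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = {origin}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        if node != origin and node in non_relaying:
            continue  # received here, but not forwarded further
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {origin}

# If node_a relays everything, node_b's blocks reach the external peers.
print(sorted(reachable("node_b", LINKS)))
# If node_a filtered the duplicates, they would stop at node_a.
print(sorted(reachable("node_b", LINKS, non_relaying={"node_a"})))
```

The third-party report of a "double producing" node corresponds to the first case in this model: the duplicates did not stop at the first node.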

An automatic failover solution may still be possible: a fallback producer monitors the primary producer's block production in real time and gates (withholds) its own blocks for as long as the primary's blocks are being received. This would have to happen quickly (within tens of milliseconds), since the goal is to maintain a fully redundant, never-miss-a-block service level.
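
To make the gating idea concrete, here is a minimal sketch of the decision logic only. Nothing here is a nodeos or Leap API: `FallbackGate`, its methods, the slot numbering, and the 50 ms timeout are all hypothetical, and time is passed in explicitly so the behaviour is easy to reason about.

```python
from dataclasses import dataclass, field

GATE_TIMEOUT_MS = 50  # assumed deadline, in the "tens of milliseconds" range

@dataclass
class FallbackGate:
    """Holds fallback blocks and releases one only if the primary's
    block for the same slot has not arrived within the timeout."""
    timeout_ms: int = GATE_TIMEOUT_MS
    primary_seen: set = field(default_factory=set)  # slots covered by the primary
    held: dict = field(default_factory=dict)        # slot -> (block, held_at_ms)

    def on_primary_block(self, slot):
        # The primary covered this slot: discard any held fallback block.
        self.primary_seen.add(slot)
        self.held.pop(slot, None)

    def on_fallback_block(self, slot, block, now_ms):
        # Hold the fallback block unless the primary already covered the slot.
        if slot not in self.primary_seen:
            self.held[slot] = (block, now_ms)

    def poll(self, now_ms):
        # Release every held block whose deadline has passed.
        released = []
        for slot in sorted(self.held):
            block, held_at = self.held[slot]
            if now_ms - held_at >= self.timeout_ms:
                released.append(block)
                del self.held[slot]
        return released

gate = FallbackGate(timeout_ms=50)

# Slot 1: the primary's block arrives in time, so the fallback block is dropped.
gate.on_fallback_block(1, "fallback-block-1", now_ms=0)
gate.on_primary_block(1)
print(gate.poll(now_ms=100))

# Slot 2: the primary is silent, so the fallback block is released at the deadline.
gate.on_fallback_block(2, "fallback-block-2", now_ms=500)
print(gate.poll(now_ms=520))
print(gate.poll(now_ms=550))
```

This sketch does not handle the case where the primary's block shows up just after the fallback block has been released, which would again look like double producing to the network. Choosing the timeout is therefore a trade-off between missing a block and emitting a duplicate.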

