Casper node failover - Githubissues

Context

Casper node operations currently have no automatic failover, so a failure of the primary node can disrupt service. The proposed design pairs a primary (main) node with a secondary (slave) node: the nodes monitor each other, and failover is triggered when specific conditions are met.

Goal

The goal of this task is to implement, or adopt, a failover module for the Casper node. The module should keep network services highly available by automatically switching to a backup node when the primary node fails. The switch should cause only a minimal performance drop and must avoid double-signing, which would incur penalties.

Requirements

Dual-node architecture: one main and at least one slave.
Each node should have two public keys: one for main operations and the second for failover scenarios.
Nodes must regularly ping each other, with intervals and monitoring duration configurable through settings.
The failover process should activate the slave node as the primary if the main node becomes unresponsive for a specified period.
Include internal replication between main and slave nodes to prevent performance degradation during failover.
The system must avoid double-signing, drawing on experience from existing solutions such as Horcrux in the Cosmos SDK ecosystem.
The solution should seamlessly revert to the original configuration once the main node is operational again.
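
The ping, failover, and revert requirements above can be sketched as a small state machine running on the slave node. This is an illustrative Python sketch only: `FailoverConfig`, `FailoverMonitor`, and every field name are hypothetical and not part of casper-node. Timestamps are passed in explicitly to keep the logic deterministic; a real deployment would presumably use a monotonic clock and an actual transport for the pings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverConfig:
    # Hypothetical settings; names and units are assumptions.
    ping_interval_s: float     # how often the caller's ping loop runs
    failover_timeout_s: float  # silence after which the slave takes over
    recovery_window_s: float   # how long main must stay healthy before reverting


class FailoverMonitor:
    """Runs on the slave and decides whether it should act as primary."""

    def __init__(self, config: FailoverConfig, now: float) -> None:
        self.config = config
        self.last_seen = now      # time of the last reply from main
        self.healthy_since = now  # start of main's current unbroken streak
        self.promoted = False     # True while the slave acts as primary

    def record_pong(self, now: float) -> None:
        """Call whenever main answers a ping."""
        if now - self.last_seen >= self.config.failover_timeout_s:
            # Main was silent too long, so its healthy streak restarts here.
            self.healthy_since = now
        self.last_seen = now

    def tick(self, now: float) -> bool:
        """Re-evaluate state; returns True if the slave should act as primary."""
        silent_for = now - self.last_seen
        if not self.promoted:
            if silent_for >= self.config.failover_timeout_s:
                self.promoted = True
        elif (silent_for < self.config.failover_timeout_s
              and now - self.healthy_since >= self.config.recovery_window_s):
            # Main has been back long enough: revert to the original roles.
            self.promoted = False
        return self.promoted
```

In this sketch the recovery window keeps the roles from flapping when the main node returns only briefly. It does not by itself prevent double-signing: the slave would still need to stop signing before the main node resumes.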

References:

Review existing failover mechanisms in blockchain systems, such as the custom Tendermint fault-tolerance applications by Forbole, Figment, and Certus One.
Analyze the Horcrux threshold Tendermint signer as a model for the Casper node failover system.
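
As a starting point for that analysis, signers in the Tendermint ecosystem generally guard against double-signing by remembering the last position they signed and refusing anything that does not move forward. The Python sketch below illustrates only that high-watermark idea, not Horcrux's threshold scheme. `SignPosition`, `DoubleSignGuard`, and the `era`/`height` fields are hypothetical names chosen for illustration, and what a Casper validator actually signs is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True, order=True)
class SignPosition:
    # Illustrative ordering key; compared as the tuple (era, height).
    era: int
    height: int


class DoubleSignGuard:
    """Allows a signature only if it cannot conflict with an earlier one."""

    def __init__(self) -> None:
        # Last (position, digest) that was approved for signing, if any.
        self._last: Optional[Tuple[SignPosition, str]] = None

    def may_sign(self, position: SignPosition, digest: str) -> bool:
        if self._last is None:
            allowed = True
        else:
            last_position, last_digest = self._last
            if position > last_position:
                # Strictly newer position: always safe.
                allowed = True
            else:
                # Same position is fine only for the identical payload;
                # older positions are always refused.
                allowed = position == last_position and digest == last_digest
        if allowed:
            self._last = (position, digest)
        return allowed
```

For failover, the watermark would have to be persisted and shared between the main and slave nodes (for example through the internal replication listed in the requirements), so that whichever node is active sees what the other has already signed.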

teonite / casper-node

Casper node failover #13