status-im / nimbus-eth2

Nim implementation of the Ethereum Beacon Chain
https://nimbus.guide

Zero-downtime restart and upgrade procedures #1539

Open zah opened 4 years ago

zah commented 4 years ago

Software updates and planned restarts should be handled with zero downtime. This is a challenge because deploying a new installation takes time, and any restart involves reloading at least some run-time state from the database and reconnecting to the network.

To solve this problem and to address the planned long-term merge of our Eth1 and Eth2 clients, we can introduce a new scheme for our distributed binaries:

1) There is a single binary called nimbus used to launch and control other processes.
2) Specialized binaries for different functions such as beacon_node, validator_client and eth1_node exist in versioned subfolders.
3) The upgrade procedure consists of deploying new versioned binaries and launching a hand-off procedure.
4) The hand-off procedure starts the new binaries and allows them to sync with the network before assigning any validator duties to them.
5) Once the new nodes are synced, the validator duties are re-assigned through a safe two-phase commit protocol.
6) The user can roll back to a previous version quickly in case of problems.
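The hand-off in steps 4 and 5 above could be sketched roughly as follows. This is only an illustration in Python, not Nimbus code: the `Node` class, its `prepare_release` / `prepare_accept` methods and the `hand_off` function are hypothetical names invented for this sketch. The one property it tries to show is that duties move only after both sides agree, and that a key is never active on two nodes at once.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a versioned beacon_node process."""
    version: str
    synced: bool = False
    active_validators: set = field(default_factory=set)

    def prepare_release(self) -> bool:
        # The old node votes "yes" if it can stop signing at the hand-off point.
        return True

    def prepare_accept(self) -> bool:
        # The new node votes "yes" only once it has caught up with the network.
        return self.synced


def hand_off(old: Node, new: Node) -> bool:
    """Move all validator duties from `old` to `new`; return True on commit."""
    # Phase 1 (prepare): both sides must vote yes; nothing has changed yet.
    if not (new.prepare_accept() and old.prepare_release()):
        return False  # abort: the old node keeps its duties

    # Phase 2 (commit): deactivate on the old node *before* activating on
    # the new one, so no key is ever active in two places at once.
    validators = set(old.active_validators)
    old.active_validators.clear()
    new.active_validators.update(validators)
    return True
```

Calling `hand_off(old, new)` while `new.synced` is still `False` leaves the old node untouched, which is also what makes the quick roll-back in step 6 plausible: the previous version keeps running until the commit phase succeeds.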

arnetheduck commented 4 years ago

This seems to introduce a lot of complexity for a relatively small benefit if we implement it fully, and a number of partial solutions already exist. Generally, forwards and backwards compatibility is needed for only a small part of the software: mainly the slashing protection database and the validator keys.

Most upgrades don't touch the database format and don't need a new sync, thus this complicated infrastructure is only occasionally needed.

It's relatively easy to start a new node, so what's really needed is a way to transfer keys & slashing protection reliably - the rest can already be solved with existing package managers (like nix), or simply by keeping installations in separate folders.
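As a rough illustration of what "transfer reliably" could mean at the file level, here is a minimal Python sketch that copies a file into a new installation folder, verifies the copy by checksum, and only then moves it into place. The function names and the idea of copying plain files are assumptions made for this example; they are not how Nimbus itself moves keys or slashing protection data.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_file(src: Path, dst: Path) -> str:
    """Copy src to dst via a verified temporary file; return the digest."""
    expected = sha256_of(src)
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Write next to the destination so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dst.parent)
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        shutil.copyfile(src, tmp)
        if sha256_of(tmp) != expected:
            raise IOError(f"checksum mismatch while copying {src}")
        os.replace(tmp, dst)  # atomic rename on the same filesystem
    finally:
        if tmp.exists():
            tmp.unlink()  # clean up only if the rename did not happen
    return expected
```

A copy like this is only meaningful if the old node has already stopped signing; otherwise the slashing protection data could be stale by the time the new node starts.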

Writing an orchestrator of this sort can easily balloon into a fully fledged system monitor à la systemd or similar offerings, which is way out of scope for the project.

mratsim commented 2 years ago

I think with over 1 year of production hindsight, the most important thing was for Nimbus to restart fast enough and for Nimbus to display the next validator duty time so that users can choose a safe window.
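To make the "safe window" idea concrete, here is a small Python sketch that estimates whether a restart fits before the next known duty. It assumes the mainnet slot time of 12 seconds; the function names, the `restart_seconds` estimate and the one-slot safety margin are assumptions for illustration and do not reflect how Nimbus computes or displays this.

```python
from typing import List, Optional

SECONDS_PER_SLOT = 12  # mainnet slot duration


def seconds_until_next_duty(current_slot: int,
                            duty_slots: List[int]) -> Optional[int]:
    """Seconds until the earliest duty after current_slot, or None if none."""
    upcoming = [slot for slot in duty_slots if slot > current_slot]
    if not upcoming:
        return None
    return (min(upcoming) - current_slot) * SECONDS_PER_SLOT


def is_safe_to_restart(current_slot: int,
                       duty_slots: List[int],
                       restart_seconds: int,
                       margin_seconds: int = SECONDS_PER_SLOT) -> bool:
    """True if the restart plus a safety margin fits before the next duty."""
    wait = seconds_until_next_duty(current_slot, duty_slots)
    if wait is None:
        # Nothing scheduled in the known horizon (e.g. no validators attached).
        return True
    return wait >= restart_seconds + margin_seconds
```

For example, at slot 100 with the next duty at slot 105, there are 60 seconds available, so a 30-second restart fits with the default margin while a 55-second restart does not.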

The only part left is dealing with sync committee duties (#3281).

zah commented 2 years ago

Development of this feature is still part of our GUI-only user experience roadmap. It's about having a simple command that the user can execute without worrying about missed attestations.

tersec commented 2 years ago

https://github.com/status-im/infra-role-beacon-node-linux/commit/558b4069 provides an example of how to do this.