rocket-pool / smartnode

The CLI package for Rocket Pool smart nodes.
GNU General Public License v3.0

Add option to switch eth2 clients with slashing DB transfer #76

Open torfbolt opened 3 years ago

torfbolt commented 3 years ago

Feature request:

Now that EIP-3076 is being implemented in the eth2 clients, it would be useful to have key and slashing-DB migration to another validator client available through the rocketpool interface.
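For context, EIP-3076 defines a client-agnostic JSON interchange file for slashing protection data, roughly shaped like this (all values here are abbreviated/illustrative, not real roots or pubkeys):

```json
{
  "metadata": {
    "interchange_format_version": "5",
    "genesis_validators_root": "0x0470..."
  },
  "data": [
    {
      "pubkey": "0xb845...",
      "signed_blocks": [
        { "slot": "81952", "signing_root": "0x4ff6..." }
      ],
      "signed_attestations": [
        { "source_epoch": "2290", "target_epoch": "3007", "signing_root": "0x587d..." }
      ]
    }
  ]
}
```

Each client keeps its own internal DB format, but they can all export to and import from this shared file.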

Why do we need this:

In case one of the supported clients has an issue, this would allow for a seamless and safe way for node operators to continue service.

moles1 commented 3 years ago

Yep, watching the client support for this feature closely :) It seems likely that both clients will provide a flag for the slashing DB path, in which case we will store it on disk under ~/.rocketpool/data/ and point the validator(s) to the same file. This way, they should pick up the slashing data if the client is switched.

torfbolt commented 3 years ago

IMO that approach will not work, as the clients use different DB formats internally. The only thing that is specified is the interchange format for import & export.

torfbolt commented 3 years ago

Thought about this again today. Would it be possible to automatically export the slashing DB whenever the validator has been stopped by `rocketpool service stop`? The export could be accompanied by metadata recording which client it came from. A later `rocketpool service start` command could then compare the currently configured eth2 client against this metadata and, if the client has changed, import the DB before starting the validator.
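The stop/start hook described above could look something like this sketch. To be clear, rocketpool has no such hook today; the paths are stand-ins, and the client's actual export command is stubbed out with a placeholder write:

```python
import json
from pathlib import Path

# Stand-in for ~/.rocketpool/data/ (hypothetical location, not the real layout)
DATA_DIR = Path("/tmp/rp-demo")
META_FILE = DATA_DIR / "slashing_export_meta.json"
EXPORT_FILE = DATA_DIR / "slashing_protection.json"

def export_on_stop(client: str) -> None:
    """On `service stop`: export the slashing DB and record which client produced it."""
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    # Stub: in reality this would invoke the client's EIP-3076 export command.
    EXPORT_FILE.write_text('{"metadata": {}, "data": []}')
    META_FILE.write_text(json.dumps({"client": client}))

def needs_import_on_start(new_client: str) -> bool:
    """On `service start`: import the interchange file only if the configured client changed."""
    if not META_FILE.exists() or not EXPORT_FILE.exists():
        return False
    prev = json.loads(META_FILE.read_text())["client"]
    return prev != new_client

export_on_stop("lighthouse")
print(needs_import_on_start("lighthouse"))  # False: same client, its native DB is still valid
print(needs_import_on_start("prysm"))       # True: client changed, run its import command first
```

The point of the metadata file is that the decision to import can be made purely from state on disk, without asking the (possibly dead) old container anything.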

kidkal commented 3 years ago

Hi @torfbolt, I believe slashing database import/export is quite niche functionality; I think it will hardly be used and won't be worth the effort to automate and maintain.

The automatic slashing database conversion will not be useful when:

It would be useful when:

Please have a look at the respective clients' documentation on slashing protection:

Commands can be run in a docker container with `docker exec <container_name> <command> <args>`. For example, you can drop into a shell with `docker exec -ti rocketpool_validator bash`, then perform the export and import to and from the `/data/validators` directory.

You can get further help on this from the Rocket Pool and client Discord servers.

torfbolt commented 3 years ago

I agree that this is not something we will need on a day-to-day basis. It is more of a risk-mitigation thing: if we ever get to the point where one of the clients has a serious bug and node operators need to switch clients (possibly under non-finality stress), there should be a plan for safe migration, and that plan should be tested in advance. Adding this functionality to the RP software would enable writing unit tests for it. It's not that I personally need help with this (I'm running service files instead of docker anyway); I see it as a resilience feature for the whole network that simplifies safe client switching for all node operators.

Maybe you are right and it isn't worth the effort. We could also just handle this with good documentation of the commands to run manually. But that still leaves the open point that a backup plan is only good if it is regularly tested.

I just took a quick look at the smartnode source code and it looks quite clearly laid out. Given the right docker commands, I would expect that this fits well into the architecture. I'll see if I can find some time in the coming weeks to hack together some code and test it.

kidkal commented 3 years ago

@torfbolt it would mainly be in the smartnode-install project, but yeah, have a go, it's good fun! The tricky part would be saving the last used client somewhere, since containers could be in an unknown state on shutdown.
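One way around the unknown-shutdown-state problem is to record the marker at *start* time rather than stop time, and write it atomically so a crash can never leave it half-written. A minimal sketch (the marker path is hypothetical):

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical marker location, not part of the real smartnode layout
MARKER = Path("/tmp/rp-demo-marker/last_client.json")

def record_client(client: str) -> None:
    """Record the running client at start time: write to a temp file, then
    atomically rename it over the marker, so a crash can't corrupt it."""
    MARKER.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=MARKER.parent)
    with os.fdopen(fd, "w") as f:
        json.dump({"client": client}, f)
    os.replace(tmp, MARKER)  # atomic rename on POSIX filesystems

def last_client() -> Optional[str]:
    """Return the last recorded client, or None if none was ever recorded."""
    if not MARKER.exists():
        return None
    return json.loads(MARKER.read_text())["client"]

record_client("teku")
print(last_client())  # teku
```

Since the marker is written when the validator comes up, it stays correct no matter how the container later dies.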

On client switching under non-finality, I am not convinced by this argument: when you fire up a new beacon node, it will take several hours to days to resync the chain anyway. In that time there will be no attestations and hence no slashing risk, and after syncing the new client will build up its own slashing protection data. So I find the ability to export and import the slashing database a bit redundant. As I said in the comment above, this functionality won't be useful until you have the ability to "hot swap" validator clients: that is, have multiple redundant beacon nodes running so that, when an issue is encountered with the currently running validator, another validator can be fired up immediately using the slashing protection database. So, let me know how you get on with this project!

On a more general note, I don't think Rocketpool goes into resiliency that much, e.g. running backup beacon or geth nodes, up to hardware backups etc. It is an exercise left to each node operator. I think this is the right way to do it, because people will have different views on how to do things and might even end up with a more resilient network than if everybody followed the same path... maybe?

torfbolt commented 3 years ago

With the new restart command option available for everyone running outside of docker, this is probably out of scope. Anyone who wants this feature can now write a shell script to do it, and in a shit-hits-the-fan scenario it is probably a lot safer to do any client switchover by hand anyway.