grandpa: extend protocol for session key registration and usage

andresilva commented 3 years ago

In order to avoid "benign" equivocations that are caused by operational errors (e.g. restoring an old database while losing the grandpa voter state could lead the authority to vote twice for the same round) we should introduce a more robust protocol for key registration and usage, thus making sure that session keys aren't reused in the same context.

A protocol is suggested here https://hackmd.io/@rgbPIkIdTwSICPuAq67Jbw/BkCOQ8CvP but should still undergo further formalization.

burdges commented 3 years ago

We think nodes could determine when they're fully synced using this, which benefits the relative time and approval assignments subprotocols. As discussed in https://github.com/w3f/polkadot-spec/pull/168 we'll need to expand upon the above document for this use case:

Initially, a validator Vlad starts up and begins syncing the relay chain. At start, Vlad loads its session secret keys and their certificates signed by the node's controller key. I presume the controller already registered this controller certificate bundle on-chain, but if not then tell me. We'll punt doing any runtime updates to a controller certificate bundle for another year.

We now have two back/full cert modes: automatic counter mode and manual --force-back-cert=[counter] mode. In both, Vlad takes no action unless they know secret keys for the controller certificate bundle registered on-chain, meaning they wait but also maybe they resume waiting. In automatic mode, Vlad waits longer until they observe "mostly sensible timestamps", and then determines the old counter from their back/full cert on chain, or sets counter=0 if we've no back/full cert registered on-chain.
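A minimal sketch of the counter selection described above, in Rust. The names `CertMode` and `old_counter` and the boolean inputs are hypothetical, and treating the value passed to --force-back-cert as the old counter (to be incremented afterwards) is an assumption, since the text does not say which it is.

```rust
/// How the previous counter is chosen. Illustrative names only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CertMode {
    /// Wait for "mostly sensible timestamps", then read the counter from chain.
    Automatic,
    /// Operator passed `--force-back-cert=[counter]` (assumed to be the old counter).
    Manual(u64),
}

/// Returns the old counter to build on, or `None` if Vlad must keep waiting.
fn old_counter(
    mode: CertMode,
    knows_registered_keys: bool,
    timestamps_look_sensible: bool,
    on_chain_counter: Option<u64>,
) -> Option<u64> {
    // In both modes, no action without the secret keys for the registered bundle.
    if !knows_registered_keys {
        return None;
    }
    match mode {
        CertMode::Manual(counter) => Some(counter),
        // No back/full cert on chain yet means counter = 0.
        CertMode::Automatic if timestamps_look_sensible => Some(on_chain_counter.unwrap_or(0)),
        CertMode::Automatic => None,
    }
}
```

Returning `None` models "keep waiting", so a caller can simply poll again as sync progresses.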

In both modes, Vlad creates a fresh tag and a fresh grandpa back/full cert containing tag, counter=counter+1, a timestamp, its controller public key, and its controller certificates bundle hash. Vlad signs this back/full cert with all their grandpa keys, so Ed25519, ECDSA secp256k1, and BLS, and most others too, so BABE sr25519, Sassafras JubJub Ring-VRF, etc. Vlad gossips the signed back/full cert to other validators.
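As a rough illustration of the cert contents listed above, here is a sketch in Rust. The struct name `BackCert`, the 32-byte field widths, the seconds-based timestamp, and the plain concatenation used for the signing payload are all assumptions; a real implementation would presumably use SCALE encoding and the actual key types, and signing with each session key is left out.

```rust
/// Hypothetical layout of a back/full cert; field widths are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
struct BackCert {
    tag: [u8; 32],
    counter: u64,
    timestamp_secs: u64,
    controller_public: [u8; 32],
    bundle_hash: [u8; 32],
}

impl BackCert {
    /// Builds the next cert: fresh tag, counter = old counter + 1.
    fn next(
        old_counter: u64,
        fresh_tag: [u8; 32],
        now_secs: u64,
        controller_public: [u8; 32],
        bundle_hash: [u8; 32],
    ) -> Self {
        BackCert {
            tag: fresh_tag,
            counter: old_counter + 1,
            timestamp_secs: now_secs,
            controller_public,
            bundle_hash,
        }
    }

    /// Bytes that every session key would sign (simple concatenation, not SCALE).
    fn signing_payload(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(32 + 8 + 8 + 32 + 32);
        out.extend_from_slice(&self.tag);
        out.extend_from_slice(&self.counter.to_le_bytes());
        out.extend_from_slice(&self.timestamp_secs.to_le_bytes());
        out.extend_from_slice(&self.controller_public);
        out.extend_from_slice(&self.bundle_hash);
        out
    }
}
```

The same payload would be signed once per key type, so the cert binds all session keys to one tag and counter.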

Any relay chain block producer should include a fresh back/full cert only if the tag changes, the counter increases, the timestamp is somewhat sensible, and if the controller certificates bundle hash matches that registered under its controller public key.
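The inclusion rule above could be sketched as a single predicate. This is only an illustration: `CertSummary`, `should_include`, and the `max_skew_secs` tolerance standing in for "somewhat sensible" are hypothetical, and comparing against the cert currently registered on-chain (with counter 0 when none exists) is an assumption.

```rust
/// Fields of a back/full cert that the inclusion check needs (hypothetical type).
struct CertSummary {
    tag: [u8; 32],
    counter: u64,
    timestamp_secs: u64,
    bundle_hash: [u8; 32],
}

/// Returns true if a block producer should include `fresh`.
/// `prev` is the cert currently on chain for this validator, if any.
fn should_include(
    prev: Option<&CertSummary>,
    fresh: &CertSummary,
    registered_bundle_hash: &[u8; 32],
    block_time_secs: u64,
    max_skew_secs: u64,
) -> bool {
    // With no earlier cert, any tag counts as changed and the baseline counter is 0.
    let tag_changed = prev.map_or(true, |p| p.tag != fresh.tag);
    let counter_increased = fresh.counter > prev.map_or(0, |p| p.counter);
    // "Somewhat sensible" is modelled as a bounded distance from the block time.
    let timestamp_sensible = fresh.timestamp_secs.abs_diff(block_time_secs) <= max_skew_secs;
    let bundle_matches = &fresh.bundle_hash == registered_bundle_hash;
    tag_changed && counter_increased && timestamp_sensible && bundle_matches
}
```

All four conditions must hold together; failing any one of them means the cert is simply not included.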

We could misjudge the chain sync slowing in automatic mode, due to the "mostly sensible timestamps" heuristic. We thus reissue back/full certs with larger counter values as those come in, except we reissue with exponential backoff. We permanently halt reissues with counter increases if any back/full cert with our tag ever gets finalized.
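One way the reissue timing might look, as a sketch: the `ReissueSchedule` type, the doubling factor, and the cap on the delay are assumptions, since the text only asks for exponential backoff and a permanent halt once a cert with our tag is finalized.

```rust
use std::time::Duration;

/// Hypothetical reissue timer: the delay doubles per attempt, up to `max`.
struct ReissueSchedule {
    base: Duration,
    max: Duration,
    attempts: u32,
    halted: bool,
}

impl ReissueSchedule {
    fn new(base: Duration, max: Duration) -> Self {
        ReissueSchedule { base, max, attempts: 0, halted: false }
    }

    /// Delay before the next reissue, or `None` once reissues are halted.
    fn next_delay(&mut self) -> Option<Duration> {
        if self.halted {
            return None;
        }
        // base * 2^attempts, saturating instead of overflowing, then capped.
        let factor = 2u32.saturating_pow(self.attempts);
        let delay = self.base.saturating_mul(factor).min(self.max);
        self.attempts = self.attempts.saturating_add(1);
        Some(delay)
    }

    /// Call when any back/full cert carrying our tag gets finalized.
    fn on_our_tag_finalized(&mut self) {
        self.halted = true;
    }
}
```

With a base of 10 seconds and a cap of 60 seconds, the delays would run 10, 20, 40, 60, 60, and so on, until the halt.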

At some point, Vlad witnesses live grandpa votes finalizing its back/full cert, so then Vlad knows it lags behind the chain head by less than the grandpa finality time, and by less than the time since it issued that back/full cert.
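Since both bounds hold at once, the tighter of the two is the one that matters. A tiny sketch, where `lag_upper_bound` is a hypothetical helper and the finality time is assumed to be known as a duration:

```rust
use std::time::Duration;

/// Strict upper bound on how far Vlad lags behind the chain head once it
/// has seen live votes finalizing its own back/full cert.
fn lag_upper_bound(finality_time: Duration, since_cert_issued: Duration) -> Duration {
    // The lag is below both values, hence below their minimum.
    finality_time.min(since_cert_issued)
}
```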

We address how various subsystems handle this information in automatic and manual mode:

Approval assignments always waits until GRANDPA finalizes our back/full cert.
GRANDPA, BABE/Sassafras and relative time both similarly wait in automatic mode. In fact, GRANDPA waits until three blocks after GRANDPA finalizes our back/full cert. Yet, both begin immediately using their system clock in manual mode. Actually manual mode exists primarily to save the chain from locked states when too many validators drop out, etc.
Anything slashable for equivocations like GRANDPA and BABE/Sassafras never signs any block after which our tag changes, which then avoids node operators being slashed for equivocations.
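The start-up gating for these subsystems could be sketched as follows. All names here are hypothetical, and the sketch assumes one reading of the text: that GRANDPA, BABE/Sassafras, and relative time all start immediately in manual mode, and that "three blocks after" is measured as a count of blocks since our cert was finalized.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Automatic,
    Manual,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Subsystem {
    ApprovalAssignments,
    Grandpa,
    BlockProduction, // BABE or Sassafras
    RelativeTime,
}

/// `blocks_since_cert_finalized` is `None` until GRANDPA finalizes our cert.
fn may_start(
    subsystem: Subsystem,
    mode: Mode,
    blocks_since_cert_finalized: Option<u32>,
) -> bool {
    match (subsystem, mode) {
        // Approval assignments wait for finality of our cert in both modes.
        (Subsystem::ApprovalAssignments, _) => blocks_since_cert_finalized.is_some(),
        // Assumption: manual mode lets the remaining subsystems start at once.
        (_, Mode::Manual) => true,
        // GRANDPA waits three further blocks in automatic mode.
        (Subsystem::Grandpa, Mode::Automatic) => {
            blocks_since_cert_finalized.map_or(false, |n| n >= 3)
        }
        (Subsystem::BlockProduction | Subsystem::RelativeTime, Mode::Automatic) => {
            blocks_since_cert_finalized.is_some()
        }
    }
}
```

The rule that slashable subsystems stop signing once our tag changes is not modelled here; it would need the on-chain tag as an extra input.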

We should certify long-term transport keys from the controller, but they should implicitly provide their own back/full cert at the transport layer when opening connections, so they need not participate in this on-chain system really. I've no idea how much this exists but we can ask @tomaka

We have two niggling questions remaining:

We could make tag be a public key for a fresh sr25519 or ed25519 key that never leaves the node. Is this useful somewhere?

burdges commented 3 years ago

I think this supersedes paritytech/polkadot-sdk#93

Polkadot-Forum commented 1 year ago

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/ux-of-distributing-multiple-binaries-take-2/2854/2

paritytech / substrate

grandpa: extend protocol for session key registration and usage #7398