Open andresilva opened 3 years ago
We think nodes could determine when they're fully synced using this, which benefits the relative time and approval assignments subprotocols. As discussed in https://github.com/w3f/polkadot-spec/pull/168 we'll need to expand upon the above document for this use case:
Initially, a validator Vlad starts up and begins syncing the relay chain. At start, Vlad loads its session secret keys and their certificates signed by the node's controller key. I presume the controller already registered this controller certificate bundle on-chain, but if not then tell me. We'll punt doing any runtime updates to a controller certificate bundle for another year.
We now have two back/full cert modes: automatic counter mode and manual `--force-back-cert=[counter]` mode. In both, Vlad takes no action unless they know the secret keys for the controller certificate bundle registered on-chain, meaning they wait, though they may also resume waiting later. In automatic mode, Vlad waits longer until they observe "mostly sensible timestamps", and then determines the old `counter` from their back/full cert on chain, or sets `counter=0` if we've no back/full cert registered on-chain.
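The counter selection above can be sketched roughly as follows. The names `CounterMode` and `old_counter` are hypothetical, and I'm assuming the value given to `--force-back-cert=[counter]` plays the role of the old counter that later gets incremented; if it is meant to be the new counter directly, the manual arm changes accordingly.

```rust
/// How the old counter is chosen (hypothetical type mirroring the two modes).
pub enum CounterMode {
    /// Automatic: use the counter of our back/full cert on chain, if any.
    Automatic,
    /// Manual: the value passed via `--force-back-cert=[counter]`
    /// (assumed here to be the old counter, not the new one).
    Forced(u64),
}

/// Returns the old counter; the next cert would carry this value plus one.
/// `on_chain` is the counter of our registered back/full cert, or `None`
/// if no back/full cert is registered on-chain.
pub fn old_counter(mode: &CounterMode, on_chain: Option<u64>) -> u64 {
    match mode {
        CounterMode::Automatic => on_chain.unwrap_or(0),
        CounterMode::Forced(counter) => *counter,
    }
}
```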
In both modes, Vlad creates a fresh `tag` and a fresh grandpa back/full cert containing `tag`, `counter=counter+1`, a timestamp, its controller public key, and its controller certificates bundle hash. Vlad signs this back/full cert with all their grandpa keys, so Ed25519, ECDSA secp256k1, and BLS, and most others too, so BABE sr25519, Sassafras JubJub Ring-VRF, etc. Vlad gossips the signed back/full cert to other validators.
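As a rough illustration of the cert contents listed above, here is a minimal sketch. The struct name `BackCert`, the 32-byte field widths, the seconds unit for the timestamp, and the fixed-width little-endian payload encoding are all assumptions for illustration, not anything specified so far. Signing itself is left out, since each key type would sign the same payload with its own scheme.

```rust
/// Hypothetical layout of a back/full cert; field widths are assumptions.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct BackCert {
    pub tag: [u8; 32],
    pub counter: u64,
    pub timestamp: u64, // assumed to be seconds
    pub controller: [u8; 32],
    pub bundle_hash: [u8; 32],
}

impl BackCert {
    /// Builds the next cert with `counter = old_counter + 1`.
    /// Returns `None` if the counter would overflow.
    pub fn next(
        old_counter: u64,
        tag: [u8; 32],
        timestamp: u64,
        controller: [u8; 32],
        bundle_hash: [u8; 32],
    ) -> Option<Self> {
        Some(BackCert {
            tag,
            counter: old_counter.checked_add(1)?,
            timestamp,
            controller,
            bundle_hash,
        })
    }

    /// Bytes that every session key would sign (assumed encoding).
    pub fn signing_payload(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(32 + 8 + 8 + 32 + 32);
        out.extend_from_slice(&self.tag);
        out.extend_from_slice(&self.counter.to_le_bytes());
        out.extend_from_slice(&self.timestamp.to_le_bytes());
        out.extend_from_slice(&self.controller);
        out.extend_from_slice(&self.bundle_hash);
        out
    }
}
```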
Any relay chain block producer should include a fresh back/full cert only if the `tag` changes, the `counter` increases, the timestamp is somewhat sensible, and if the controller certificates bundle hash matches that registered under its controller public key.
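The inclusion rule for block producers might look something like the sketch below. `CertFields`, `OnChain`, and `should_include` are hypothetical names, and "somewhat sensible" is modelled here as a symmetric window of `max_skew` around the block producer's own clock, which is only one possible reading of that phrase.

```rust
/// The cert fields the inclusion rule looks at (hypothetical layout).
pub struct CertFields {
    pub tag: [u8; 32],
    pub counter: u64,
    pub timestamp: u64,
    pub bundle_hash: [u8; 32],
}

/// What is registered on-chain under the cert's controller public key.
pub struct OnChain {
    /// The previously included back/full cert, if any.
    pub previous: Option<CertFields>,
    /// The registered controller certificates bundle hash.
    pub bundle_hash: [u8; 32],
}

/// Returns true if a block producer should include `new`.
/// `now` and `max_skew` share the timestamp's unit (an assumption).
pub fn should_include(new: &CertFields, chain: &OnChain, now: u64, max_skew: u64) -> bool {
    // The tag must change and the counter must increase relative to the old cert.
    let fresh = match &chain.previous {
        Some(old) => new.tag != old.tag && new.counter > old.counter,
        None => true,
    };
    // "Somewhat sensible" timestamp, modelled as a window around `now`.
    let sensible = new.timestamp.abs_diff(now) <= max_skew;
    fresh && sensible && new.bundle_hash == chain.bundle_hash
}
```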
We could misjudge the chain sync slowing in automatic mode, due to the "mostly sensible timestamps" heuristic. We thus reissue back/full certs with larger `counter` values as those come in, except we reissue with exponential backoff. We permanently halt reissues with `counter` increases if any back/full cert with our `tag` ever gets finalized.
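A small sketch of the reissue schedule described above, covering only the exponential backoff and the permanent halt. The `Reissuer` type, the doubling factor, and the delay cap are assumptions on my part; nothing above fixes the backoff base or a maximum delay.

```rust
/// Tracks when to reissue a back/full cert (hypothetical helper).
pub struct Reissuer {
    our_tag: [u8; 32],
    next_delay_secs: u64,
    max_delay_secs: u64,
    halted: bool,
}

impl Reissuer {
    pub fn new(our_tag: [u8; 32], initial_delay_secs: u64, max_delay_secs: u64) -> Self {
        Reissuer {
            our_tag,
            next_delay_secs: initial_delay_secs.min(max_delay_secs),
            max_delay_secs,
            halted: false,
        }
    }

    /// Call for every finalized back/full cert; halts permanently on our tag.
    pub fn on_finalized(&mut self, tag: &[u8; 32]) {
        if *tag == self.our_tag {
            self.halted = true;
        }
    }

    /// Delay before the next reissue, doubling each time up to the cap.
    /// Returns `None` once reissues have been halted.
    pub fn next_reissue_delay(&mut self) -> Option<u64> {
        if self.halted {
            return None;
        }
        let delay = self.next_delay_secs;
        self.next_delay_secs = delay.saturating_mul(2).min(self.max_delay_secs);
        Some(delay)
    }
}
```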
At some point, Vlad witnesses live grandpa votes finalizing its back/full cert, so then Vlad knows it lags behind the chain head by less than the grandpa finality time, and by less than the time since it issued that back/full cert.
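Since both bounds hold at once, the lag is below the smaller of the two. A tiny sketch, where the function name `lag_upper_bound` is hypothetical and both durations are assumed to be measured on Vlad's local clock:

```rust
use std::time::Duration;

/// Upper bound on how far Vlad lags behind the chain head once it sees
/// live grandpa votes finalizing its own back/full cert.
pub fn lag_upper_bound(finality_time: Duration, since_issue: Duration) -> Duration {
    // Both bounds apply, so the tighter (smaller) one wins.
    finality_time.min(since_issue)
}
```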
We address how various subsystems handle this information in automatic and manual mode:
We should certify long-term transport keys from the controller, but they should implicitly provide their own back/full cert at the transport layer when opening connections, so they need not really participate in this on-chain system. I've no idea how much of this exists but we can ask @tomaka.
We have two niggling questions remaining:
Should `tag` be a public key for a fresh sr25519 or ed25519 key that never leaves the node? Is this useful somewhere?
I think this supersedes paritytech/polkadot-sdk#93
This issue has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/ux-of-distributing-multiple-binaries-take-2/2854/2
In order to avoid "benign" equivocations that are caused by operational errors (e.g. restoring an old database while losing the grandpa voter state could lead the authority to vote twice for the same round) we should introduce a more robust protocol for key registration and usage, thus making sure that session keys aren't reused in the same context.
A protocol is suggested here https://hackmd.io/@rgbPIkIdTwSICPuAq67Jbw/BkCOQ8CvP but should still undergo further formalization.