Open avahowell opened 1 year ago
I think it makes sense to conceptually separate two distinct but related changes here:

1. Making epochs dynamic, so that epoch transitions can be triggered by the state machine itself rather than occurring only on a fixed block schedule.
2. Making economic mechanisms depend only on (consensus) time, rather than on epoch or block counts.
On (1), the original reason to have epochs in the first place was so that we could ensure that the validator set only changes at defined epoch boundaries, rather than in any arbitrary block. This is necessary so that when we have the validator set do threshold cryptography, we can ensure that the set of participants in the MPC isn't constantly changing, and that when it does change, there's a place to do epoch-specific setups, like a DKG.
However, what we've actually built doesn't achieve this goal: although "normal" validator updates happen only at epoch boundaries, validators can be slashed in any block, and so although we have an epoch system, we can also have validator set changes outside of epoch boundaries. This means that we will have no way to, e.g., regenerate a threshold key if validators are slashed, and we have to do validator update work at the end of every block.
If our goal is to ensure that the validator set only changes at defined epoch boundaries, slashing forces us to have dynamic epochs, because we need to remove validators immediately after observing misbehavior.
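Concretely, the static check at issue amounts to something like the following. This is a hedged sketch with simplified, hypothetical names (not the exact `Epoch` helper in the codebase), but it illustrates why a slashing event can never end an epoch early under the current design:

```rust
/// Simplified sketch (hypothetical names) of the current static epoch
/// check: an epoch ends purely as a function of block height and a fixed
/// epoch_duration in blocks. Nothing that happens inside a block -- such
/// as a slashing event -- can end an epoch early.
fn is_epoch_end(height: u64, epoch_duration: u64) -> bool {
    height > 0 && height % epoch_duration == 0
}

fn main() {
    let epoch_duration = 719; // arbitrary example value
    assert!(is_epoch_end(719, epoch_duration));
    assert!(!is_epoch_end(720, epoch_duration));
    println!("epoch ends at height 719: {}", is_epoch_end(719, epoch_duration));
}
```

Because the check depends only on height, removing a slashed validator immediately requires ending the epoch at an arbitrary height, which this design cannot express.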
Once we've reached that conclusion, though, we're confronted with the problem of how economic mechanisms should handle epochs of varying length. (2) is a solution to that problem -- making all economic mechanisms dependent only on (consensus) time means we can make them completely independent not just of the epoch length but also of the block interval. In other words, we unlock not only dynamic epochs but dynamic block times, letting us skip empty blocks.
[Aside: on its own, this isn't quite enough to let us skip empty blocks -- we'd also have to shift from the ABCI 1.0 "delayed execution" model, where the apphash after execution of block `h` appears in the header of block `h+1`, to the ABCI 2.0 "immediate execution" model, where the apphash after execution of block `h` appears in the header of block `h` itself. The reason is that, in the delayed execution model, skipping empty blocks is only possible if the apphash doesn't change after executing an empty block. Otherwise, Tendermint needs to produce a new block to record the changed apphash, which creates a new apphash, which needs a new block, and so on. And, in Penumbra, the apphash will change after execution of an empty block, because the SCT root will change as the SCT block tree is sealed. But it's still worth aiming for and hoping to unlock as part of a migration to ABCI 2.0.]
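To make the feedback loop in the aside concrete, here is a toy model (all names hypothetical, not Penumbra or Tendermint APIs): under delayed execution, empty blocks can be skipped only if executing an empty block leaves the apphash at a fixed point, which Penumbra's SCT sealing prevents.

```rust
/// Toy model of the delayed-execution feedback loop. In Penumbra,
/// executing an empty block is NOT a no-op on the apphash, because
/// sealing the SCT block tree changes the SCT root; we model that here
/// as an arbitrary deterministic perturbation of the hash.
fn execute_empty_block(apphash: u64) -> u64 {
    apphash.wrapping_mul(31).wrapping_add(7)
}

/// Under delayed execution, block h+1 must record the apphash produced by
/// executing block h, so empty blocks can be skipped only if the apphash
/// is a fixed point of empty-block execution.
fn can_skip_empty_blocks(apphash: u64) -> bool {
    execute_empty_block(apphash) == apphash
}

fn main() {
    // The apphash is not a fixed point, so Tendermint must keep producing
    // blocks just to record the new apphash, which changes it again, etc.
    assert!(!can_skip_empty_blocks(42));
    println!("can skip empty blocks: {}", can_skip_empty_blocks(42));
}
```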
Finally, a key thing to note on (2) is that we're not in a situation where we have a bunch of working economic mechanisms already and we need to change them to be time-based rather than block-height based. Instead, we're in a situation where we have one or two stub mechanisms that only work well enough for an MVP and that we need to replace anyways. For instance, our staking rates are currently hardcoded (https://github.com/penumbra-zone/penumbra/issues/1481), and we should instead be having them pull from a distributions component (https://github.com/penumbra-zone/penumbra/issues/1934). So I think it makes sense to build the correct versions as time-based ones.
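As a hedged illustration of what "time-based" means here (hypothetical names; not the actual distributions component API), a mechanism that accrues value per second of consensus time produces the same result regardless of how many blocks or epochs elapsed in between:

```rust
/// Hypothetical sketch of a time-based economic mechanism: rewards accrue
/// per second of consensus time, so the result is independent of both the
/// epoch length and the block interval.

/// Reward accrued between two consensus timestamps (seconds), at a fixed
/// per-second rate expressed in base units.
fn accrued_reward(prev_timestamp: u64, current_timestamp: u64, rate_per_second: u64) -> u64 {
    let elapsed = current_timestamp.saturating_sub(prev_timestamp);
    elapsed * rate_per_second
}

fn main() {
    // Whether 60 seconds of consensus time elapsed over 12 blocks or over
    // a single block (because empty blocks were skipped), the accrued
    // reward is identical.
    let reward = accrued_reward(1_000, 1_060, 5);
    assert_eq!(reward, 300);
    println!("accrued: {reward}");
}
```

This is exactly the property that makes such a mechanism compatible with both dynamic epochs and dynamic block times.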
I think it makes sense to split this work into three phases:

**Phase 1**: Move epoch handling into the `Component` trait, so that components aren't individually responsible for triggering epoch-related behavior. To do this, we'll want to:

- Move `CompactBlock` construction out of the shielded pool component and into the top-level `App`. We need to do this because the SCT needs to be sealed differently depending on whether we're ending a block or a block and an epoch, and the resulting SCT roots need to be included in the `CompactBlock`. Since we want the `App` to be responsible for triggering epoch-related behavior, not any component, we can't have the SCT sealing and `CompactBlock` construction be part of a component. This will also remove the restriction that the shielded pool always executes last (which in retrospect is a sign that it was doing things it shouldn't be).
- Add a `Component::end_epoch` trait method, and refactor existing `end_epoch` component code into it.
- Have the `App` trigger `end_epoch` based on the existing static scheduling.

**Phase 2**: Rework the `Epoch` data modeling and the APIs that depend on it. Currently, this holds an `index` and a `duration`:

- Should we change it to hold just an `index`, or an `index`, `start_height`, and `end_height`?
- Should we index epochs in the state as `"chain/epoch_by_height/{height:010}" -> epoch_index`, using 0-padding so that the lex order of the keys we use in iteration is our numeric ordering?
- What RPC methods do we need to allow clients to query for current or historical epochs, now that they can't compute them locally? Replace uses of `Epoch::from_height` with these new APIs.
- Change the `MockClient` and the Rust view server to check whether an epoch root is present in the compact block, and use its presence or absence to determine whether to trigger an epoch transition in their copy of the SCT, now that epochs are dynamic.

**Phase 3**: Add the block timestamp check described below, expressing `max_clock_drift` in seconds. Make `max_clock_drift = 0` mean the check is disabled, and make that the `Default` behavior. This way, we can continue to run state machine tests that are completely independent of local wall-clock time, with that one single check disabled, and test that check in isolation.
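A minimal sketch of what the `Component::end_epoch` hook and the 0-padded `epoch_by_height` key might look like. These are hypothetical, simplified signatures (the real Penumbra component trait is async and threads state through), shown only to illustrate the shape of the proposal:

```rust
/// Hypothetical, simplified sketch of epoch handling moved into the
/// Component trait. Illustrative only; the real trait is async.
#[allow(dead_code)]
trait Component {
    /// Called at the end of every block.
    fn end_block(&mut self, height: u64);

    /// Called by the App (not by individual components) when it decides
    /// that the current block also ends an epoch.
    fn end_epoch(&mut self, epoch_index: u64);
}

/// 0-padded state key, so that lexicographic iteration order over keys
/// matches numeric height order.
fn epoch_by_height_key(height: u64) -> String {
    format!("chain/epoch_by_height/{height:010}")
}

fn main() {
    assert_eq!(epoch_by_height_key(42), "chain/epoch_by_height/0000000042");
    // Lexicographic order agrees with numeric order thanks to the padding:
    assert!(epoch_by_height_key(99) < epoch_by_height_key(100));
    println!("{}", epoch_by_height_key(42));
}
```

Without the 0-padding, `"chain/epoch_by_height/100"` would sort before `"chain/epoch_by_height/99"` in key iteration, which is the bug the `{height:010}` format avoids.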
**Is your feature request related to a problem? Please describe.**

Currently, the epoch system in Penumbra's state machine is static: we have an `is_epoch_end` method on `chain.Epoch`, which tracks the epoch cycle using the configured static `epoch_duration` (in blocks) and an index. It's up to components (such as the staking component) to design their own state transition logic around this simple helper function, which assumes that epochs have a static length in blocks. There is no global (i.e., at the `Component` level) notion of the current epoch, or of the current position within the epoch. In practice, this leads to quite a bit of complexity and challenges, such as handling mid-epoch slashing events. Currently, all of the mechanisms in Penumbra assume static-length epochs and handle this complexity on a case-by-case basis.

**Describe the solution you'd like**
Implement dynamic epochs that can be cleanly interrupted by various components of the state machine. For this change to preserve the current mechanism design of Penumbra, we also need to rework the mechanisms (such as the staking rate) to operate based on clock time rather than on the index within the epoch. This second piece also allows us to safely use the Tendermint `skip-empty-blocks` flag, improving performance.

Using wall-clock time for mechanisms has its own drawbacks. There are potential concerns around manipulation of timestamps (the source of truth for time-based epochs) by malicious (1/3 + 1) quorums of validators (see https://github.com/cosmos/cosmos-sdk/issues/5411 and https://github.com/tendermint/tendermint/issues/2653 for deeper discussion of this issue). We can avoid the bulk of the concern (h/t @zmanian) by adding a synchrony assumption: Penumbra nodes should reject blocks whose timestamp differs by more than ~1 hour from their own. This aligns with the "Node Based Rejection" approach described here.
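A minimal sketch of that node-based rejection check under the synchrony assumption (hypothetical names; the `max_clock_drift = 0` disabling behavior mirrors the `Default` proposed above, not an existing API):

```rust
/// Hypothetical node-side timestamp check: reject proposed blocks whose
/// timestamp differs from our local clock by more than max_clock_drift
/// seconds. max_clock_drift == 0 disables the check entirely (the
/// proposed Default), so deterministic state-machine tests remain
/// independent of wall-clock time.
fn accept_block_timestamp(block_ts: u64, local_ts: u64, max_clock_drift: u64) -> bool {
    if max_clock_drift == 0 {
        return true; // check disabled
    }
    block_ts.abs_diff(local_ts) <= max_clock_drift
}

fn main() {
    let local = 1_700_000_000u64; // example local unix timestamp
    let one_hour = 3_600;
    // Within drift: accepted.
    assert!(accept_block_timestamp(local + 60, local, one_hour));
    // More than ~1hr ahead: rejected.
    assert!(!accept_block_timestamp(local + 4_000, local, one_hour));
    // Drift of 0 disables the check (Default behavior for tests).
    assert!(accept_block_timestamp(local + 4_000, local, 0));
    println!("checks passed");
}
```

Note the check is symmetric (`abs_diff`), so blocks far in the past are rejected the same way as blocks far in the future.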