penumbra-zone / penumbra

Penumbra is a fully private proof-of-stake network and decentralized exchange for the Cosmos ecosystem.
https://penumbra.zone
Apache License 2.0
381 stars 295 forks source link

Tracking Issue: Dynamic Epochs #2070

Open avahowell opened 1 year ago

avahowell commented 1 year ago

Is your feature request related to a problem? Please describe.

Currently, the epoch system in Penumbra's state machine is static: we have an is_epoch_end method on chain.Epoch which tracks the epoch cycle using the configured static epoch duration (in blocks) and index. It's up to components (such as the staking component) to design their own state transition logic around this simple helper function, which assumes that epochs have a static length in blocks. There is no global (i.e., at the Component level) notion of the current epoch, or current position in the epoch. In practice, this leads to quite of bit of complexity and challenges, such as for handling mid-epoch slashing events. Currently all of the mechanisms in Penumbra assume static-length epochs and handle this complexity on a case-by-case basis.

Describe the solution you'd like

Implement dynamic epochs that can be cleanly interrupted by various components of the state machine. For this change to preserve the current mechanism design of Penumbra, we need to also rework the mechanisms (such as staking rate) to operate based on clock time rather than index in the epoch. This second piece also allows us to safely use the Tendermint skip-empty-blocks flag, improving performance.

Using wall clock time for mechanisms has its own drawbacks. There are potential concerns around manipulation of timestamps (the source of truth for time-based epochs) by malicious (1/3 + 1) quorums of validators (see https://github.com/cosmos/cosmos-sdk/issues/5411 and https://github.com/tendermint/tendermint/issues/2653 for deeper discussion of this issue). We can avoid the bulk of concern (h/t @zmanian) by adding a synchrony assumption: Penumbra nodes should reject blocks whose timestamp is >~1hr different than their own. This aligns with the Node Based Rejection approach described here.

hdevalence commented 1 year ago

I think it makes sense to conceptually separate two distinct but related changes here:

  1. Changing from static to dynamic epochs;
  2. Making Penumbra's economic mechanisms, like staking rates, time-based.

On (1), the original reason to have epochs in the first place was so that we could ensure that the validator set only changes at defined epoch boundaries, rather than in any arbitrary block. This is necessary so that when we have the validator set do threshold cryptography, we can ensure that the set of participants in the MPC isn't constantly changing, and that when it does change, there's a place to do epoch-specific setups, like a DKG.

However, what we've actually built doesn't achieve this goal: although "normal" validator updates happen only at epoch boundaries, validators can be slashed in any block, and so although we have an epoch system, we can also have validator set changes outside of epoch boundaries. This means that we will have no way to, e.g., regenerate a threshold key if validators are slashed, and we have to do validator update work at the end of every block.

If our goal is to ensure that the validator set only changes at defined epoch boundaries, slashing forces us to have dynamic epochs, because we need to remove validators immediately after observing misbehavior.

Once we've reached that conclusion, though, we're confronted with the problem of how economic mechanisms should handle epochs of varying length. (2) is a solution to that problem -- making all economic mechanisms dependent only on (consensus) time means we can make them completely independent not just of the epoch length but also of the block interval. In other words, we unlock not only dynamic epochs but dynamic block times, letting us skip empty blocks.

[Aside: on its own, this isn't quite enough to let us skip empty blocks -- we'd also have to shift from the ABCI 1.0 "delayed execution" model where the apphash after execution of block h appears in the header of block h+1 to the ABCI 2.0 "immediate execution" model where the apphash after execution of block h appears in the header of block h. The reason is that, in the delayed execution model, skipping empty blocks is only possible if the apphash doesn't change after executing an empty block. Otherwise, Tendermint needs to produce a new block to record the changed apphash, which creates a new apphash, which needs a new block, ... and so on. And, in Penumbra, the apphash will change after execution of an empty block, because the SCT root will change as the SCT block tree is sealed. But it's still worth aiming for and hoping to unlock as part of a migration to ABCI 2.0.]

Finally, a key thing to note on (2) is that we're not in a situation where we have a bunch of working economic mechanisms already and we need to change them to be time-based rather than block-height based. Instead, we're in a situation where we have one or two stub mechanisms that only work well enough for an MVP and that we need to replace anyways. For instance, our staking rates are currently hardcoded (https://github.com/penumbra-zone/penumbra/issues/1481), and we should instead be having them pull from a distributions component (https://github.com/penumbra-zone/penumbra/issues/1934). So I think it makes sense to build the correct versions as time-based ones.

I think it makes sense to split this work into three phases: