BABE skipped epochs - Githubissues

andresilva commented 1 year ago

Whenever there is a period of inactivity (i.e. no blocks being produced) for more than one epoch BABE is broken, since we announce epochs in advance and by the time the inactivity is resolved we will be in a future epoch which was unnanounced (i.e. no authorities and randomness available). The security model of BABE is also broken in this scenario, but for a long time in the Substrate implementation at least, this would completely brick the chain and make it impossible to author new blocks. As of https://github.com/paritytech/substrate/pull/11727 this was changed, such that empty epochs can be skipped over, i.e. after a period of inactivity the runtime will appropriately update the epoch index based on the given slot. We are still left with the problem of what epoch data to use and the natural thing to do is to re-use the data for the latest announced epoch. As an example, assuming the last block was authored on a slot at epoch 3, the data for epoch 4 had already been announced. Given a period of inactivity spanning 3 epochs, a block is then authored with a slot belonging to epoch 6. In order to author this block, the client should re-use the data that was previously published for epoch 4.

andresilva commented 1 year ago

This problem is also described by the Ouroboros team in this paper in chapter 2. They describe a manual mitigation which is the same as we have employed in early 2020 on Kusama (we never had to do it on Polkadot mainnet). They do not provide any description of an automatic mitigation method though.

Noc2 commented 1 year ago

Thanks a lot for creating the issue! We will look into it as soon as possible.

hndnklnc commented 1 year ago

Do we need a solution for only one empty epoch or more than one sequential epochs?

tomaka commented 1 year ago

Does the first header after a period of inactivity contain an epoch change digest? My understanding is yes.

For example, let's say that we're in the middle of epoch 3, and all of a sudden all the validators die. Two weeks later, they come back online. Is it correct that the first authored block is in epoch 9 (or whatever the number is) and announces epoch 10 (or whatever the number is)? And epoch 9 uses the authorities and randomness of epoch 4?

andresilva commented 1 year ago

@tomaka Everything you said is correct.

@hndnklnc Yes, we need to be able to recover from any kind of disaster scenario. The most likely scenario for this to happen is due to a client bug that prevents block production, and e.g. with epochs on Polkadot being 6 hours, it's not unlikely that a fix would take longer than this time to be developed and rolled out.

Not sure if it's obvious but this issue is about updating the spec to reflect the solution that was already implemented and deployed. AFAIK the current SASSAFRAS implementation is also employing the same solution, so if there is a better solution to handle this problem I suggest we use it there.

hndnklnc commented 1 year ago

I think the current solution makes sense as far as I understand. I just want to verify this: Assume that epochs from epoch e+1 to e+n are skipped. The validators in epoch e + n +1 use the randomness from epoch e-1, right?

andresilva commented 1 year ago

I think we might be using slightly different definitions (i.e. randomness collection vs announcement), let me expand on your example. If epochs E+1 to E+n are skipped then the last block was authored at epoch E, at which point the randomness to be used in E+1 had already been announced. When coming back online, the validators at epoch E+n+1 will use the randomness that was meant to be used in epoch E+1 (which was collected during epoch E-1).

bhargavbh commented 1 year ago

Hi @andresilva thanks for pointing this out. I created a short PR addressing it here https://github.com/w3f/polkadot-spec/pull/685, can you please take a look.

tomaka commented 1 year ago

For what it's worth, the approach I'm going to take in smoldot is to completely remove the concept of epoch_index from the client side. The transcript instead contains (slot_of_block - first_slot_ever) / slots_per_epoch.

It's not a bad thing that on the runtime side the epoch index is still tracked in case it is needed in the future, but on the client side there really isn't any need for this concept anymore.

IMO the modification to the spec should follow a similar approach, because adding a little addendum somewhere creates confusion more than anything else.

tomaka commented 1 year ago

Thinking about it more, the interface between the runtime and the client is part of the spec as well and returns an epoch index. It feels weird to not mention that the epoch index returned by the runtime isn't at least partially tied to the actual epoch index calculated in the transcript.

So basically I'm opinion-less.

andresilva commented 1 year ago

@tomaka I agree that as it stands the epoch index is superflous since it can be trivially calculated. I think this is mainly due to the fact that right now we are relying on unix time to calculate the current slot rather than the "median algorithm". Also for the reason you mentioned I think we should keep the epoch index mention in the spec (regardless of whether client implementations actually need to keep track of it or just calculate it on-the-fly).

andresilva commented 1 year ago

Closing after #685.

tomaka commented 1 year ago

For what it's worth, the approach I'm going to take in smoldot is to completely remove the concept of epoch_index from the client side. The transcript instead contains (slot_of_block - first_slot_ever) / slots_per_epoch.

For future reference, tracking only the "first slot ever" has a drawback, which is that it's not possible to detect when a block contains an epoch change when it's not supposed to.

This can be solved by tracking both the first slot ever and also the starting slot of the current epoch. But in the end I've kept the concept of epoch_index in smoldot, as it didn't require a lot of code changes.

w3f / polkadot-spec

BABE skipped epochs #681