eskimor opened 1 year ago
As @AlistairStewart and I both said before, we should never reduce the number of blocks in an epoch, but instead either (a) make header only blocks, or else (b) make epochs last longer but have the same number of blocks.
We should discuss if we need consensus to add time between blocks à la (b). We could anyways provide finality on-chain using beefy, so then block rate is determined by when we last put some beefy finality proof on-chain. We do not, afaik, need consensus for (a), meaning we make header-only blocks ourselves and do not build on blocks with bodies.
Aside from auto reversion, we've several bad soundness solutions:
We do have one other viable seemingly soundness solution:
Auto reversion sounds much easier to analyze and minimizes unused code paths. Afaik (4) sounds nicer politically, but it requires quite careful analysis, and careful agreement between the analysis and all code paths, so this makes doing (4) quite tricky.
> As @AlistairStewart and I both said before, we should never reduce the number of blocks in an epoch, but instead either (a) make header only blocks, or else (b) make epochs last longer but have the same number of blocks.
But we are doing this already: This is how reverts work. The blocks already exist, we cannot make them empty after the fact.
Yes of course, given reverts have a higher priority, but yes we should be careful about how cheap reverts become elsewhere.
Update: Back-off has been removed, so reverting the chain at some point also makes sense to prevent excessive memory usage due to an ever growing unfinalized chain.
I'll think about this, but yeah we might not have any choice.
cc @AlistairStewart
Reverting the chain doesn't affect the number of unfinalised blocks we have to deal with. That still grows by 1 block every 6 seconds and any one of them might be finalised.
> Reverting the chain doesn't affect the number of unfinalised blocks we have to deal with. That still grows by 1 block every 6 seconds and any one of them might be finalised.
Why would that be? But I guess there are subtleties on when we consider a fork dead, especially if finality is still not moving after the revert. The implicit assumption here is that finality will work again once we reverted ... which might be true most of the time, but if not? Then we would just keep building very long forks ... not sure when any one of them gets pruned. This is something that needs to be checked.
We force prune currently after 4096 blocks (which is some random number and more a technical limitation of the current implementation).
> We force prune currently after 4096 blocks. (which is some random number and more a technical limitation of the current implementation)
@andresilva told me that recently, but not sure how this relates to forks. What does "force prune after 4096 blocks" mean when we have forks? Let's say we have ten forks, each with 410 blocks ... are we pruning now? And if so, what?
But what I was actually wondering about was something else: whether that "dead" long fork would stay around. But actually, given that we are telling Substrate about viable leaves (which exclude the reverted branch) ... I would assume that this fork gets pruned. But then I really don't get what @AlistairStewart wants to say above.
> Reverting the chain doesn't affect the number of unfinalised blocks we have to deal with. That still grows by 1 block every 6 seconds and any one of them might be finalised.
In my mind we would prune the fork once we decided to revert it. If then, because other nodes have not come to the same conclusion but instead were able to finalize that fork, I would expect us to treat it as "finality trumps reversion", and we would resync the chain we already pruned.
> @andresilva told me that recently, but not sure how this relates to forks. What does "force prune after 4096 blocks" mean when we have forks? Let's say we have ten forks, each with 410 blocks ... are we pruning now? And if so, what?
Force pruning/canonicalization means that we will prune all forks that are below the canonicalized block. We prune all forks that are not reachable from what the node sees as best chain (longest chain).
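The pruning rule described above can be sketched as follows (types hypothetical, not the actual Substrate client database API): a block survives forced canonicalization only if it lies on a path through the canonicalized block, i.e. it is either an ancestor of it or descends from it.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical sketch of forced canonicalization pruning.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct BlockId(u32);

/// Map from block to its parent; genesis has no entry.
type ParentMap = HashMap<BlockId, BlockId>;

fn chain_back_to_genesis(parents: &ParentMap, mut b: BlockId) -> HashSet<BlockId> {
    let mut seen = HashSet::new();
    seen.insert(b);
    while let Some(&p) = parents.get(&b) {
        seen.insert(p);
        b = p;
    }
    seen
}

/// A block survives only if it is an ancestor of the canonicalized block, or
/// the canonicalized block lies on its own chain back to genesis (i.e. the
/// block descends from it). Everything else is an unreachable fork: pruned.
fn surviving_blocks(parents: &ParentMap, canonical: BlockId, all: &[BlockId]) -> Vec<BlockId> {
    let canon_ancestry = chain_back_to_genesis(parents, canonical);
    all.iter()
        .copied()
        .filter(|&b| {
            canon_ancestry.contains(&b) || chain_back_to_genesis(parents, b).contains(&canonical)
        })
        .collect()
}

fn main() {
    // 0 -> 1 -> 2 -> 3 (best chain), with a fork 1 -> 10 -> 11.
    let mut parents = ParentMap::new();
    for (child, parent) in [(1, 0), (2, 1), (3, 2), (10, 1), (11, 10)] {
        parents.insert(BlockId(child), BlockId(parent));
    }
    let all: Vec<BlockId> = [0, 1, 2, 3, 10, 11].map(BlockId).to_vec();
    let kept = surviving_blocks(&parents, BlockId(2), &all);
    // The fork 10 -> 11 is not on a path through the canonicalized block 2.
    assert_eq!(kept, [0, 1, 2, 3].map(BlockId).to_vec());
    println!("kept: {:?}", kept);
}
```

This also answers the ten-forks question in spirit: once any block at the canonicalization depth is forced canonical, every fork not passing through it is dropped, regardless of how long those forks are.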
> What does "force prune after 4096 blocks" mean when we have forks?
We require that babe/sassafras alone yields probabilistic finality under longest chain within one epoch, which imposes upon us some minimum epoch length. We compute the relationship in basically the same way Cardano does, which gets discussed in the babe writeups and the sassafras paper.
How does reverting end up using different randomness?
> As @AlistairStewart and I both said before, we should never reduce the number of blocks in an epoch, but instead either (a) make header only blocks, or else (b) make epochs last longer but have the same number of blocks.
> But we are doing this already: This is how reverts work. The blocks already exist, we cannot make them empty after the fact.
The way I understood it is that we shouldn't have something like the current back-off strategy, where some slots are not even claimed, hence reducing the number of blocks in an epoch. Probably mainly due to randomness abuse.
Reverts seem less problematic here because they do not necessarily reduce the number of blocks in an epoch, but well... somewhat restart the epoch, so we need to rebuild those blocks again, but properly this time.
Let me get a few things straight:
The problems:
3. is a claim that is significantly stronger than 2.: 2. is guaranteed, but 3. is not during a grandpa finality lag. So is a block at depth 3k final or not? 2. does not yet treat it as such, but 3. does...

Both 2. and 3. generally address the same fundamental issue under the hood. They are plan Bs for huge finality lags, but simply (and arbitrarily) around different depths: 4k vs 2.4k.
@eskimor suggests that if finality lags so much, we should not revert to the classical blockchain ways of longest chain, but revert the chain itself. This solution would be anchored around the assumption of 3., but since 3. is stronger than 2., that would mean we would never encounter the logic of 2., making it effectively redundant since it takes effect at higher depths.
Whatever solution we pick there are 2 dimensions of freedom:
Everything else is a consequence of those 2 choices.
Exemplary solutions:
Based on what @burdges says:
> We require that babe/sassafras alone yields probabilistic finality under longest chain within one epoch, which imposes upon us some minimum epoch length.
The 4k canonisation depth is actually arbitrary and it should be set to the session (epoch) length, as shown in solution 2. Then all assumptions hold and no code changes are needed anywhere.
> Reverting the chain doesn't affect the number of unfinalised blocks we have to deal with. That still grows by 1 block every 6 seconds and any one of them might be finalised.
The reverted blocks can still get somehow finalized? Shouldn't we prune them once we revert? The only issue I see is that we need to keep some of the data to fight against intentional forking and using reverts to hide unsuccessful attack attempts.
I don't think this question has been answered and it's very relevant:
> whether that "dead" long fork would stay around
> Reverts seem less problematic here because they do not necessarily reduce the number of blocks in an epoch but well... somewhat restart the epoch so we need to rebuild them again but properly this time.
It does: epochs are based on time, hence if we revert (and rebuild new blocks, one per slot) we will always end up with fewer blocks in that epoch.
> endless loop of reversions
The way I described it above, we would make at least one block progress.
The issue with assuming canonicalization is parachain consensus. If finality is lagging because things are not getting approved, we have to assume an attack. If we then gave up after some time and just finalized, we would have turned a DoS attack into an attack on security and soundness. This is where the suggestion comes from to revert and build one block without any parachain data: then we have at least one block (in each session) which cannot get reverted. What could still happen is that in case of a network split, we later find a better fork and abandon our fork. I guess in that case it could still happen that we have two forks with the same session index pointing to different session data.
> It does: epochs are based on time, hence if we revert (and rebuild new blocks, one per slot) we will always end up with fewer blocks in that epoch.
Element conversation:
Does that mean that if we revert half a session worth of blocks that session will only ever have at most half a session worth of blocks? What if we revert just before the end of a session and we revert to the beginning of that session? Does the session end nearly immediately after the revert?
yes and yes.
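The "yes and yes" above follows from epochs being indexed by slot (i.e. by time), not by block count. A minimal sketch with hypothetical numbers (the 600-slot epoch length is illustrative, not the real parameter) shows why reverted slots are simply lost for the epoch:

```rust
// Sketch of time-based epochs (hypothetical epoch length): every slot maps to
// a fixed epoch, so blocks rebuilt after a revert land in *later* slots of the
// same epoch, and the reverted slots can never be filled again.
const EPOCH_LENGTH: u64 = 600; // slots per epoch, e.g. one hour at 6s slots

fn epoch_of(slot: u64) -> u64 {
    slot / EPOCH_LENGTH
}

fn main() {
    // We produced blocks in slots 0..=550 of epoch 0, then reverted back to
    // the block at slot 100. Rebuilding starts at the *current* slot, say 560,
    // not at slot 101:
    let revert_point = 100u64;
    let resume_slot = 560u64;
    assert_eq!(epoch_of(revert_point), epoch_of(resume_slot)); // still epoch 0
    // Blocks this epoch can still contain: the 101 kept ones plus slots 560..600.
    let max_blocks = (revert_point + 1) + (EPOCH_LENGTH - resume_slot);
    assert_eq!(max_blocks, 141); // far fewer than the 600 slots of the epoch
    println!("at most {} blocks in epoch 0", max_blocks);
}
```

The same arithmetic answers the second question: reverting to the start of a session just before its end leaves almost no slots, so the session ends nearly immediately after the revert.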
> The way I described it above, we would make at least one block progress.
One block of progress is somewhat misleading (I think?) as you also say that:
> This means if we revert back to our current safe reversion point, the block that is being built on top of it will be empty (from the perspective of parachains) and thus our current safe reversion point advanced one block.
The block being built to "progress" will be empty. It does not meaningfully transition the state machine and only pushes the number 1 higher.
The endless revert loop has 2 variants:
If we use the revert variant you propose:
2. Node side issues can easily be patched, and as soon as node operators update, the chain will stop auto-reverting and finally proceed as normal.

1. Runtime issues might be a bit more problematic, since I'm not sure we can update the runtime within the scope of a single session, and if we don't, then we just revert again. We also cannot really depend on the extra blocks we make once per revert, since they are empty and most likely will not help us remedy the situation. We will have an illusion of progress (empty blocks once in a while) but we will still perpetually revert until we hard fork.

'1.' and '2.' together are the same as just '1.', since this problem is bigger.
> If we would then give up after some time and just finalize, we just made a DoS attack an attack on security and soundness.
I agree that this is a problem. And especially so considering we have a heavily bounded number of validators in the active set, so DoSing them is an actual possibility (at least much easier compared to something like Bitcoin). Although, looking from that perspective, your claim isn't really to fix probabilistic soundness/finality; it is to literally remove probabilistic finality and only depend on guaranteed grandpa finality, as you believe we cannot trust any other finalization mechanism. This is the core of the issue here.
@AlistairStewart: is the canonisation depth at 4k safe, considering that all they need to do to compromise security/soundness is DoS/delay approvals for around 6h?
Re: call with @eskimor
Maybe consider a hybrid approach where, under some circumstances (approvals not completing), we use the revert tactic but otherwise default to probabilistic finality and forced canonisation even with finality stalls. This might offer the DoS protection we're looking for with the flexibility of probabilistic finality in most other cases. More analysis needed.
Another ingredient to consider: Being able to cache `SessionInfo` by index is kind of required to make, for example, disputes self-contained and is also good for performance in general. It would not necessarily need to be an index though. It could also likely be a `SessionId`: an identifier which for example includes the hash of the block that triggered the session change. We would basically have a "fork id" in the session id: `SessionId = (SessionIndex, ForkId)`, with `ForkId` = hash of the session change block.
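A minimal sketch of that identifier (names and types hypothetical, not the actual implementation): keying cached session data by the pair instead of the bare index disambiguates two forks that entered the same session index with different session data.

```rust
// Hypothetical sketch: a fork-aware session identifier.
type SessionIndex = u32;
type ForkId = [u8; 32]; // hash of the session-change block

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct SessionId {
    index: SessionIndex,
    fork: ForkId,
}

fn main() {
    // The same session index reached on two different forks yields two
    // distinct cache keys, so the two (possibly different) SessionInfos
    // can no longer be confused:
    let a = SessionId { index: 42, fork: [0xaa; 32] };
    let b = SessionId { index: 42, fork: [0xbb; 32] };
    assert_eq!(a.index, b.index);
    assert_ne!(a, b);
    println!("distinct ids for the same index: {:?} vs {:?}", a.fork[0], b.fork[0]);
}
```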
Downsides: consumers would need to determine the `SessionId` for this `SessionIndex` on the fork of block X.

Questions: Would it really help with the issue? If so, is it worth the trouble? After all, we would add quite a bit of overhead and also code complexity for covering a case that is likely to almost never occur in practice on a production network. Never say never of course, but if it is rare enough, solutions like just hard forking, patching the node, governance, etc. can be acceptable.
New idea:
Came up at the Chains retreat. Writing it up a bit more seriously, as this napkin won't make sense to anybody. We should be able to fix soundness (and trade it for liveness) by using on-chain finality proofs.
What we can do with finality proofs:
No forced reversions are necessary.
Slow down block production again in case of a finality lag. This is good to avoid validators running out of memory and other issues.
If we put finality on-chain then we could slow down block production, but our semantics remain tricky. Do we stretch the epoch, breaking DevX/UX? Do we cut slots, breaking our analysis?
It'll be easier if honest nodes making empty blocks suffices, which avoids requiring consensus. It might not suffice of course.
> While it is true that sessions keep progressing according to time, it does not mean we have to produce blocks (which is what is causing issues).
This remains unclear. Why is making relay chain blocks the issue? Isn't it more likely parachain blocks? Or some DoS attack?
Is PVF availability the real topic here?
> Producing a block in session N+1 would only be valid if a finality proof for the block providing the randomness for that session is (made) available.
It's unclear what this means, since it's a past relay chain block you're building upon. It fixes the randomness by virtue of its place in the past.
Praos-like protocols like Babe and Sassafras already yield secure randomness, even without a finality gadget like grandpa, but this analysis requires you do not slow the chain down too much. That's part of why I've always favored empty blocks.
Afaik off-chain PVF availability would not be ensured by Praos though, but this doesn't prevent us arguing it in some other way.
It's anyways elves/approvals that deviates from being Praos-like for us, so here we're likely only discussing approvals being slow, likely from no-shows or disputes. A 1/3rd adversary who no-shows everywhere basically escalates everything.
We've two or three "free" backoff locations: (1) we're already doing some backoff by honest backers. We could have backoff by honest relay chain block producers too, likely by (2) dropping backing statements, but maybe also by (3) delaying availability finishing, ala inclusion.

A priori we'd back off as much as 8/9ths, given by 1/3rd of dishonest backers from (1), and their blocks being on-chain maybe 1/3rd of the time from (2). This may be enough!
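The 8/9ths figure follows from multiplying the two independent fractions. A tiny sketch of the arithmetic (assuming, as stated above, a 1/3 dishonest minority at both locations):

```rust
// Arithmetic behind the "as much as 8/9ths" figure: if honest backers (2/3)
// back off, only the dishonest 1/3 still back candidates; if honest relay
// chain authors (2/3 of blocks) additionally drop those backing statements,
// the remaining throughput is (1/3) * (1/3) = 1/9, i.e. a backoff of 8/9.
fn main() {
    let dishonest_backers = 1.0_f64 / 3.0;
    let dishonest_block_authors = 1.0_f64 / 3.0;
    let remaining_throughput = dishonest_backers * dishonest_block_authors;
    let backoff = 1.0 - remaining_throughput;
    assert!((remaining_throughput - 1.0 / 9.0).abs() < 1e-12);
    assert!((backoff - 8.0 / 9.0).abs() < 1e-12);
    println!("throughput {:.3}, backoff {:.3}", remaining_throughput, backoff);
}
```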
In Babe we assume probabilistic finality after one session. In fact, we assume this kind of finality in a couple of other places as well:
Usages of Probabilistic Finality
Configuration Changes
We buffer configuration changes and apply them at the next session. The logic as described here ensures that configuration changes are always buffered for at least a full session. Assuming probabilistic finality, this ensures that for a given `SessionIndex` you will always get the same configuration.

Session Information

Similar to configuration, other information, like the current validator set, is predetermined a session ahead of time, with the assumption that the set is then indeed uniquely identified by the `SessionIndex`. When constructing a `SessionInfo` on a session boundary, we rely on the randomness of one epoch ago. This way, assuming probabilistic finality, we will always arrive at the same `SessionInfo`.

Expected PVF existence

For executing PoVs, we assume the PVF to be present on any current chain. This is also based on the assumption of probabilistic finality and is the reason we delay the usage of a PVF for at least one complete session. If this did not hold, validators would not be able to participate in a dispute, for example.
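The configuration buffering described above can be sketched as follows (shape and names hypothetical, not the real runtime API): a change submitted during a session is staged and only becomes active at a later session boundary, so within any single session the active configuration is constant.

```rust
// Hypothetical sketch of buffered configuration changes.
#[derive(Debug)]
struct Config {
    max_validators: u32,
}

struct Host {
    active: Config,
    pending: Option<Config>,
}

impl Host {
    fn set_config(&mut self, c: Config) {
        // Buffered: does not touch `active` until a session boundary.
        self.pending = Some(c);
    }

    fn on_session_change(&mut self) {
        if let Some(c) = self.pending.take() {
            self.active = c;
        }
    }
}

fn main() {
    let mut host = Host { active: Config { max_validators: 100 }, pending: None };
    host.set_config(Config { max_validators: 200 });
    // Still the old value for the rest of this session:
    assert_eq!(host.active.max_validators, 100);
    host.on_session_change();
    assert_eq!(host.active.max_validators, 200);
    println!("active after boundary: {:?}", host.active);
}
```

This is exactly the property that reverting across more than one session boundary breaks: the "same `SessionIndex`, same configuration" guarantee only holds if the buffered values are never consumed twice with different inputs.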
The Problem
While we allow disputes for a dispute window of 6 sessions, reverting that far back is absolutely not sound! For all the reasons noted above, and even if you only think about the dispute itself: we want to be able to slash validators. Now if we just revert far enough, we might be forced to revert to a point just before the validator election, with the to-be-slashed validators not even existing on chain: we would not be able to slash at all! This is just one extreme example to illustrate how bad this really is, but in general the probabilistic finality assumption is baked in quite deeply in a lot of places; if it is broken, we lose the ability to reason about the chain.
Actual Soundness Issues
Now all of this might sound rather theoretical, but we had actual soundness issues in the wild: we found on a test net that signature checks suddenly failed. Signatures by validators became invalid, although they had been valid before! This actually happened with chain reversions on a test network. How was this possible?
We ended up constructing a different `SessionInfo`, with a differently (shuffled) validator set than the last time we entered that session.

Granted, given the long session times we have on a production network, this is unlikely to ever happen in production. But with chain reversions it is actually possible, assuming some issue or an attack: an attacker can make us lose soundness by just DoSing long enough, where the time needed is exactly known in advance: only one hour on Kusama!
The question is, can we do better? Indeed we can: We can make the issue less severe and trade soundness for liveness, which is a good deal in general, but in this particular case even more so:
Solution
Problem Statement
Losing soundness due to some bug or an attack is not a good position to be in. What can we do about it? First we have to realize the prerequisites:
If you consider (1), you will realize that we already have a liveness issue: block production has already slowed down a lot, so reverting a session's worth of blocks, for example, would not actually be that much reverted work. This already leads to the solution I am proposing:
We assume probabilistic finality after one session, yet we allow chain reversions of a larger depth - essentially ruining our probabilistic finality assumption. So what can we do about it?
We need to ensure it holds.
Let's assume the following situation:
The `^` marks the tip of the longest chain that can be assumed "reversion safe"; for now, let's assume this is the same as finality as in GRANDPA.
Now what property do we need to maintain? Once we are in session `N` (past the first `|`), we would be using e.g. randomness and configurations set in session `N - 1` for creating session `N + 1`. Session `N` is not affected directly by session `N - 1`. Hence it is safe to start producing blocks in session `N`, even though we might end up reverting back into session `N - 1`. What would not be safe is if we were already in session `N + 1` and reverted back to session `N - 1`, as then the randomness and configuration of session `N - 1` would have been used already. If we reverted then, different randomness would be used. In other words: it is safe to revert across one session boundary, it is not safe to revert across two. Or, put differently again: we cannot revert more than a complete session.

In the above illustration, the current safe reversion point is in session `N-1`. This means crossing the session boundary into `N+1`, marked with the exclamation mark `!`, is not safe! Hence we should not! But session changes are determined based on time. So how can we not move to the next session? The answer is: we cannot, we have to do the session change, but we can do it in a way that is safe!

If we find ourselves in the situation that the child of the current block would already be in session `N+1`, but our current safe reversion point is still in session `N-1`, we need to build that child block on our current safe reversion point! In other words, before entering session `N+1` we have to ensure that there does not exist any block that could later be reverted due to a dispute. We do that via a preemptive reversion to our safe point, resulting in:

Assuming finality is still halted, this can continue to happen again and again. Every time we would become unsound, we revert, and by this we maintain soundness. If you look closely at the previous graph you will notice that the chain is still progressing:
Ensuring Chain Progress / Liveness Considerations
Early reversion might sound problematic from a liveness perspective, but it is actually less of a problem than it might initially appear. First, as mentioned previously, with such a long finality stall block production will be very slow, so the actual amount of work being reverted is not that much. Second, the reversion technique still guarantees that the chain will make progress. In particular, it is ensured that the chain makes at least one block per session.
This is because the current safe reversion point is in fact not just the finalized chain, but any tip of the chain which only contains blocks that are either finalized or empty as far as parachain consensus is concerned: if there are no candidates included in a chain, it can also never be reverted by a dispute. This means if we revert back to our current safe reversion point, the block that is being built on top of it will be empty (from the perspective of parachains), and thus our current safe reversion point advances one block.
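The definition of the safe reversion point above can be sketched as a simple walk from the finalized block towards the tip (types hypothetical, not the real node-side data structures): the safe point is the highest block on the path such that every block above the finalized one is empty of parachain candidates, since such blocks can never be reverted by a dispute.

```rust
// Hypothetical sketch of computing the "current safe reversion point".
struct Block {
    number: u64,
    finalized: bool,
    has_candidates: bool, // does this block include parachain candidates?
}

/// `chain` is ordered from the finalized block up to the tip.
/// Returns the number of the highest block we can safely treat as
/// non-revertable.
fn safe_reversion_point(chain: &[Block]) -> u64 {
    let mut safe = chain[0].number; // the finalized block is always safe
    for b in &chain[1..] {
        if b.finalized || !b.has_candidates {
            safe = b.number; // extend the safe prefix
        } else {
            break; // a block with candidates could still be disputed
        }
    }
    safe
}

fn main() {
    let chain = vec![
        Block { number: 100, finalized: true,  has_candidates: true  },
        Block { number: 101, finalized: false, has_candidates: false }, // empty
        Block { number: 102, finalized: false, has_candidates: true  },
        Block { number: 103, finalized: false, has_candidates: false },
    ];
    assert_eq!(safe_reversion_point(&chain), 101);
    // After reverting to 101 and authoring an empty block on top of it, the
    // safe point advances by one: at least one block of progress per revert.
    let after_revert = vec![
        Block { number: 100, finalized: true,  has_candidates: true  },
        Block { number: 101, finalized: false, has_candidates: false },
        Block { number: 102, finalized: false, has_candidates: false },
    ];
    assert_eq!(safe_reversion_point(&after_revert), 102);
}
```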
Session Changes on Reversion
If you look at the above graphs, you will realize that the block being built on top of the current safe point will still be in session `N-1` as far as parachain consensus is concerned. This is because we predetermine the session of the next block in advance. The actual session change to session `N` will happen in the next block. More details are very relevant to completely reason about this, inquiry pending.

Implementation Considerations
Depending on what we will learn here, the suggested logic should be implementable purely on the node side. No runtime changes should be needed. Argument for the base case (argument for repeated reversions to follow, once the question on session changes is resolved):
The decision to revert happens right at the session change to session `N + 1`, based on the finality state at the session boundary to session `N` ... so a full session away. It is reasonable to expect a super majority to be in sync on the finality state of a block that old. More importantly, assuming finality is actually lagging that much, it is impossible for any honest node to come to the conclusion that it is fine to continue building on that chain, as the needed finality votes don't exist! Nor would any honest node import such a block. Hence, for all intents and purposes, such a soundness-breaking block cannot exist with this reversion rule in place.

The worst case for a node side implementation would be that finality is catching up exactly at the time we decide to revert, resulting in a race. We will definitely need to check that the reversion logic handles this case well: if a reverted chain gets finalized, the reversion should be ignored. For this argument let's assume this is the case; what scenarios can occur:
In all those cases, including the worst case (1), things should work out just fine if finality takes precedence over reversion. The fork based on the reverted chain will simply get abandoned, because it fell behind finality.
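Putting the reversion rule together, a minimal sketch of the check a block author would run (helper name and signature hypothetical, not the real `session_index_for_child()` API): compare the session the child block would be in against the session of the current safe reversion point, and force a preemptive revert whenever the gap would reach two session boundaries.

```rust
// Hypothetical sketch of the "revert before crossing two session
// boundaries" rule.
type SessionIndex = u32;

/// Reverting across one session boundary is safe, across two is not, so we
/// must revert before authoring whenever the child block would be two or
/// more sessions ahead of the safe reversion point.
fn must_revert_before_authoring(
    child_session: SessionIndex,
    safe_point_session: SessionIndex,
) -> bool {
    child_session > safe_point_session + 1
}

fn main() {
    // Safe point in session N-1 = 9, child block would be in session N = 10:
    // still fine, only one boundary between them.
    assert!(!must_revert_before_authoring(10, 9));
    // Child block would be in session N+1 = 11 while the safe point is still
    // in session 9: building it could later require reverting across two
    // session boundaries, so we preemptively revert to the safe point first.
    assert!(must_revert_before_authoring(11, 9));
    println!("rule holds for the N-1/N/N+1 example");
}
```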
Implementation Steps
`session_index_for_child()` indicates another session change. It is important to do this race-free before authoring a block. E.g. just having the dispute coordinator send `RevertBlocks` would be insufficient, as if that message ran late, we could still be producing that unsound block.