Open AlexHentschel opened 1 month ago
I think it would improve clarity if we moved the logic for ejection into a dedicated subcomponent. Initializing the lookups is done in several places without any need on the hot path and tracking that we really update it everywhere correctly is highly non-trivial.
Sketch of suggested approach:
introduce the auxiliary component ejector
:
type trackedDynamicIdentityList struct {
dynamicIdentites flow.DynamicIdentityEntryList
identityLookup map[flow.Identifier]*flow.DynamicIdentityEntry // dynamicIdentites.Lookup() LAZY initialized
}
type ejector struct {
identityLists []trackedDynamicIdentityList
ejected []flow.Identifier
}
func (e *ejector) Eject(nodeID flow.Identifier) {
l := len(e.identityLists)
if len(e.ejected) == 0 { // if this is the first ejection sealed in this block, we have to populate the lookup first
for i:= 0; i < l; i++ {
e.identityLists[i].identityLookup = e.identityLists[i].dynamicIdentites.Lookup()
}
}
for i:= 0; i < l; i++ {
dynamicIdentity, found := e.identityLists[i].identityLookup[nodeID]
if found {
dynamicIdentity.Ejected = true
}
}
}
func (e *ejector) TrackDynamicIdentityList(list flow.DynamicIdentityEntryList) error {
tracker := trackedDynamicIdentityList{dynamicIdentites: list}
if len(e.ejected) > 0 {
// nodes were already ejected in this block, so their ejection should not be reverted by in the new `list`
tracker.identityLookup = list.Lookup()
for _, id := range e.ejected {
dynamicIdentity, found := tracker.identityLookup[id]
if found && !dynamicIdentity.Ejected {
return sentinelErrorf("node %v was previously ejected but next DynamicIdentityList reverts their ejection status", id)
}
}
}
e.identityLists = append(e.identityLists, tracker)
return nil
}
thereby we can remove the lookups from the baseStateMachine
, because the ejector
now takes care of it
The ejector is more efficient (will never populate any lookups without ejection) and the ejector
is much stronger encapsulated (none of the state machines need to worry about ejection and lookup).
TODO from PR #6079 regarding fallback statemachine test TestProcesingMultipleEventsAtTheSameBlock
(see this comment)
[Alex] could you please add ejection events here too? I think we have uncovered edge cases, where we should reject a recovery event that would be re-emitting an ejected node. Though, I think the current implementation would accept this edge case.
You could add this as a ToDo to https://github.com/onflow/flow-go/issues/6019. It would be more or less a test verifying that the current implementation has problems. My gut feeling is that the use if
partentState
causes the problems (not 100 sure that there is really an uncovered edge case through)[Yurii] I have added some stuff in 6e09f32 (#6079) but it will require extra love when implementing 6019. But you are right, currently there is an edge case and epoch recover event is accepted even though it readmits an ejected entity. We actually end up in broken state where current epoch has an entity ejected and in next epoch it's not.
Problem Definition
We have different state machines for orchestrating Epoch transitions in the package
state/protocol/protocol_state/epochs
. The current implementation use theparentState
in their business logic in numerous places.In my opinion, this is extremely error prone. To the best of my understanding, the current code has at least one security-critical bug related to this here. Specifically,
buildNextEpochActiveParticipants
has a check to enforce that already ejected nodes are not re-admitted. For the current logic, two service eventseject node X
followed by aEpoch Setup
in the same block could be a blind spot for this check. Because we always compare to the parent state (where ejection has not jet been applied). The current code would accept anEpoch Setup
, where nodeX
is still listed as a participant and override the prior ejection.In summary, when reviewing the state machines in the package
epochs
there are various places where I could not convince myself of the correctness of the code, because the parent state was just used without clear documentation why this is safe.Context
I think this design flaw is in part originating from our separation of
ProtocolStateEntry
andRichProtocolStateEntry
:RichProtocolStateEntry
, the evolved protocol state is only maintained asProtocolStateEntry
.CurrentEpochIdentityTable
andNextEpochIdentityTable
throughout the epoch state evolution. As those fields are part of theRichProtocolStateEntry
, we only work with theProtocolStateEntry
when evolving the state.EpochSetup
andEpochCommit
events. Whenever we need those, we need to go back to the parent state. And this is where the confusing mix of using the evolving state as well as the unevolved parents originates in my opinion.RichProtocolStateEntry
, because they want the service events though don't care about the Identity tables. Thereby we have further instances, where we might erroneously use theparentState
instead of the updatedstate
Goal
The goal of this issue is three-fold:
parentState
as possible. In my opinion, the only place should be the state machine constructor, where we copy the parent.parentState
, we must include a detailed comment explaining why this is safe even in the case of prior ejection events sealed in the current block.HappyPathStateMachine
as well as theFallbackStateMachine
that confirm the correct behaviour in the following case:eject node X
followed by aEpoch Setup
Epoch Setup
lists nodeX
participant, which would illegally override the prior ejection. As the smart contract emitted the ejection event before, it should account for this and not includeX
in the setup.suggestion
I would suggest to add an interim struct between
ProtocolStateEntry
andRichProtocolStateEntry
. In more detail:ProtocolStateEntry
toEpochMinStateEntry
add struct
EpochStateEntry
:rename
RichProtocolStateEntry
toEpochRichStateEntry
and base it onEpochStateEntry
EpochMinStateEntry
andEpochRichStateEntry
(as before)EpochMinStateEntry
plus the actual service events would useEpochStateEntry
. In particular, this would be the state machines. Thereby, we can cleanly avoid any usage of theparentState
in the logic that processes service events.