[EFM | M2] Epoch State Machines should not use `parentState` in their business logic

Problem Definition

We have different state machines for orchestrating Epoch transitions in the package state/protocol/protocol_state/epochs. The current implementation use the parentState in their business logic in numerous places.

In my opinion, this is extremely error prone. To the best of my understanding, the current code has at least one security-critical bug related to this here. Specifically, buildNextEpochActiveParticipants has a check to enforce that already ejected nodes are not re-admitted. For the current logic, two service events eject node X followed by a Epoch Setup in the same block could be a blind spot for this check. Because we always compare to the parent state (where ejection has not jet been applied). The current code would accept an Epoch Setup, where node X is still listed as a participant and override the prior ejection.

In summary, when reviewing the state machines in the package epochs there are various places where I could not convince myself of the correctness of the code, because the parent state was just used without clear documentation why this is safe.

Context

I think this design flaw is in part originating from our separation of ProtocolStateEntry and RichProtocolStateEntry:

While the parent state is available as RichProtocolStateEntry, the evolved protocol state is only maintained as ProtocolStateEntry.
- This is on purpose, because we don't want to deal with the complexity of also maintaining the CurrentEpochIdentityTable and NextEpochIdentityTable throughout the epoch state evolution. As those fields are part of the RichProtocolStateEntry, we only work with the ProtocolStateEntry when evolving the state.
- Unfortunately, this has the undesired side effect that the evolving state has reference to the full EpochSetup and EpochCommit events. Whenever we need those, we need to go back to the parent state. And this is where the confusing mix of using the evolving state as well as the unevolved parents originates in my opinion.
Furthermore, we have various service methods that work on the RichProtocolStateEntry, because they want the service events though don't care about the Identity tables. Thereby we have further instances, where we might erroneously use the parentState instead of the updated state

Goal

The goal of this issue is three-fold:

I would be inclined to remove as many usages of parentState as possible. In my opinion, the only place should be the state machine constructor, where we copy the parent.
In all remaining places where we cannot avoid using parentState, we must include a detailed comment explaining why this is safe even in the case of prior ejection events sealed in the current block.
include tests for the HappyPathStateMachine as well as the FallbackStateMachine that confirm the correct behaviour in the following case:
- we have one block sealing two service events: eject node X followed by a Epoch Setup
- Epoch Setup lists node X participant, which would illegally override the prior ejection. As the smart contract emitted the ejection event before, it should account for this and not include X in the setup.
- Encountering such a service event should trigger EFM.

suggestion

I would suggest to add an interim struct between ProtocolStateEntry and RichProtocolStateEntry. In more detail:

rename ProtocolStateEntry to EpochMinStateEntry

add struct EpochStateEntry:

type EpochStateEntry struct {
*EpochMinStateEntry

PreviousEpochSetup        *EpochSetup
PreviousEpochCommit       *EpochCommit
CurrentEpochSetup         *EpochSetup
CurrentEpochCommit        *EpochCommit
NextEpochSetup            *EpochSetup
NextEpochCommit           *EpochCommit
}

rename RichProtocolStateEntry to EpochRichStateEntry and base it on EpochStateEntry

type EpochRichStateEntry struct {
EpochStateEntry

CurrentEpochIdentityTable IdentityList
NextEpochIdentityTable    IdentityList
}

Most of the existing code will only care about EpochMinStateEntry and EpochRichStateEntry (as before)
However, code that needs the EpochMinStateEntry plus the actual service events would use EpochStateEntry. In particular, this would be the state machines. Thereby, we can cleanly avoid any usage of the parentState in the logic that processes service events.

Reminder to myself:

I think it would improve clarity if we moved the logic for ejection into a dedicated subcomponent. Initializing the lookups is done in several places without any need on the hot path and tracking that we really update it everywhere correctly is highly non-trivial.

Sketch of suggested approach:

introduce the auxiliary component ejector :


type trackedDynamicIdentityList struct {
    dynamicIdentites flow.DynamicIdentityEntryList
    identityLookup   map[flow.Identifier]*flow.DynamicIdentityEntry // dynamicIdentites.Lookup() LAZY initialized
}

type ejector struct {
    identityLists []trackedDynamicIdentityList
    ejected       []flow.Identifier
}

func (e *ejector) Eject(nodeID flow.Identifier) {
    l := len(e.identityLists)
    if len(e.ejected) == 0 { // if this is the first ejection sealed in this block, we have to populate the lookup first
        for i:= 0; i < l; i++ {
            e.identityLists[i].identityLookup = e.identityLists[i].dynamicIdentites.Lookup()
        }
    }

    for i:= 0; i < l; i++ {
        dynamicIdentity, found := e.identityLists[i].identityLookup[nodeID]
        if found {
            dynamicIdentity.Ejected = true
        }
    }
}

func (e *ejector) TrackDynamicIdentityList(list flow.DynamicIdentityEntryList) error {
    tracker := trackedDynamicIdentityList{dynamicIdentites: list}
    if len(e.ejected) > 0 {
        // nodes were already ejected in this block, so their ejection should not be reverted by in the new `list`
        tracker.identityLookup = list.Lookup()
        for _, id := range e.ejected {
            dynamicIdentity, found := tracker.identityLookup[id]
            if found && !dynamicIdentity.Ejected { 
                return sentinelErrorf("node %v was previously ejected but next DynamicIdentityList reverts their ejection status", id)
            }
        }
    }
    e.identityLists = append(e.identityLists, tracker)
    return nil
}

thereby we can remove the lookups from the baseStateMachine, because the ejector now takes care of it

The ejector is more efficient (will never populate any lookups without ejection) and the ejector is much stronger encapsulated (none of the state machines need to worry about ejection and lookup).

TODO from PR #6079 regarding fallback statemachine test TestProcesingMultipleEventsAtTheSameBlock (see this comment)

[Alex] could you please add ejection events here too? I think we have uncovered edge cases, where we should reject a recovery event that would be re-emitting an ejected node. Though, I think the current implementation would accept this edge case.

You could add this as a ToDo to https://github.com/onflow/flow-go/issues/6019. It would be more or less a test verifying that the current implementation has problems. My gut feeling is that the use if partentState causes the problems (not 100 sure that there is really an uncovered edge case through)

[Yurii] I have added some stuff in 6e09f32 (#6079) but it will require extra love when implementing 6019. But you are right, currently there is an edge case and epoch recover event is accepted even though it readmits an ejected entity. We actually end up in broken state where current epoch has an entity ejected and in next epoch it's not.

onflow / flow-go