onflow / flow-go

A fast, secure, and developer-friendly blockchain built to support the next generation of games, apps, and the digital assets that power them.
GNU Affero General Public License v3.0
531 stars 171 forks source link

[EFM | M2] Epoch State Machines should not use `parentState` in their business logic #6019

Open AlexHentschel opened 1 month ago

AlexHentschel commented 1 month ago

Problem Definition

We have different state machines for orchestrating Epoch transitions in the package state/protocol/protocol_state/epochs. The current implementation use the parentState in their business logic in numerous places.

In my opinion, this is extremely error prone. To the best of my understanding, the current code has at least one security-critical bug related to this here. Specifically, buildNextEpochActiveParticipants has a check to enforce that already ejected nodes are not re-admitted. For the current logic, two service events eject node X followed by a Epoch Setup in the same block could be a blind spot for this check. Because we always compare to the parent state (where ejection has not jet been applied). The current code would accept an Epoch Setup, where node X is still listed as a participant and override the prior ejection.

In summary, when reviewing the state machines in the package epochs there are various places where I could not convince myself of the correctness of the code, because the parent state was just used without clear documentation why this is safe.

Context

I think this design flaw is in part originating from our separation of ProtocolStateEntry and RichProtocolStateEntry:

Goal

The goal of this issue is three-fold:

  1. I would be inclined to remove as many usages of parentState as possible. In my opinion, the only place should be the state machine constructor, where we copy the parent.
  2. In all remaining places where we cannot avoid using parentState, we must include a detailed comment explaining why this is safe even in the case of prior ejection events sealed in the current block.
  3. include tests for the HappyPathStateMachine as well as the FallbackStateMachine that confirm the correct behaviour in the following case:
    • we have one block sealing two service events: eject node X followed by a Epoch Setup
    • Epoch Setup lists node X participant, which would illegally override the prior ejection. As the smart contract emitted the ejection event before, it should account for this and not include X in the setup.
    • Encountering such a service event should trigger EFM.

suggestion

I would suggest to add an interim struct between ProtocolStateEntry and RichProtocolStateEntry. In more detail:

AlexHentschel commented 1 month ago

Reminder to myself:

I think it would improve clarity if we moved the logic for ejection into a dedicated subcomponent. Initializing the lookups is done in several places without any need on the hot path and tracking that we really update it everywhere correctly is highly non-trivial.

Sketch of suggested approach:

introduce the auxiliary component ejector :


type trackedDynamicIdentityList struct {
    dynamicIdentites flow.DynamicIdentityEntryList
    identityLookup   map[flow.Identifier]*flow.DynamicIdentityEntry // dynamicIdentites.Lookup() LAZY initialized
}

type ejector struct {
    identityLists []trackedDynamicIdentityList
    ejected       []flow.Identifier
}

func (e *ejector) Eject(nodeID flow.Identifier) {
    l := len(e.identityLists)
    if len(e.ejected) == 0 { // if this is the first ejection sealed in this block, we have to populate the lookup first
        for i:= 0; i < l; i++ {
            e.identityLists[i].identityLookup = e.identityLists[i].dynamicIdentites.Lookup()
        }
    }

    for i:= 0; i < l; i++ {
        dynamicIdentity, found := e.identityLists[i].identityLookup[nodeID]
        if found {
            dynamicIdentity.Ejected = true
        }
    }
}

func (e *ejector) TrackDynamicIdentityList(list flow.DynamicIdentityEntryList) error {
    tracker := trackedDynamicIdentityList{dynamicIdentites: list}
    if len(e.ejected) > 0 {
        // nodes were already ejected in this block, so their ejection should not be reverted by in the new `list`
        tracker.identityLookup = list.Lookup()
        for _, id := range e.ejected {
            dynamicIdentity, found := tracker.identityLookup[id]
            if found && !dynamicIdentity.Ejected { 
                return sentinelErrorf("node %v was previously ejected but next DynamicIdentityList reverts their ejection status", id)
            }
        }
    }
    e.identityLists = append(e.identityLists, tracker)
    return nil
}

thereby we can remove the lookups from the baseStateMachine, because the ejector now takes care of it

The ejector is more efficient (will never populate any lookups without ejection) and the ejector is much stronger encapsulated (none of the state machines need to worry about ejection and lookup).

AlexHentschel commented 2 weeks ago

TODO from PR #6079 regarding fallback statemachine test TestProcesingMultipleEventsAtTheSameBlock (see this comment)

[Alex] could you please add ejection events here too? I think we have uncovered edge cases, where we should reject a recovery event that would be re-emitting an ejected node. Though, I think the current implementation would accept this edge case.

You could add this as a ToDo to https://github.com/onflow/flow-go/issues/6019. It would be more or less a test verifying that the current implementation has problems. My gut feeling is that the use if partentState causes the problems (not 100 sure that there is really an uncovered edge case through)

[Yurii] I have added some stuff in 6e09f32 (#6079) but it will require extra love when implementing 6019. But you are right, currently there is an edge case and epoch recover event is accepted even though it readmits an ejected entity. We actually end up in broken state where current epoch has an entity ejected and in next epoch it's not.