onflow / flow-go

A fast, secure, and developer-friendly blockchain built to support the next generation of games, apps, and the digital assets that power them.
GNU Affero General Public License v3.0

Chunk Data Model supports service event indices #6744

Open jordanschalm opened 1 week ago

jordanschalm commented 1 week ago

This PR adds support for specifying which service events were emitted in which chunk, by modifying the ChunkBody data structure in a backward compatible manner. Addresses #6622.
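
For readers skimming the thread, here is a minimal sketch of the shape of the change. Field names, the index type, and the nil convention are assumptions inferred from this description, not the exact definitions in model/flow/chunk.go:

```go
// Sketch only: illustrates the backward-compatible extension described above.
// The concrete definition lives in model/flow/chunk.go and may differ in
// field names and types.
package flow

// ChunkBody describes the execution of one chunk of a block.
type ChunkBody struct {
	// ... existing fields (collection index, start state, event collection, ...) elided ...

	// ServiceEventIndices lists, for this chunk, the indices into the
	// execution result's service event list of the events this chunk emitted.
	// Backward compatibility: a nil slice marks a result produced under the
	// old convention, where all service events are attributed to the system
	// chunk. (Index type chosen purely for illustration.)
	ServiceEventIndices []uint32
}
```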

Changes

Upgrade Notes

Currently, no service events are emitted outside of the system chunk. Let's call the first block in which a service event is emitted outside the system chunk block X.

This PR replaces two prior approaches, implemented in part https://github.com/onflow/flow-go/pull/6629 and https://github.com/onflow/flow-go/pull/6730.

codecov-commenter commented 1 week ago

Codecov Report

Attention: Patch coverage is 70.14925% with 20 lines in your changes missing coverage. Please review.

Project coverage is 41.73%. Comparing base (d0c9695) to head (718a5ee).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| utils/slices/slices.go | 0.00% | 9 Missing :warning: |
| utils/unittest/encoding.go | 0.00% | 5 Missing :warning: |
| model/flow/chunk.go | 81.25% | 2 Missing and 1 partial :warning: |
| module/chunks/chunkVerifier.go | 57.14% | 2 Missing and 1 partial :warning: |
Additional details and impacted files

```diff
@@                Coverage Diff                @@
##        feature/efm-recovery    #6744   +/-  ##
========================================================
+ Coverage               41.70%   41.73%   +0.02%
========================================================
  Files                    2030     2031       +1
  Lines                  180462   180519      +57
========================================================
+ Hits                    75261    75331      +70
+ Misses                  99003    98996       -7
+ Partials                 6198     6192       -6
```

| [Flag](https://app.codecov.io/gh/onflow/flow-go/pull/6744/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=onflow) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/onflow/flow-go/pull/6744/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=onflow) | `41.73% <70.14%> (+0.02%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=onflow#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.



AlexHentschel commented 6 days ago

strategic / conceptual thoughts

  1. I am thinking about enforcing activation of the protocol extension (specifically, the usage of the Chunk.ServiceEventIndices field). For example, it would be nice if consensus nodes could drop Execution Results that still use the deprecated format, avoid incorporating them into a block, and reject blocks that still do so.
  2. To ensure consensus incorporates only execution receipts following the new convention after a certain height, it would be great if we could also include a consistency check in the receiptValidator (somewhere around here).
  3. I was wondering whether your plan is still to remove the Chunk.ServiceEventIndices field for Byzantine Fault Tolerance in the long term? I think we had talked about turning ExecutionResult.ServiceEventList into an indexed list. Or are you thinking of keeping the ServiceEventIndices field as the long-term solution -- just removing the backwards-compatibility case?

    In general, my preference would be to also allow Chunk.ServiceEventIndices = nil when there are no service events generated in the chunk. That way, the final solution becomes a lot more intuitive:

    • if ExecutionResult.ServiceEvents is not empty (nil allowed), then the new convention requires for consistency:

      $$\sum_{\texttt{Chunk}} \texttt{len}(\texttt{Chunk.ServiceEventIndices}) = \texttt{len}(\texttt{ExecutionResult.ServiceEventList}) \qquad (1)$$

      I feel this check is very similar to other properties that consensus nodes already verify about an Execution Receipt (see the ReceiptValidator implementation); a minimal sketch of such a check appears below, after this list.

    • We would temporarily relax the new convention, eq. $(1)$, for downwards compatibility as follows: the ServiceEventIndices fields of all chunks may be nil despite there being service events.
    • As service events are rare, ExecutionResult.ServiceEventList is empty in the majority of cases. Then both the deprecated and the new convention allow ChunkBody.ServiceEventIndices to be nil (which is the most intuitive convention anyway). Also, for individual chunks that don't produce any service events, their ChunkBody.ServiceEventIndices can be nil or empty under the new convention. So the new convention is very self-consistent in my opinion, and the deprecation condition is only an add-on that can be removed later.

I mean, if this is only a temporary solution, I am happy and we can skip most of question 3.
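
To make eq. $(1)$ and the temporary relaxation above concrete, here is a minimal sketch of such a consistency check, reusing the ChunkBody sketch from the PR description. The function name, placement, and error handling are illustrative assumptions, not the actual receiptValidator code:

```go
package flow // continues the ChunkBody sketch above; illustrative only

import "fmt"

// checkServiceEventIndices sketches consistency rule (1) plus the temporary
// backward-compatibility relaxation discussed above.
func checkServiceEventIndices(chunks []ChunkBody, serviceEventCount int) error {
	allNil := true
	indexed := 0
	for _, chunk := range chunks {
		if chunk.ServiceEventIndices != nil {
			allNil = false
		}
		indexed += len(chunk.ServiceEventIndices)
	}
	if allNil {
		// Deprecated format: every chunk leaves ServiceEventIndices nil.
		// Tolerated only during the transition; once the upgrade is enforced,
		// this case would be rejected whenever serviceEventCount > 0.
		return nil
	}
	// New convention, eq. (1): the per-chunk indices must collectively account
	// for every entry in the execution result's service event list.
	if indexed != serviceEventCount {
		return fmt.Errorf("chunks index %d service events, expected %d", indexed, serviceEventCount)
	}
	return nil
}
```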

jordanschalm commented 3 days ago

Summary of Discussion with @AlexHentschel

Change of ServiceEventIndices structure

Removing backward-compatibility at next spork

We plan to keep the overall structure (i.e. the Chunk.ServiceEventIndices field) as the long-term solution, and only remove the backward-compatibility support at the next spork. We do not plan to switch to a different representation.

Upgrade Comments

Rough Outline of Process

  1. Do a manual rolling upgrade of all node roles, to a version including feature/efm-recovery.
  2. Manually verify that all ENs are upgraded (this is the only correctness-critical step!)
    • this is necessary prior to emitting the first ex-system-chunk service event
    • ideally, VNs are also updated, but we can rely on emergency sealing as a fallback if necessary
  3. Emit ProtocolVersionUpgrade service event, scheduling the protocol upgrade at view V.
    • Nodes which are not upgraded when we enter view V will halt.
    • Before entering view V, we must have a substantial majority (>> supermajority) of SNs and LNs upgraded for the network to remain live (this is the only liveness-critical step!)
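
As a rough illustration of the view-based activation in step 3 (names and the halting mechanism here are assumptions for illustration; the actual ProtocolVersionUpgrade handling in flow-go is more involved):

```go
package upgrade // illustrative sketch, not an actual flow-go package

import "errors"

// enforceUpgrade sketches the behaviour described in step 3: once view V is
// reached, a node that does not run the upgraded software halts rather than
// continuing on an incompatible protocol version. All names are illustrative.
func enforceUpgrade(currentView, upgradeView uint64, runsUpgradedSoftware bool) error {
	if currentView < upgradeView {
		// Before view V, upgraded and non-upgraded nodes coexist.
		return nil
	}
	if !runsUpgradedSoftware {
		return errors.New("protocol version upgrade reached: node not upgraded, halting")
	}
	return nil
}
```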