paradigmxyz / reth

Modular, contributor-friendly and blazing-fast implementation of the Ethereum protocol, in Rust
https://reth.rs/
Apache License 2.0
3.47k stars 888 forks

ExEx pipeline batch ETL #8917

Open frisitano opened 1 week ago

frisitano commented 1 week ago

Describe the feature

In certain contexts an ExEx requires access to comprehensive data from all pipeline stages, including hashed state, merkle, transaction lookup and history. Currently the pipeline runs its stages serially, meaning that when an ExExEvent is emitted from the execution stage the ExEx does not yet have access to this comprehensive data.

To address this, I propose introducing a new operational mode in which the chain pipeline sync is batched: all stages of the pipeline are run for a batch of a configurable number of blocks, n, and the pipeline then waits for the finished_height of the ExExManager to reach the end of the batch before proceeding to the next batch. This mode would be optional and as such should not impact the standard sync mechanics currently in place. I can put together a draft PR for this if there is support for the feature.
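To make the proposal concrete, here is a minimal sketch of the batch loop described above. All of the types (BatchConfig, ExExManagerHandle) are hypothetical stand-ins, not reth APIs; the real mechanism would await ExExEvent::FinishedHeight rather than simulate it.

```rust
/// Hypothetical config for the proposed mode: run all stages for
/// `batch_size` blocks, then wait for ExExes before the next batch.
struct BatchConfig {
    batch_size: u64,
}

/// Stand-in for the ExExManager's view of the highest block that every
/// ExEx has finished processing.
struct ExExManagerHandle {
    finished_height: u64,
}

impl ExExManagerHandle {
    fn finished_height(&self) -> u64 {
        self.finished_height
    }
    /// In the real manager this would be driven by FinishedHeight events;
    /// here we just advance it to simulate the ExExes catching up.
    fn simulate_progress(&mut self, height: u64) {
        self.finished_height = height;
    }
}

/// Run the pipeline in batches up to `target`: all stages execute for each
/// batch of blocks, then we block until the ExExes reach the batch end.
/// Returns the (start, end) ranges of each batch for illustration.
fn run_batched(
    config: &BatchConfig,
    target: u64,
    exex: &mut ExExManagerHandle,
) -> Vec<(u64, u64)> {
    let mut batches = Vec::new();
    let mut start = 1;
    while start <= target {
        let end = (start + config.batch_size - 1).min(target);
        // 1. Run *all* pipeline stages (execution, hashing, merkle,
        //    history, tx lookup) for blocks start..=end. Elided here.
        batches.push((start, end));
        // 2. Wait for every ExEx to signal it has processed up to `end`.
        exex.simulate_progress(end); // stand-in for awaiting FinishedHeight
        assert!(exex.finished_height() >= end);
        start = end + 1;
    }
    batches
}

fn main() {
    let config = BatchConfig { batch_size: 100 };
    let mut exex = ExExManagerHandle { finished_height: 0 };
    let batches = run_batched(&config, 250, &mut exex);
    println!("{:?}", batches); // [(1, 100), (101, 200), (201, 250)]
}
```

The key property is that stage data for a batch is fully persisted before the ExExes are unblocked, so an ExEx can read merkle and history data for any block in the batch from the database rather than from the event payload.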

Additional context

No response

onbjerg commented 1 week ago

Can you elaborate a bit more on a use case, and how you would tackle implementing this? As I see it, we already send all of the data in the event, excluding merkle data, from the pipeline, so this seems like a lot of complexity added on top. Furthermore, in order to send all of this data in an event, the stages would individually need to hold on to the data they generated in a batch in memory until the entire pipeline finishes, which seems very expensive.

frisitano commented 1 week ago

In terms of the use case: I want to build an ExEx that can generate block traces which can be used as input for the Polygon type 1 zkEVM prover. This would enable us to use reth as a sequencer node for a type 1 zkEVM rollup. A component of the block trace is the state witness. The state witness includes Merkle witness data for all state reads / writes which occur during block execution. As such, we require Merkle data to generate this witness.

I would suggest that, to keep the surface area of this feature minimised, we do not change the event payload or the way in which the stages operate. If an ExEx needs to access any of this additional data then it can use the provider API to access it (self.ctx.provider().state_by_block_hash(block_hash)), which I believe uses changesets under the hood to construct the historical state.
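As a rough illustration of this access pattern, here is a simplified mock of a provider exposing historical state by block hash. The types and the in-memory backing store are stand-ins for illustration only; reth's real provider reconstructs this state from changesets rather than storing it per block.

```rust
use std::collections::HashMap;

type BlockHash = [u8; 32];
type Address = [u8; 20];

/// Simplified historical state: account balances at a given block.
struct HistoricalState {
    balances: HashMap<Address, u128>,
}

/// Mock provider. In reth the historical state would be reconstructed
/// from changesets under the hood, not kept in a map.
struct Provider {
    states: HashMap<BlockHash, HistoricalState>,
}

impl Provider {
    /// Analogous to the `state_by_block_hash(block_hash)` call mentioned
    /// above: look up the state as of a specific block.
    fn state_by_block_hash(&self, hash: &BlockHash) -> Option<&HistoricalState> {
        self.states.get(hash)
    }
}

fn main() {
    // Populate one block's worth of state in the mock provider.
    let block_hash = [1u8; 32];
    let addr = [2u8; 20];
    let mut balances = HashMap::new();
    balances.insert(addr, 1_000u128);
    let mut states = HashMap::new();
    states.insert(block_hash, HistoricalState { balances });
    let provider = Provider { states };

    // The ExEx pulls whatever extra data it needs on demand, keeping the
    // event payload unchanged.
    let state = provider.state_by_block_hash(&block_hash).expect("known block");
    println!("balance at block: {:?}", state.balances.get(&addr));
}
```

The benefit of this shape is that the event stays small: only ExExes that actually need hashed state, merkle, or history data pay the cost of fetching it.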

There are two core considerations for the design:

Thinking about this more, it could make more sense to have the option to disable the pipeline completely and only use the blockchain tree to sync for this particular use case instead of trying to make changes to the pipeline.