onflow / flow-go

A fast, secure, and developer-friendly blockchain built to support the next generation of games, apps, and the digital assets that power them.
GNU Affero General Public License v3.0

Backwards-compatibility check automation #6557

Open j1010001 opened 1 month ago

j1010001 commented 1 month ago

Why (Objective)

To deliver on the promises of the Crescendo release (and in accordance with our branching strategy), future Flow releases must be backwards-compatible. This automation will check the execution results of a Flow pre-release (including a Cadence pre-release) against prod and alert the engineering team if a difference in execution is detected.

How will we measure success (Key Results)

Effort estimate

TBD - @sjonpaulbrown, @zhangchiqing

DACI

Role       Assigned
Driver     @sjonpaulbrown
Approver   @Kay-Zee
Consulted  @zhangchiqing
Informed   Flow Protocol engineering

sjonpaulbrown commented 1 week ago

@zhangchiqing & @j1010001, as part of a separate initiative, we have developed a batch execution solution that enables us to reliably create VMs & disks from EN disk snapshots. In an on-demand or automated fashion, this enables us to reliably run a batch process over EN data.

To deliver on this work, we are going to need to add support for the following:

  1. A GitHub workflow that builds the prerequisites (utils and/or images) and executes a batch process
  2. A new bash script that handles the rollback/re-execution of the data inside the batch process

The documentation for the batch process execution can be found here.

turbolent commented 6 days ago

Thank you very much for working on this @sjonpaulbrown! 🙏

I had a look at the part about the scripting – how could we use this to re-execute a certain range of transactions (e.g. "the last 1000 blocks") twice (current + modified code, e.g. a different Cadence version) and compare the results? I guess we would still need to develop this part?

I'm not really familiar with the EN. Maybe it would have a way to boot off the state of the disk, and we would build the EN software twice (current + modified) and query it for its execution results?

bluesign commented 4 days ago

I am not sure that booting an EN and re-running transactions is needed here, tbh. It feels like a bit of a nuclear option.

@zhangchiqing knows better, but a few options come to mind:

sjonpaulbrown commented 3 days ago

@turbolent & @bluesign, currently, we support this by running an EN in an isolated network that limits external communication over libp2p. @zhangchiqing manually rolls back the transactions and starts the node with the new software to ensure that all blocks are re-executed properly.

The batch job execution is a solution tailored to the current approach that @zhangchiqing has taken. The actual script/steps to re-execute the blocks would still need to be developed. Depending on the approach we take, additional work would be needed from me to further isolate this job, as we do today, to ensure that no traffic can be sent to the live network.

If there are alternative approaches to re-executing these blocks, @zhangchiqing is the best person to review these.

turbolent commented 3 days ago

@bluesign Thanks for having a look, proposing some options, and even implementing that proof of concept! 🙏

For now, we are looking for the simplest solution that is lowest-effort to implement and maintain, given that its purpose is just to be a tool used for releasing new versions of e.g. Cadence and FVM. Resource usage is also a factor in the consideration, but not as important.

I had a look at the replay PoC. Nice work! From what I understand, it

This is great, because that means we do not need a machine with the execution state, and can just fetch it on-the-fly from an AN. This should be sufficient for our purposes of ensuring backward compatibility of new component versions.

Does this code, especially the FVM configuration, fully match how the EN produces the execution data? Could we maybe reuse more of the engine/execution and fvm packages' code to make sure it matches exactly, and also keep it that way?
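The "fetch registers on-the-fly from an AN" approach described above could take roughly the following shape. This is a hypothetical sketch: `RegisterSource`, `NewCachingSource`, and the simplified `RegisterID` are illustrative names, not flow-go APIs (flow-go identifies registers by owner/key pairs and serves them from its register store).

```go
package main

import "fmt"

// RegisterID identifies a register in the execution state (simplified;
// a stand-in for flow-go's owner/key register identifiers).
type RegisterID struct {
	Owner, Key string
}

// RegisterSource returns the value of a register as of a given block height.
// In the replay setup discussed above, this would be backed by an Access Node
// query rather than a local trie.
type RegisterSource interface {
	GetRegister(height uint64, id RegisterID) ([]byte, error)
}

// cachingSource memoizes lookups so each register is fetched from the
// backend at most once per replayed height.
type cachingSource struct {
	backend RegisterSource
	cache   map[string][]byte
}

func NewCachingSource(backend RegisterSource) *cachingSource {
	return &cachingSource{backend: backend, cache: map[string][]byte{}}
}

func (c *cachingSource) GetRegister(height uint64, id RegisterID) ([]byte, error) {
	key := fmt.Sprintf("%d/%s/%s", height, id.Owner, id.Key)
	if v, ok := c.cache[key]; ok {
		return v, nil
	}
	v, err := c.backend.GetRegister(height, id)
	if err != nil {
		return nil, err
	}
	c.cache[key] = v
	return v, nil
}

// countingBackend stands in for the remote AN and counts how many network
// fetches would actually happen.
type countingBackend struct{ calls int }

func (b *countingBackend) GetRegister(height uint64, id RegisterID) ([]byte, error) {
	b.calls++
	return []byte("value"), nil
}

func main() {
	backend := &countingBackend{}
	src := NewCachingSource(backend)
	src.GetRegister(7, RegisterID{Owner: "0x1", Key: "storage"})
	src.GetRegister(7, RegisterID{Owner: "0x1", Key: "storage"})
	fmt.Println(backend.calls) // prints 1: the second read hits the cache
}
```

Caching matters here because remote register reads dominate replay time; with a storage-indexed AN the backend becomes a local index lookup instead of a network call.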

turbolent commented 3 days ago

@sjonpaulbrown I see, Jan had mentioned Leo's approach of re-executing transactions by removing execution receipts from the EN this morning, which supposedly would then also compare the re-executed results against the real/stored results from the blocks. Is that the case?

bluesign commented 3 days ago

> Does this code, especially the FVM configuration, fully match how the EN produces the execution data? Could we maybe reuse more of engine/execution and the fvm packages' code to make sure it matches it exactly and also keep it so.

Yeah, I searched for that before, but somehow they mostly come from the NodeBuilder, I guess. Some settings have defaults, but some of the defaults are not good (computation limit, memory limit, transaction fees, etc.).

Maybe one option is to fix the FVM defaults to more sensible values.

Btw, this is a bit of a PoC as I said, but I think this could also work well with an AN that has storage indexing enabled; then it would be super fast.

zhangchiqing commented 2 days ago

@bluesign Thanks for the ideas. I also have some ideas; let's compare them.

  1. Option 1: take a recent EN snapshot, roll back the executed height to a past block, build a new EN image with the new Cadence version, and re-execute all blocks.

    This is more of an integration test that can capture any change that might cause a problem. Since the EN has all the data locally, it can re-execute without sending any network requests.

  2. Option 2: take a recent EN snapshot and build a new util to re-execute the past blocks.

    This is similar to Option 1. The difference is that Option 1 runs a docker image, while this runs as a util command, which is more lightweight and easier to add debug logs to in case we need to debug. It is also easier for JP to integrate into the CI.

    However, it requires some effort to build this re-execute-block util. I tried this during the EVM gateway debugging, but it produced a different result; I might have to spend some more time to debug that.

  3. Option 3: take a recent AN snapshot, which has registers indexed locally, and build a new util to replay all blocks with the new Cadence version. As bluesign mentioned, this approach will be much slower; however, the advantage is that it uses far less memory, as it doesn't need to load the entire trie but uses the AN's register store to return registers.

  4. Option 4: take a recent EN snapshot and build a new util that verifies each chunk against the chunk data pack data for past blocks using the new Cadence version.

    This is a good idea. It also uses far less memory, as the chunk data pack can be used to build a partial trie, just as the VN does.

    Another advantage is that it doesn't depend on checkpoint files, so there is no need to roll back the execution state as Options 1 and 2 require. This makes it easier to rerun multiple times if needed.
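The chunk-verification idea in Option 4 can be sketched as follows. This is a simplified, hypothetical sketch: `ChunkDataPack`, `ReExecute`, and `VerifyChunk` are illustrative stand-ins (flow-go's real chunk data packs carry the chunk's collection and the proofs needed to build the partial trie), and the state commitments are reduced to strings.

```go
package main

import "fmt"

// ChunkDataPack is a simplified stand-in for the data a Verification Node
// uses: the state needed to execute one chunk plus the expected end-state
// commitment recorded on chain.
type ChunkDataPack struct {
	StartState string            // state commitment before the chunk
	EndState   string            // expected state commitment after the chunk
	Registers  map[string][]byte // partial state (partial trie) touched by the chunk
}

// ReExecute stands in for running the chunk's transactions with the new
// software version against the partial state, returning the resulting
// state commitment. It is hypothetical, not a flow-go API.
type ReExecute func(startState string, registers map[string][]byte) (string, error)

// VerifyChunk re-executes one chunk and reports whether the new software
// reproduces the stored end state, as in Option 4.
func VerifyChunk(pack ChunkDataPack, exec ReExecute) (bool, error) {
	got, err := exec(pack.StartState, pack.Registers)
	if err != nil {
		return false, fmt.Errorf("re-execution failed: %w", err)
	}
	return got == pack.EndState, nil
}

func main() {
	pack := ChunkDataPack{
		StartState: "s0",
		EndState:   "s1",
		Registers:  map[string][]byte{"reg": []byte("v")},
	}
	// A stub executor that reproduces the recorded end state.
	ok, _ := VerifyChunk(pack, func(start string, regs map[string][]byte) (string, error) {
		return "s1", nil
	})
	fmt.Println(ok) // prints true: the re-executed chunk matches the stored result
}
```

Because each chunk carries its own partial state and expected end state, chunks can be verified independently and rerun at will, which is exactly why Option 4 avoids the checkpoint rollback that Options 1 and 2 require.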

zhangchiqing commented 1 day ago

I implemented Option 4 with @bluesign's idea.

It was easy to implement, and the result is quite good.