onflow / flow-evm-gateway

FlowEVM Gateway implements an Ethereum-equivalent JSON-RPC API for EVM clients to use
https://developers.flow.com/evm/about
Apache License 2.0

Gateway crashes while bootstrapping from out of sync AN #488

Open peterargue opened 2 weeks ago

peterargue commented 2 weeks ago

Problem

I am testing the process for running a gateway with a local dedicated access node. When the gateway is started while the access node (AN) is still catching up with the network, it syncs up to the AN's latest indexed block, then panics.

Here's the log output from the GW from my testing

{"level":"info","component":"ingestion","hash":"0xfffc5551f456fce67bdba6c253a26d51bf61f311c3b8361860a4dfcfe6d48c7e","evm-height":2200940,"cadence-height":213377610,"cadence-id":"3ae50bc701976b72430ca702139267e9af7aa8209f851c93befcb1eee409a7a5","parent-hash":"0x7234d6cfe3e85d79f66715aee3db45980952d99ca19da35a9d2e48e4cf9673cf","tx-hashes-root":"0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421","time":"2024-08-28T18:55:42Z","message":"new evm block executed event"}
{"level":"error","error":"failed to create EVM requester: could not fetch the configured COA account: 62631c28c9fc5a91 make sure it exists: client: rpc error: code = OutOfRange desc = failed to get account from the execution node: 3 errors occurred:\n\t* rpc error: code = OutOfRange desc = state for block ID 075346f6d8cc582093e309136241914fa8fbf16af02ddfe08b076b66d00938a6 not available\n\t* rpc error: code = OutOfRange desc = state for block ID 075346f6d8cc582093e309136241914fa8fbf16af02ddfe08b076b66d00938a6 not available\n\t* rpc error: code = OutOfRange desc = state for block ID 075346f6d8cc582093e309136241914fa8fbf16af02ddfe08b076b66d00938a6 not available\n\n","time":"2024-08-28T18:55:42Z","message":"failed to start the API server"}
panic: failed to create EVM requester: could not fetch the configured COA account: 62631c28c9fc5a91 make sure it exists: client: rpc error: code = OutOfRange desc = failed to get account from the execution node: 3 errors occurred:
        * rpc error: code = OutOfRange desc = state for block ID 075346f6d8cc582093e309136241914fa8fbf16af02ddfe08b076b66d00938a6 not available
        * rpc error: code = OutOfRange desc = state for block ID 075346f6d8cc582093e309136241914fa8fbf16af02ddfe08b076b66d00938a6 not available
        * rpc error: code = OutOfRange desc = state for block ID 075346f6d8cc582093e309136241914fa8fbf16af02ddfe08b076b66d00938a6 not available

It appears that the first request to the AN for a block it has not yet indexed is forwarded to an execution node (EN). However, since the AN is behind, the EN has already pruned the data for this block and responds with an OutOfRange code. The gateway then panics.

In this case, I think the gateway should pause and retry.

sideninja commented 2 weeks ago

At this point I don't think the EVM GW should communicate with ANs that are not synced. The problem you are experiencing is that the account you set as the COA is not found, and this makes the gateway panic. I don't think it's beneficial for the GW to handle such a case gracefully or retry; it will just add complexity. Please correct me if I'm wrong, but I don't see a benefit in real-world usage to the GW communicating with an out-of-sync AN.

peterargue commented 2 weeks ago

It seems that it panics when it fails to get an account from the AN, because the AN has not yet indexed the data. It's definitely possible for an AN to fall behind on indexing, or be restarted. Are you saying that this panic only happens when the node's COA has not yet been loaded?

If the GW requires that the AN is fully synced with the network before starting, I think that should be explicitly stated somewhere in the setup docs, since I think a common use case will be to run the GW with a local AN.

m-Peter commented 2 weeks ago

> it seems that it panics when it fails to get an account from the AN, because the AN has not yet indexed the data. It's definitely possible for an AN to fall behind on indexing, or be restarted. Are you saying that this panic only happens in the case when node's COA has not yet been loaded?

The error is coming from this check right here: https://github.com/onflow/flow-evm-gateway/blob/main/services/requester/requester.go#L130-L137. This might be because the AN has not yet indexed the latest data, or it can simply be because of a wrong Flow address provided in the corresponding bootstrap flag.

m-Peter commented 2 weeks ago

> If the GW requires that the AN is fully synced with the network before starting, I think that should be explicitly stated somewhere in the setup docs since I think a common usecase will be to run the GW with a local AN.

It does not have to be strictly fully synced with the network before starting, but in order to be operational, it should be synced to the block height at which the configured COA was created. For testnet specifically, it should have indexed block height 211176670, because this is where the EVM contract was first deployed to testnet.

peterargue commented 2 weeks ago

OK. I'll leave it up to you what does or doesn't need to be documented. I know the GW is still in development and may have some rough edges. This came up when I was testing the setup, and it was non-obvious what went wrong and what I should do to resolve it.

m-Peter commented 2 weeks ago

We'll certainly add dedicated sections to the README, including the specific block height at which the EVM contract was first deployed and any other details that are non-obvious, especially for connecting to a self-run AN. I have already opened https://github.com/onflow/flow-evm-gateway/pull/500, so that we don't ignore errors coming from COA creation, which is an important part of bootstrapping.

sideninja commented 1 week ago

I believe PR #500 improves this situation, so the only remaining task is to update the documentation.