onflow / flow-evm-gateway

FlowEVM Gateway implements an Ethereum-equivalent JSON-RPC API for EVM clients to use
https://developers.flow.com/evm/about
Apache License 2.0

Gateway stops indexing new blocks #590

Closed by j1010001 3 weeks ago

j1010001 commented 1 month ago

Problem

Over the last few days we had multiple occurrences of the EVM GW stopping block indexing. Examples: 1) https://flow-foundation.slack.com/archives/C014WBGR1J9/p1727268188292399 2) https://flow-foundation.slack.com/archives/C014WBGR1J9/p1727227388238869

The Flow process is still running when this happens. Restarting the Flow process on the Gateway temporarily resolves the problem.

peterargue commented 1 month ago

We captured some data from the most recent hang. Besides the halt in indexing, the key indicator that there is an issue is that the number of goroutines starts to climb: [Image]

The goroutine profile shows that they are waiting to acquire a lock while creating new subscriptions on the blocks provider: [Image]

This lock is also held when notifying subscriptions about published messages: https://github.com/onflow/flow-evm-gateway/blob/44112e1b0d448c6200763238d7b67a837f361865/models/stream.go#L20-L27

The subscription's Notify method previously did a blocking write of the error to the error channel: https://github.com/onflow/flow-evm-gateway/blob/0fde502e9174e4b0166b60707b0573f949e0a57c/models/stream.go#L58-L63

That has since been fixed (PR) https://github.com/onflow/flow-evm-gateway/blob/44112e1b0d448c6200763238d7b67a837f361865/models/stream.go#L60-L70

Here is the code that deadlocks: https://github.com/onflow/flow-evm-gateway/blob/47ecbea56a0f2ea223d38b5af372099fd20feabd/api/stream.go#L148-L170

The deadlock can happen if the client disconnects and, before Unsubscribe() is called, the same subscription encounters an error. At that point there is no listener on the error channel, so the in-flight call to Notify() blocks while the lock is held. That in turn blocks Unsubscribe(), preventing either goroutine from exiting. New subscriptions then block in Subscribe(), which is the source of the goroutine leak. Finally, the block indexer uses this same publisher, so calls to Publish() within the block ingestion logic also block, which is why indexing stops.