We captured some data from the most recent hang. Besides the halt in indexing, the key indicator that there is an issue is that goroutines start to climb.
The goroutine profile shows they are waiting to acquire a lock while creating new subscriptions on the blocks provider.
This lock is also held when notifying subscriptions about published messages: https://github.com/onflow/flow-evm-gateway/blob/44112e1b0d448c6200763238d7b67a837f361865/models/stream.go#L20-L27
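For context, the pattern looks roughly like the sketch below (names are illustrative and based on a reading of the linked models/stream.go, not copied from it): one mutex guards both subscription management and publishing, and it stays held while each subscription's Notify runs.

```go
// Simplified sketch of the publisher/subscription pattern (illustrative,
// not the exact gateway code).
package stream

import "sync"

type Subscription interface {
	Notify(msg any)
}

type Publisher struct {
	mu   sync.RWMutex
	subs map[Subscription]struct{}
}

func NewPublisher() *Publisher {
	return &Publisher{subs: map[Subscription]struct{}{}}
}

func (p *Publisher) Subscribe(s Subscription) {
	p.mu.Lock() // waits while Publish is notifying below
	defer p.mu.Unlock()
	p.subs[s] = struct{}{}
}

func (p *Publisher) Unsubscribe(s Subscription) {
	p.mu.Lock() // likewise waits while Publish is notifying
	defer p.mu.Unlock()
	delete(p.subs, s)
}

func (p *Publisher) Publish(msg any) {
	p.mu.RLock() // held for the full duration of every Notify call
	defer p.mu.RUnlock()
	for s := range p.subs {
		s.Notify(msg) // if Notify blocks, the lock is never released
	}
}
```

With this shape, a Notify call that never returns keeps the lock held, and every later Subscribe, Unsubscribe, and Publish queues up behind it, which is why goroutines accumulate.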
The subscription's Notify method previously had a blocking write of the error to the error channel: https://github.com/onflow/flow-evm-gateway/blob/0fde502e9174e4b0166b60707b0573f949e0a57c/models/stream.go#L58-L63
That has since been fixed (PR): https://github.com/onflow/flow-evm-gateway/blob/44112e1b0d448c6200763238d7b67a837f361865/models/stream.go#L60-L70
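The shape of that fix is essentially a plain channel send replaced by a guarded one. A minimal sketch, assuming an unbuffered error channel (illustrative only; the actual change in the linked code may differ in detail):

```go
package stream

import "log"

// ErrorSubscription is an illustrative stand-in for the gateway's subscription type.
type ErrorSubscription struct {
	errChan chan error
}

// Before: a plain send blocks forever once the client has disconnected
// and nothing is draining errChan.
func (s *ErrorSubscription) notifyErrorBlocking(err error) {
	s.errChan <- err
}

// After (one common shape of the fix): the send can no longer wedge the
// caller, so the publisher lock is not held indefinitely.
func (s *ErrorSubscription) notifyErrorNonBlocking(err error) {
	select {
	case s.errChan <- err:
	default:
		log.Printf("dropping subscription error, no active reader: %v", err)
	}
}
```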
Here is the code that deadlocks: https://github.com/onflow/flow-evm-gateway/blob/47ecbea56a0f2ea223d38b5af372099fd20feabd/api/stream.go#L148-L170
The deadlock can happen if the client disconnects and, before Unsubscribe() is called, the same subscription encounters an error. At that point there is no listener, so the pending call to Notify() blocks, which in turn blocks Unsubscribe(), preventing either goroutine from exiting. New subscriptions then block in Subscribe(), which is the source of the goroutine leak. Finally, the block indexer uses this same publisher, so calls to Publish() within the block ingestion logic also block.
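The chain can be reproduced with a small, self-contained toy program (hypothetical names that only mirror the shape of the gateway code):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// sub stands in for an API-level subscription whose error channel no longer
// has a reader because the client already disconnected.
type sub struct{ errCh chan error }

func (s *sub) Notify(err error) { s.errCh <- err } // blocks forever with no reader

type publisher struct {
	mu   sync.RWMutex
	subs map[*sub]struct{}
}

func (p *publisher) Subscribe(s *sub) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.subs[s] = struct{}{}
}

func (p *publisher) Unsubscribe(s *sub) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.subs, s)
}

func (p *publisher) Publish(err error) {
	p.mu.RLock() // held while notifying every subscriber
	defer p.mu.RUnlock()
	for s := range p.subs {
		s.Notify(err)
	}
}

func main() {
	p := &publisher{subs: map[*sub]struct{}{}}
	gone := &sub{errCh: make(chan error)} // unbuffered, nobody reads it

	p.Subscribe(gone)

	go p.Publish(errors.New("stream error")) // 1. Notify blocks, read lock stays held
	time.Sleep(100 * time.Millisecond)

	go p.Unsubscribe(gone)                        // 2. blocks waiting for the write lock
	go p.Subscribe(&sub{errCh: make(chan error)}) // 3. new Subscribe blocks too: goroutine leak
	go p.Publish(errors.New("next block"))        // 4. block ingestion's Publish stalls as well

	time.Sleep(500 * time.Millisecond)
	fmt.Println("all four goroutines are stuck; only a restart clears the wedge")
}
```

The four spawned goroutines never exit, matching the climbing goroutine count observed on the gateway.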
Problem
Over the last few days we had multiple occurrences of the EVM GW stopping block indexing. Examples:
1) https://flow-foundation.slack.com/archives/C014WBGR1J9/p1727268188292399
2) https://flow-foundation.slack.com/archives/C014WBGR1J9/p1727227388238869
The Flow process is still running; restarting the flow process on the Gateway temporarily resolves the problem.