[Bug] Various reports that substream-powered-subgraphs silently stop processing blocks

chriswessels commented 1 year ago

Bug report

Reports from StakeSquid, Pinax, Data Nexus all confirm that substream-powered-subgraphs periodically stop processing blocks and silently fail. Reports that restarting graph-node fixes it.

Relevant log output

No response

IPFS hash

No response

Subgraph name or link to explorer

No response

Some information to help us out

[ ] Tick this box if this bug is caused by a regression found in the latest release.
[ ] Tick this box if this bug is specific to the hosted service.
[X] I have searched the issue tracker to make sure this issue is not a duplicate.

OS information

None

jhjhjh94 commented 11 months ago

Next step will be to collect DEBUG logs on the Substreams server when the stall happen, tier1 logs filtered by trace ID of the request (my version and 0.33 of graph-node should both logs it on re-connection) will be enough as a starting point.

From the progress, we see that there is no running jobs anymore but data seems that nothing is then sent. We will need the Substreams server logs to correlate what is happening.

I've also slightly tweaked the Next response received to also print the message received, so will help with the stream visibility. The image graphprotocol/graph-node:canary-investigate-stuck-substreams-ccac28fd4 should be available soon (CI running right now).

@maoueh Attached below are graphnode logs and substreams server logs that corresponds to the time when the stall happened. I have attached additional logs in case you need more data.

graphnode.log substreams_stall.txt (Only trace_id 6f1bf5a6f4f593b59fd9c3a35789fbd3) substreams_tier1.txt (Only tier1) substreams.log (All)

maoueh commented 11 months ago

@jhjhjh94 Thank you, this seems to be a bit different than your last graph node logs. Indeed, I see running_jobs here which means works is being done (but maybe slowly):

ModulesProgress { running_jobs: [
    Job { stage: 4, start_block: 18332000, stop_block: 18333000, processed_blocks: 517, duration_ms: 233705 }, 
    Job { stage: 3, start_block: 18332000, stop_block: 18333000, processed_blocks: 747, duration_ms: 233695 }, 
    Job { stage: 2, start_block: 18332000, stop_block: 18333000, processed_blocks: 877, duration_ms: 233686 }, 
    Job { stage: 1, start_block: 18332000, stop_block: 18333000, processed_blocks: 898, duration_ms: 233677 }, 
    Job { stage: 0, start_block: 18332000, stop_block: 18333000, processed_blocks: 894, duration_ms: 233668 }, 
    Job { stage: 0, start_block: 18333000, stop_block: 18333002, processed_blocks: 0, duration_ms: 233659 }
], modules_stats: [], stages: [
Stage { 
    modules: ["map_pools_created", "store_pools_created", "store_tokens", "store_pool_count"], 
    completed_ranges: [BlockRange { start_block: 12369621, end_block: 18332000 }] 
}, 
Stage { 
    modules: ["map_extract_data_types", "map_tokens_whitelist_pools", "store_pool_liquidities", "store_native_amounts", "store_tokens_whitelist_pools", "store_prices", "store_token_tvl", "store_total_tx_counts", "store_pool_sqrt_price", "store_positions", "store_ticks_liquidities"],     
    completed_ranges: [BlockRange { start_block: 12369621, end_block: 18332000 }] }, 
Stage { 
    modules: ["store_eth_prices"],     
    completed_ranges: [BlockRange { start_block: 12369621, end_block: 18332000 }] }, 
Stage { 
    modules: ["store_derived_tvl", "store_swaps_volume", "store_min_windows", "store_max_windows"],     
    completed_ranges: [BlockRange { start_block: 12369621, end_block: 18332000 }] }, 
Stage { 
    modules: ["store_derived_factory_tvl"],     
    completed_ranges: [BlockRange { start_block: 12369621, end_block: 18332000 }] }, 
Stage { 
    modules: ["graph_out"],     
    completed_ranges: [] 
}],

You can see here the running jobs they are still crunching so work is in progress. While it "seems" stuck, it's not really in the case. Could be seen as "still working" more than "stuck".

maoueh commented 11 months ago

For the presentation of the actual progress @zorancv @azf20 message, I’m thinking maybe that we could have a succinct presentation in INFO and a deeper one in DEBUG/TRACE. We could define the important pieces on INFO to be stages (ID + completed range) and running_jobs (processed range Job (s0) 18,000,000 | +573/1000 maybe as a compact representation).

I've also rebase my debugging branch on top of v0.33.0, Docker image should is available with tag canary-investigate-stuck-substreams-v0.33.0-2dbfca2d7

azf20 commented 10 months ago

@maoueh this PR was merged which should help with monitoring here https://github.com/graphprotocol/graph-node/pull/4935 (thanks @zorancv ) Are there open threads on this issue?

maoueh commented 10 months ago

Are there open threads on this issue?

I don't understand the question, which "thread"?

azf20 commented 10 months ago

@maoueh I meant are there any open investigations on this issue / what are the next steps?

matthewdarwin commented 10 months ago

I sent some notes to @maoueh in a discussion thread in slack. This shows substreams stall without any subgraph being involved, with debugging enabled on the server.

My testcase (referencing a pinax local tier1 node):

substreams-sink-noop wax-sfst85:9000 https://github.com/pinax-network/substreams-atomicmarket/releases/download/v0.3.0/atomicmarket-v0.3.0.spkg graph_out 32848800: --plaintext

Looking at the screenshot, we can see results are sent at the start and then results stop being sent, while the server keeps processing data.

I need more direction to understand what is needed for next steps of debugging.

maoueh commented 10 months ago

@azf20 We were waiting on DEBUG logs from the production service that runs the Substreams but Matthew seems that the logs he gave us might contain the information we need.

Now we need someone internally to check that more thoroughly.