chriswessels closed this issue 5 months ago
Next step will be to collect `DEBUG` logs on the Substreams server when the stall happens. The `tier1` logs filtered by the trace ID of the request (my version and 0.33 of `graph-node` should both log it on re-connection) will be enough as a starting point.

From the progress, we see that there are no `running jobs` anymore, but it seems nothing is sent afterwards. We will need the Substreams server logs to correlate what is happening.

I've also slightly tweaked the `Next response received` log to also print the message received, which will help with stream visibility. The image `graphprotocol/graph-node:canary-investigate-stuck-substreams-ccac28fd4` should be available soon (CI is running right now).
@maoueh Attached below are graph-node logs and Substreams server logs that correspond to the time when the stall happened. I have attached additional logs in case you need more data.
graphnode.log, substreams_stall.txt (only trace_id `6f1bf5a6f4f593b59fd9c3a35789fbd3`), substreams_tier1.txt (only tier1), substreams.log (all)
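The filtered attachments above were produced by keeping only the log lines for one trace ID. As a minimal sketch of that kind of filtering (the file names and trace ID come from the comment; the code itself is illustrative, not part of graph-node or substreams):

```rust
use std::io::{self, BufRead, Write};

/// Copy only the lines that contain the given trace ID from `input` to `output`.
fn filter_by_trace_id(
    input: impl BufRead,
    mut output: impl Write,
    trace_id: &str,
) -> io::Result<()> {
    for line in input.lines() {
        let line = line?;
        if line.contains(trace_id) {
            writeln!(output, "{}", line)?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // e.g. `filter < substreams.log > substreams_stall.txt`
    let stdin = io::stdin();
    let stdout = io::stdout();
    filter_by_trace_id(stdin.lock(), stdout.lock(), "6f1bf5a6f4f593b59fd9c3a35789fbd3")
}
```

The same effect can of course be had with `grep`; the point is only that the per-trace files are a straight line filter over the full server log.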
@jhjhjh94 Thank you, this seems to be a bit different from your last graph-node logs. Indeed, I see `running_jobs` here, which means work is being done (but maybe slowly):
```
ModulesProgress { running_jobs: [
    Job { stage: 4, start_block: 18332000, stop_block: 18333000, processed_blocks: 517, duration_ms: 233705 },
    Job { stage: 3, start_block: 18332000, stop_block: 18333000, processed_blocks: 747, duration_ms: 233695 },
    Job { stage: 2, start_block: 18332000, stop_block: 18333000, processed_blocks: 877, duration_ms: 233686 },
    Job { stage: 1, start_block: 18332000, stop_block: 18333000, processed_blocks: 898, duration_ms: 233677 },
    Job { stage: 0, start_block: 18332000, stop_block: 18333000, processed_blocks: 894, duration_ms: 233668 },
    Job { stage: 0, start_block: 18333000, stop_block: 18333002, processed_blocks: 0, duration_ms: 233659 }
], modules_stats: [], stages: [
    Stage {
        modules: ["map_pools_created", "store_pools_created", "store_tokens", "store_pool_count"],
        completed_ranges: [BlockRange { start_block: 12369621, end_block: 18332000 }]
    },
    Stage {
        modules: ["map_extract_data_types", "map_tokens_whitelist_pools", "store_pool_liquidities", "store_native_amounts", "store_tokens_whitelist_pools", "store_prices", "store_token_tvl", "store_total_tx_counts", "store_pool_sqrt_price", "store_positions", "store_ticks_liquidities"],
        completed_ranges: [BlockRange { start_block: 12369621, end_block: 18332000 }]
    },
    Stage {
        modules: ["store_eth_prices"],
        completed_ranges: [BlockRange { start_block: 12369621, end_block: 18332000 }]
    },
    Stage {
        modules: ["store_derived_tvl", "store_swaps_volume", "store_min_windows", "store_max_windows"],
        completed_ranges: [BlockRange { start_block: 12369621, end_block: 18332000 }]
    },
    Stage {
        modules: ["store_derived_factory_tvl"],
        completed_ranges: [BlockRange { start_block: 12369621, end_block: 18332000 }]
    },
    Stage {
        modules: ["graph_out"],
        completed_ranges: []
    }],
```
You can see here that the running jobs are still crunching, so work is in progress. While it "seems" stuck, it's not really in this case. It could be seen as "still working" more than "stuck".
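To make the "still working" reading concrete, each `Job` in the dump carries enough data to derive completion and throughput. A small sketch (the struct mirrors the debug output above; it is not the actual substreams type):

```rust
/// Hypothetical mirror of the `Job` entries in the `ModulesProgress` dump above.
struct Job {
    stage: u32,
    start_block: u64,
    stop_block: u64,
    processed_blocks: u64,
    duration_ms: u64,
}

impl Job {
    /// Fraction of the job's block range processed so far (0.0..=1.0).
    fn completion(&self) -> f64 {
        let total = (self.stop_block - self.start_block) as f64;
        if total == 0.0 { 1.0 } else { self.processed_blocks as f64 / total }
    }

    /// Rough throughput in blocks per second.
    fn blocks_per_sec(&self) -> f64 {
        self.processed_blocks as f64 / (self.duration_ms as f64 / 1000.0)
    }
}

fn main() {
    // First job from the dump: stage 4, 18332000..18333000, 517 blocks in 233705 ms.
    let job = Job {
        stage: 4,
        start_block: 18_332_000,
        stop_block: 18_333_000,
        processed_blocks: 517,
        duration_ms: 233_705,
    };
    println!(
        "stage {}: {:.0}% done, {:.2} blocks/s",
        job.stage,
        job.completion() * 100.0,
        job.blocks_per_sec()
    );
}
```

For the first job this gives roughly 52% done at about 2.2 blocks/s, which is slow but clearly not stalled.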
For the presentation of the actual progress (@zorancv @azf20), I'm thinking maybe we could have a succinct presentation at `INFO` and a deeper one at `DEBUG`/`TRACE`. We could define the important pieces at `INFO` to be the stages (ID + completed range) and the running jobs (processed range, `Job (s0) 18,000,000 | +573/1000` maybe as a compact representation).
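A sketch of what producing that compact representation could look like. The exact format string is an assumption based on the example in the comment (`Job (s0) … | +processed/total`), not an actual graph-node API:

```rust
/// Insert ',' every three digits from the right, e.g. 18332000 -> "18,332,000".
fn with_thousands(n: u64) -> String {
    let s = n.to_string();
    let bytes = s.as_bytes();
    let mut out = String::new();
    for (i, b) in bytes.iter().enumerate() {
        if i > 0 && (bytes.len() - i) % 3 == 0 {
            out.push(',');
        }
        out.push(*b as char);
    }
    out
}

/// Hypothetical compact one-line form of a running job for INFO-level logs.
fn compact_job(stage: u32, start_block: u64, stop_block: u64, processed: u64) -> String {
    format!(
        "Job (s{}) {} | +{}/{}",
        stage,
        with_thousands(start_block),
        processed,
        stop_block - start_block
    )
}

fn main() {
    // Stage-0 job from the progress dump above.
    println!("{}", compact_job(0, 18_332_000, 18_333_000, 894));
    // → Job (s0) 18,332,000 | +894/1000
}
```

One line per running job keeps `INFO` scannable, while the full `ModulesProgress` dump can stay at `DEBUG`/`TRACE`.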
I've also rebased my debugging branch on top of `v0.33.0`; the Docker image should be available with tag `canary-investigate-stuck-substreams-v0.33.0-2dbfca2d7`.
@maoueh this PR was merged which should help with monitoring here https://github.com/graphprotocol/graph-node/pull/4935 (thanks @zorancv ) Are there open threads on this issue?
> Are there open threads on this issue?
I don't understand the question, which "thread"?
@maoueh I meant are there any open investigations on this issue / what are the next steps?
I sent some notes to @maoueh in a discussion thread in Slack. This shows a substreams stall without any subgraph being involved, with debugging enabled on the server.
My test case (referencing a Pinax local tier1 node):

```
substreams-sink-noop wax-sfst85:9000 https://github.com/pinax-network/substreams-atomicmarket/releases/download/v0.3.0/atomicmarket-v0.3.0.spkg graph_out 32848800: --plaintext
```
Looking at the screenshot, we can see that results are sent at the start and then stop being sent, while the server keeps processing data.
I need more direction to understand what is needed for next steps of debugging.
@azf20 We were waiting on `DEBUG` logs from the production service that runs the Substreams, but it seems the logs Matthew gave us might contain the information we need. Now we need someone internally to check them more thoroughly.
There is a bug pending a fix from the BSC dev team which might have caused some issues; however, there is currently a workaround: https://github.com/bnb-chain/bsc/issues/2212
I believe this issue is reproducible by simply using the `substreams` CLI (i.e. it is not a graph-node issue). I sent a `DEBUG` log. From my reading of the log, it does not seem to explain why things are stuck.
New substreams release today. Let's see if there is some improvement here.
Some improvements, but substreams still get stuck.
It feels to me the problem is the handoff from processing historical blocks to live blocks when the substream has no "store" module.
GitHub issue for the above: https://github.com/streamingfast/substreams/issues/421
I suggest closing this issue and using the substreams issue above to continue this discussion.
Bug report
Reports from StakeSquid, Pinax, and Data Nexus all confirm that substreams-powered subgraphs periodically stop processing blocks and silently fail. Restarting graph-node reportedly fixes it.
Relevant log output
No response
IPFS hash
No response
Subgraph name or link to explorer
No response
Some information to help us out
OS information
None