Open zilayo opened 2 days ago
The same thing happened again later, from 06:55 to 07:04. 313,936,113 storage leaves were added this time (vs. 313,998,906 during the first memory spike above). This was on the second pipeline run, after the first pipeline had finished.
{"timestamp":"2024-11-13T06:55:28.084998Z","level":"INFO","fields":{"message":"Preparing stage","pipeline_stages":"8/12","stage":"MerkleExecute","checkpoint":"22328826","target":"22344678"},"target":"reth_node_events::node"}
{"timestamp":"2024-11-13T06:55:28.085011Z","level":"INFO","fields":{"message":"Executing stage","pipeline_stages":"8/12","stage":"MerkleExecute","checkpoint":"22328826","target":"22344678","stage_eta":"23m 12s"},"target":"reth_node_events::node"}
{"timestamp":"2024-11-13T06:55:28.085013Z","level":"DEBUG","fields":{"message":"Continuing inner merkle checkpoint","current":"22328826","target":"22344678","last_account_key":"0xbe61cbebdaec5b008fc3d1b0b63a7e0577f531e96ebcc1a494e2b70499d04f33"},"target":"sync::stages::merkle::exec","span":{"name":"ChainOrchestrator::poll"},"spans":[{"name":"ChainOrchestrator::poll"}]}
...
(no more reth_node_events / sync::stages logs during this period)
...
{"timestamp":"2024-11-13T07:04:42.778084Z","level":"INFO","fields":{"message":"Committed stage progress","pipeline_stages":"8/12","stage":"MerkleExecute","checkpoint":"22328826","target":"22344678","stage_progress":"76.60%","stage_eta":"8m 58s"},"target":"reth_node_events::node"}
{"timestamp":"2024-11-13T07:04:43.811289Z","level":"DEBUG","fields":{"message":"Event: Handler(ForkchoiceUpdated(ForkchoiceState { head_block_hash: 0xa2c7f361615d286c1f7bf542953c944f934c34f3ebb9de056794dfbcb5f88d29, safe_block_hash: 0x0000000000000000000000000000000000000000000000000000000000000000, finalized_block_hash: 0x0000000000000000000000000000000000000000000000000000000000000000 }, Syncing))"},"target":"reth::cli"}
{"timestamp":"2024-11-13T07:04:43.813734Z","level":"DEBUG","fields":{"message":"Commit","total_duration":"1.035686689s","commit_latency":"Some(CommitLatency(MDBX_commit_latency { preparation: 0, gc_wallclock: 57, audit: 0, write: 0, sync: 67814, ending: 0, whole: 67873, gc_cputime: 0, gc_prof: MDBX_commit_latency__bindgen_ty_1 { wloops: 0, coalescences: 0, wipes: 0, flushes: 0, kicks: 0, work_counter: 0, work_rtime_monotonic: 0, work_xtime_cpu: 0, work_rsteps: 0, work_xpages: 0, work_majflt: 0, self_counter: 0, self_rtime_monotonic: 0, self_xtime_cpu: 0, self_rsteps: 0, self_xpages: 0, self_majflt: 0 } }))","is_read_only":false},"target":"storage::db::mdbx","span":{"name":"ChainOrchestrator::poll"},"spans":[{"name":"ChainOrchestrator::poll"}]}
{"timestamp":"2024-11-13T07:04:43.813816Z","level":"INFO","fields":{"message":"Preparing stage","pipeline_stages":"8/12","stage":"MerkleExecute","checkpoint":"22328826","target":"22344678"},"target":"reth_node_events::node"}
{"timestamp":"2024-11-13T07:04:43.813825Z","level":"DEBUG","fields":{"message":"Continuing inner merkle checkpoint","current":"22328826","target":"22344678","last_account_key":"0xbe67e4c7cd9e0722c7a175135f72c02b5ec1ddba0f73dacde8e007defbe9fca9"},"target":"sync::stages::merkle::exec","span":{"name":"ChainOrchestrator::poll"},"spans":[{"name":"ChainOrchestrator::poll"}]}
{"timestamp":"2024-11-13T07:04:43.813835Z","level":"INFO","fields":{"message":"Executing stage","pipeline_stages":"8/12","stage":"MerkleExecute","checkpoint":"22328826","target":"22344678","stage_eta":"8m 57s"},"target":"reth_node_events::node"}
{"timestamp":"2024-11-13T07:04:44.537942Z","level":"DEBUG","fields":{"message":"Saving inner merkle checkpoint","last_account_key":"0xbe863bb046edd46c453217e574cb150b7321bfe8ccf480c24220e13d4593a6b8"},"target":"sync::stages::merkle::exec","span":{"name":"ChainOrchestrator::poll"},"spans":[{"name":"ChainOrchestrator::poll"}]}
{"timestamp":"2024-11-13T07:04:44.593817Z","level":"INFO","fields":{"message":"Committed stage progress","pipeline_stages":"8/12","stage":"MerkleExecute","checkpoint":"22328826","target":"22344678","stage_progress":"76.62%","stage_eta":"34m 48s"},"target":"reth_node_events::node"}
Afterwards, the preparing -> executing -> committing cycle returns to taking ~1s again.
Hi,
I had the same issue, but resolved it by dropping the merkle table and letting it re-run during execution sync. It may have to do with a recent bugfix in the trie calculations, where there is a conflict between new and old data in this table. It may also be related to SSD speeds.
Describe the bug
Related: https://github.com/paradigmxyz/reth/issues/7282 https://github.com/paradigmxyz/reth/issues/9074
Logs: oom-logs.json (jsonl)
There have been several reports in Telegram of OOM issues during MerkleExecute when syncing Base mainnet. Opening this issue to provide additional context. (I didn't OOM myself, but observed a large memory spike that would OOM other machines.)

The spike happens between the executing stage and the committing stage, a step which normally takes 1s or less but here lasts roughly the same 5 minutes as the memory spike.
The save_execution_checkpoint function isn't run, and the "Saving inner merkle checkpoint" message isn't logged, until the end of these 5 minutes: https://github.com/paradigmxyz/reth/blob/c326708ffc14f1dae63419521884b0a90b3e037d/crates/stages/stages/src/stages/merkle.rs#L115-L131

"Continuing inner merkle checkpoint" is logged, so we're in this branch: https://github.com/paradigmxyz/reth/blob/c326708ffc14f1dae63419521884b0a90b3e037d/crates/stages/stages/src/stages/merkle.rs#L176-L189
Looks like the root cause is somewhere in this code: https://github.com/paradigmxyz/reth/blob/c326708ffc14f1dae63419521884b0a90b3e037d/crates/stages/stages/src/stages/merkle.rs#L213-L231
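To make the memory behaviour of that region concrete, here's a self-contained toy model of the pattern as I understand it (this is not the reth code; all names, types and numbers are made up): updates accumulate in an in-memory buffer for the whole chunk and are only flushed once the walk either pauses at the inner checkpoint or completes.

```rust
// Toy model of the pattern I think is at play (NOT reth code; all names and
// numbers are made up). The point: trie updates are buffered in memory for an
// entire chunk and only flushed once the walk pauses at the inner checkpoint
// or completes.

/// In-memory buffer of trie node updates produced while walking leaves.
#[derive(Default)]
struct TrieUpdateBuffer {
    nodes: Vec<(Vec<u8>, Vec<u8>)>, // (nibble path, encoded node)
}

enum WalkProgress {
    /// Hit the per-chunk threshold: the stage would save the inner checkpoint.
    Paused { last_key: u64, updates: TrieUpdateBuffer },
    /// Walked every remaining leaf: the stage would validate and commit.
    Complete { updates: TrieUpdateBuffer },
}

/// Walk up to `threshold` leaves starting at `start_key`, buffering every update.
fn root_with_progress(start_key: u64, leaf_count: u64, threshold: u64) -> WalkProgress {
    let mut updates = TrieUpdateBuffer::default();
    for key in start_key..start_key.saturating_add(leaf_count) {
        // each leaf contributes updated leaf/branch nodes to the in-memory buffer
        updates.nodes.push((key.to_be_bytes().to_vec(), vec![0u8; 64]));
        if updates.nodes.len() as u64 >= threshold {
            return WalkProgress::Paused { last_key: key, updates };
        }
    }
    WalkProgress::Complete { updates }
}

fn main() {
    match root_with_progress(0, 2_000_000, 1_000_000) {
        WalkProgress::Paused { last_key, updates } => {
            // real stage: write trie updates + "Saving inner merkle checkpoint"
            println!("paused at key {last_key} with {} buffered updates", updates.nodes.len());
        }
        WalkProgress::Complete { updates } => {
            // real stage: write trie updates + state root validation
            println!("complete with {} buffered updates", updates.nodes.len());
        }
    }
}
```

If one chunk ends up covering ~314M storage leaves before it pauses, that buffer alone is hundreds of millions of small allocations held in RAM until the flush, which would match the spike.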
Either root_with_progress() -> calculate() or write_trie_updates looks like the main culprit; both look heavy on allocations. (A sketch of how one might confirm this with allocator stats follows the observations below.)

It's most likely StateRoot::root_with_progress() -> StateRoot::calculate(), as this records the TrieRootMetrics at the end of the function, and the screenshots below show those metrics being recorded exactly when the memory spike ends.

Memory grows between 04:40 and 04:46
Huge number of storage leaves added once this completes (313,998,906, compared to the previous 0-10M leaves)
DB also grows by 5GB once this is committed
DB Metrics:
Merkle execute stage progress jumps from 52% to 76%
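Since the build runs with the jemalloc feature, one way to confirm which call accounts for the growth would be to sample jemalloc's allocated-bytes counter immediately before and after the suspected calls. Below is a minimal sketch using the tikv-jemallocator / tikv-jemalloc-ctl crates; this is my suggestion for local debugging, not existing reth instrumentation.

```rust
// Sketch: measure how many bytes jemalloc says are allocated across a suspect
// call. Assumes jemalloc is the global allocator (as in reth's jemalloc builds).
use tikv_jemalloc_ctl::{epoch, stats};
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

/// Currently allocated bytes according to jemalloc.
fn allocated_bytes() -> usize {
    // jemalloc caches its stats; advancing the epoch refreshes them first.
    epoch::advance().unwrap();
    stats::allocated::read().unwrap()
}

fn main() {
    let before = allocated_bytes();

    // Stand-in workload for the suspected call, e.g. the state-root
    // computation or the trie-update write in the real stage.
    let buffer: Vec<u64> = (0..10_000_000).collect();

    let after = allocated_bytes();
    println!(
        "allocated delta: {:.1} MiB (buffer len {})",
        after.saturating_sub(before) as f64 / (1024.0 * 1024.0),
        buffer.len()
    );
}
```

Dropping a few probes like this around the chunk loop in merkle.rs should show whether the buffered trie updates or something inside the walker dominates the spike.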
Steps to reproduce
Sync an archive op-reth node on Base Mainnet.
No special config/flags used to run it - just using:
I unwound a node to block 21710931 and began syncing from there. The memory spike happened a few hours later, so the furthest I can narrow it down is somewhere between Base mainnet blocks 21710931 and 22328826. It should be possible to replicate fairly quickly by getting a DB that's close to block 21710931 and letting it sync from there.
The same memory spike occurred during MerkleExecute when I was syncing from genesis, but I didn't look into the logs much.
Node logs
Memory spike + stage hang begins at the "Executing stage" log entry dated 2024-11-13T04:40:49.880967Z and lasts until "Committed stage progress" at 2024-11-13T04:46:31.943855Z.
Note the long delay between "Executing stage" and "Committed stage progress", a step which normally takes only about a second. Once this finally commits, the preparing -> executing -> committed cycle returns to taking around a second.

Logs from 30 seconds prior until 1 minute after are attached at the top of the issue. Please note: I removed the engine + net logs to make them easier to read. During this time there were still new engine messages being logged.
Platform(s)
Linux (x86)
What version/commit are you on?
reth-optimism-cli Version: 1.1.1
Commit SHA: 15c230bac20e2b1b3532c8b0d470e815fbc0cc22
Build Timestamp: 2024-11-05T13:12:49.909002171Z
Build Features: asm_keccak,jemalloc,optimism
Build Profile: maxperf
What database version are you on?
Current database version: 2
Local database version: 2
Which chain / network are you on?
Base mainnet
What type of node are you running?
Archive (default)
What prune config do you use, if any?
[prune]
block_interval = 999999999
If you've built Reth from source, provide the full command you used
n/a