Reth very slow to catchup

ncavedale-xlabs commented 1 month ago

Describe the bug

Hi team! We're running a testnet Berachain node, using reth as the execution client. We noticed that whenever reth service is down for just a few minutes (like 2 or 3 minutes), it takes several hours to catchup (during this time, reth does not report block height). We're using pruning settings, to keep some history.

Steps to reproduce

Have a caught up node
Stop reth for 5 minutes or even less
Restart reth and measure how long it takes to catch up again

Node logs

`reth` service logs:

Oct 17 12:54:25 reth[2168334]: 2024-10-17T12:54:25.286179Z  INFO Received forkchoice updated message when syncing head_block_hash=0xf6b083e91bef89b139ffd48d5f1446282802b26ea8b2169792287f32b5415776 safe_block_hash=0xa677f711b49f24c4d4f3fe42d68c8b6375a2903bcae410cab281221f0555d949 finalized_block_hash=0xa677f711b49f24c4d4f3fe42d68c8b6375a2903bcae410cab281221f0555d949
Oct 17 12:54:26 reth[2168334]: 2024-10-17T12:54:26.185520Z  INFO Finished stage pipeline_stages=9/14 stage=MerkleExecute checkpoint=5690672 target=5690672 stage_progress=100.00%
Oct 17 12:54:26 reth[2168334]: 2024-10-17T12:54:26.630001Z  INFO Status connected_peers=47 latest_block=5690093
Oct 17 12:54:28 reth[2168334]: 2024-10-17T12:54:28.896596Z  INFO Preparing stage pipeline_stages=10/14 stage=TransactionLookup checkpoint=5690093 target=5690672
Oct 17 12:54:28 reth[2168334]: 2024-10-17T12:54:28.896622Z  INFO Executing stage pipeline_stages=10/14 stage=TransactionLookup checkpoint=5690093 target=5690672
Oct 17 12:54:28 reth[2168334]: 2024-10-17T12:54:28.896622Z  INFO Updating transaction lookup tx_range=5690093..=5690672
Oct 17 12:54:28 reth[2168334]: 2024-10-17T12:54:28.898291Z  INFO Calculating transaction hashes tx_range=372315732..372397829
Oct 17 12:54:34 reth[2168334]: 2024-10-17T12:54:34.412474Z  INFO Inserting hashes append_only=false progress=10.00%
Oct 17 12:54:35 reth[2168334]: 2024-10-17T12:54:35.734721Z  INFO Received forkchoice updated message when syncing head_block_hash=0xa82fd262c095731e66e93474781a7ea5aa3441919b0f9f29d3dba4bfb9b8a6c0 safe_block_hash=0xf6b083e91bef89b139ffd48d5f1446282802b26ea8b2169792287f32b5415776 finalized_block_hash=0xf6b083e91bef89b139ffd48d5f1446282802b26ea8b2169792287f32b5415776
Oct 17 12:54:37 reth[2168334]: 2024-10-17T12:54:37.121047Z  INFO Received forkchoice updated message when syncing head_block_hash=0x7e80d467d711181463b7b818e596f70cd1ccf2e0d6a2fe4e247a58de8cabd879 safe_block_hash=0xa82fd262c095731e66e93474781a7ea5aa3441919b0f9f29d3dba4bfb9b8a6c0 finalized_block_hash=0xa82fd262c095731e66e93474781a7ea5aa3441919b0f9f29d3dba4bfb9b8a6c0
Oct 17 12:54:39 reth[2168334]: 2024-10-17T12:54:39.705310Z  INFO Inserting hashes append_only=false progress=20.00%
Oct 17 12:54:44 reth[2168334]: 2024-10-17T12:54:44.359277Z  INFO Received forkchoice updated message when syncing head_block_hash=0x802dea18a9e2753f602773afe14e725adcf3101438e01b125f29a7441bf462dc safe_block_hash=0x7e80d467d711181463b7b818e596f70cd1ccf2e0d6a2fe4e247a58de8cabd879 finalized_block_hash=0x7e80d467d711181463b7b818e596f70cd1ccf2e0d6a2fe4e247a58de8cabd879
Oct 17 12:54:44 reth[2168334]: 2024-10-17T12:54:44.875227Z  INFO Inserting hashes append_only=false progress=30.00%
Oct 17 12:54:45 reth[2168334]: 2024-10-17T12:54:45.643071Z  INFO Received forkchoice updated message when syncing head_block_hash=0x4dd1bedfd1e275e1de642c42200b3e39f68f49215e7b78bc61f52991a1032c7d safe_block_hash=0x802dea18a9e2753f602773afe14e725adcf3101438e01b125f29a7441bf462dc finalized_block_hash=0x802dea18a9e2753f602773afe14e725adcf3101438e01b125f29a7441bf462d

.cache/reth/logs/80084/reth.log:

2024-09-13T21:44:11.250095Z  INFO reth::cli: Initialized tracing, debug log directory: /var/lib/berachain/.cache/reth/logs/80084
2024-09-13T21:44:11.251290Z  INFO reth::cli: reth init starting
2024-09-13T21:44:11.251515Z  INFO reth::cli: Opening storage db_path="/var/lib/berachain/data/execution/db" sf_path="/var/lib/berachain/data/execution/static_files"
2024-09-13T21:44:11.290194Z DEBUG storage::db::mdbx: Commit total_duration=2.909567ms commit_latency=Some(CommitLatency(MDBX_commit_latency { preparation: 0, gc_wallclock: 2, audit: 0, write: 0, sync: 183, ending: 0, whole: 186, gc_cputime: 0, gc_prof: MDBX_commit_latency__bindgen_ty_1 { wloops: 0, coalescences: 0, wipes: 0, flushes: 0, kicks: 0, work_counter: 0, work_rtime_monotonic: 0, work_xtime_cpu: 0, work_rsteps: 0, work_xpages: 0, work_majflt: 0, self_counter: 0, self_rtime_monotonic: 0, self_xtime_cpu: 0, self_rsteps: 0, self_xpages: 0, self_majflt: 0 } })) is_read_only=false
2024-09-13T21:44:11.295126Z  INFO reth::cli: Verifying storage consistency.
2024-09-13T21:44:11.319904Z DEBUG reth::cli: Initializing genesis chain=80084 genesis=0xa42850da28e09e0e7f75a76fe2b96c8b869a206bfadc9a4450ca345153b0c7d9
2024-09-13T21:44:11.319970Z DEBUG reth_db_common::init: Writing genesis block.
2024-09-13T21:44:11.327371Z DEBUG storage::db::mdbx: Commit total_duration=2.098328ms commit_latency=Some(CommitLatency(MDBX_commit_latency { preparation: 0, gc_wallclock: 2, audit: 0, write: 0, sync: 129, ending: 0, whole: 133, gc_cputime: 0, gc_prof: MDBX_commit_latency__bindgen_ty_1 { wloops: 0, coalescences: 0, wipes: 0, flushes: 0, kicks: 0, work_counter: 0, work_rtime_monotonic: 0, work_xtime_cpu: 0, work_rsteps: 0, work_xpages: 0, work_majflt: 0, self_counter: 0, self_rtime_monotonic: 0, self_xtime_cpu: 0, self_rsteps: 0, self_xpages: 0, self_majflt: 0 } })) is_read_only=false
2024-09-13T21:44:11.333240Z DEBUG provider::static_file: Commit segment=Headers path="/var/lib/berachain/data/execution/static_files/static_file_headers_0_499999" duration=5.842513ms
2024-09-13T21:44:11.336732Z DEBUG provider::static_file: Commit segment=Transactions path="/var/lib/berachain/data/execution/static_files/static_file_transactions_0_499999" duration=3.369777ms
2024-09-13T21:44:11.342536Z DEBUG provider::static_file: Commit segment=Receipts path="/var/lib/berachain/data/execution/static_files/static_file_receipts_0_499999" duration=5.752542ms
2024-09-13T21:44:11.342714Z  INFO reth::cli: Genesis block written hash=0xa42850da28e09e0e7f75a76fe2b96c8b869a206bfadc9a4450ca345153b0c7d9
2024-09-13T22:53:48.250130Z  INFO reth::cli: Initialized tracing, debug log directory: /var/lib/berachain/.cache/reth/logs/80084
2024-09-13T22:53:48.251446Z  INFO reth::cli: reth init starting
2024-09-13T22:53:48.253081Z  INFO reth::cli: Opening storage db_path="/var/lib/berachain/data/execution/db" sf_path="/var/lib/berachain/data/execution/static_files"
2024-09-13T22:53:48.302674Z  INFO reth::cli: Verifying storage consistency.
2024-09-13T22:53:48.334211Z DEBUG reth::cli: Initializing genesis chain=80084 genesis=0xa42850da28e09e0e7f75a76fe2b96c8b869a206bfadc9a4450ca345153b0c7d9
2024-09-13T22:53:48.334284Z DEBUG reth_db_common::init: Genesis already written, skipping.
2024-09-13T22:53:48.334344Z  INFO reth::cli: Genesis block written hash=0xa42850da28e09e0e7f75a76fe2b96c8b869a206bfadc9a4450ca345153b0c7d9

Platform(s)

Linux (x86)

What version/commit are you on?

We noticed this issue with the following version:

v1.0.6
v1.1.0

What database version are you on?

For some reason this seems to point to mainnet, even though this is a testnet node (mainnet has not been launched yet):

$ reth db version
2024-10-17T13:22:28.145933Z  INFO Initialized tracing, debug log directory: /var/lib/berachain/.cache/reth/logs/mainnet
Error: Datadir does not exist: "/var/lib/berachain/.local/share/reth/mainnet"

Location:
    /var/lib/berachain/source/execution/crates/cli/commands/src/db/mod.rs:76:9

Which chain / network are you on?

Berachain testnet

What type of node are you running?

Pruned with custom reth.toml config

What prune config do you use, if any?

[prune.parts]
sender_recovery = { distance = 216_000 } # Prune all transaction senders before the block `head-216000`, i.e. keep transaction senders for the last 100001 blocks

transaction_lookup = { distance = 216_000 } # Prune all transaction lookups before the block `head-216000`

receipts = { distance = 216_000 } # Prune all receipts before the block `head-216000`

account_history = { distance = 216_000 } # Prune all historical account states before the block `head-216000`

storage_history = { distance = 216_000 } # Prune all historical storage states before the block `head-216000`

If you've built Reth from source, provide the full command you used

cargo build --locked --profile maxperf

Code of Conduct

[X] I agree to follow the Code of Conduct

andreclaro commented 1 month ago

We have also noticed that the safe and finalized blocks are not progressing since the upgrade to v1.1.0:

Latest Block Number: 5691261
Safe Block Number: 5601958
Finalized Block Number: 5601958

Rjected commented 1 month ago

It looks like the node is in pipeline sync, which syncs a large batch of blocks in the first log. How long does that take to complete?

ncavedale-xlabs commented 1 month ago

It's been catching up since Tuesday (disk was full and we had to extend it). It was down for less than an hour.

Rjected commented 1 month ago

I understand that, but how long did the pipeline run in the logs take? In the logs we have:

Oct 17 12:54:28 reth[2168334]: 2024-10-17T12:54:28.896596Z  INFO Preparing stage pipeline_stages=10/14 stage=TransactionLookup checkpoint=5690093 target=5690672

There should be stages before this, and a Finish stage to finish the cycle. It would be good to know how long this specific range of blocks took. Additionally, is there a berachain reth repo? It's possible they may need to tune some stage parameters on their end, to improve performance on a chain like berachain

mattsse commented 1 month ago

the datadir is chosen based on the configured chainspec

how did you run the command and what's the chainspec/genesis.json?

ncavedale-xlabs commented 1 month ago

I understand that, but how long did the pipeline run in the logs take? In the logs we have:
Oct 17 12:54:28 reth[2168334]: 2024-10-17T12:54:28.896596Z  INFO Preparing stage pipeline_stages=10/14 stage=TransactionLookup checkpoint=5690093 target=5690672
There should be stages before this, and a Finish stage to finish the cycle. It would be good to know how long this specific range of blocks took. Additionally, is there a berachain reth repo? It's possible they may need to tune some stage parameters on their end, to improve performance on a chain like berachain

They don't have their own reth version.

The pipeline ran for about 20 minutes:

Oct 17 12:38:56 reth[2168334]: 2024-10-17T12:38:56.077450Z  INFO Preparing stage pipeline_stages=1/14 stage=Headers checkpoint=5690093 target=None
...
...
Oct 17 12:59:01 reth[2168334]: 2024-10-17T12:59:01.272894Z  INFO Finished stage pipeline_stages=14/14 stage=Finish checkpoint=5690672 target=5690672

ncavedale-xlabs commented 1 month ago

the datadir is chosen based on the configured chainspec

how did you run the command and what's the chainspec/genesis.json?

The command we use is:

reth node \
    --chain=/var/lib/berachain/data/execution/genesis.json \
    --datadir=/var/lib/berachain/data/execution \
    --engine.experimental

The genesis file uses "chainId": 80084: https://raw.githubusercontent.com/berachain/beacon-kit/main/testing/networks/80084/eth-genesis.json

mattsse commented 1 month ago

I see that's why the reth db version command is crashing because if no --chain arg provided it falls back to mainnet by default

mattsse commented 1 month ago

do you have the full logs for this run

Oct 17 12:38:56 reth[2168334]: 2024-10-17T12:38:56.077450Z INFO Preparing stage pipeline_stages=1/14 stage=Headers checkpoint=5690093 target=None ... ... Oct 17 12:59:01 reth[2168334]: 2024-10-17T12:59:01.272894Z INFO Finished stage pipeline_stages=14/14 stage=Finish checkpoint=5690672 target=5690672

would like to see in which stage the most time was spent

ncavedale-xlabs commented 1 month ago

Attached is the full log for this run: full_log.txt

github-actions[bot] commented 1 week ago

This issue is stale because it has been open for 21 days with no activity.

andreclaro commented 1 week ago

@mattsse and @Rjected , I have just upgraded one of our Ethereum nodes, from v1.0.6 to v1.1.0

The upgrade provokes the client to go through all the 12 stage pipelines, causing a huge outage of the node.

Upgraded was done 2h 10min ago and node is still on stage 2 (Committed stage progress pipeline_stages=2/12 stage=Bodies checkpoint=16580000 target=21165513 stage_progress=78.33%).

CC: @ncavedale-xlabs

totoCZ commented 12 hours ago

I'm now syncing a backlog of 56 days on mainnet and it's running for 10+ hours still in 9/14. I think it's far slower than what Geth would do.

Rjected commented 1 hour ago

@totoCZ and @andreclaro can we get exact machine specs (cpu, ram, exact disk model especially) or instance types?

totoCZ commented 1 hour ago

2x.Gold 5120 256G RAM 8x Crucial MX500 2TB

totoCZ commented 1 hour ago

I just checked it now and it is synced and following the chain again

paradigmxyz / reth