Base op-reth Archival Node: Can't sync

DaveWK commented 13 hours ago

Describe the bug

I have tried a few different setups, but I am not able to sync an archival node (without --full) and keep it in sync on AWS. I am using io2 storage (20k IOPS) with an r7a.2xlarge (64 GB of RAM, 8 AMD EPYC 9R14 cores), and it keeps looping through the pipeline stages without ever catching up. The culprit seems to be MerkleExecute, and I can see from the performance that it is not a CPU-bound problem: the single core (I assume this is a serialized, single-threaded step) is not maxed out, but my disk IOPS and utilization are always at 100%. The amount of data being transferred is also pretty small, so even with 20k IOPS I am only reading/writing around 8 megs of data.

My suspicion is that the mdbx file is too "sparse" and needs some kind of online compaction or "defrag", but I don't know how to debug this. Running mdbx_copy is not really a solution, since it takes 5 hours to run (and is not an online operation), and I am not able to sync from the available reth-base archive snapshot.
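
One way to put numbers on the "IOPS at 100% but little data moved" observation is to sample /proc/diskstats twice and compute the average request size. The following is only a minimal sketch, not something from the reth tooling: the function names are made up for illustration, and the device name "nvme1n1" is a placeholder that has to be replaced with the volume that actually holds the datadir (see lsblk).

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats reports sectors in 512-byte units


def parse_diskstats(text, device):
    """Return (reads, sectors_read, writes, sectors_written) for one device."""
    for line in text.splitlines():
        fields = line.split()
        # Columns: major, minor, name, reads completed, reads merged,
        # sectors read, ms reading, writes completed, writes merged,
        # sectors written, ...
        if len(fields) >= 10 and fields[2] == device:
            return (int(fields[3]), int(fields[5]),
                    int(fields[7]), int(fields[9]))
    raise ValueError(f"device {device!r} not found")


def io_profile(before, after, seconds):
    """Compute IOPS, throughput (MiB/s) and average request size (KiB)."""
    reads, rsec, writes, wsec = (a - b for a, b in zip(after, before))
    ops = reads + writes
    total_bytes = (rsec + wsec) * SECTOR_BYTES
    return {
        "iops": ops / seconds,
        "mib_per_s": total_bytes / seconds / 2**20,
        "avg_kib": total_bytes / ops / 1024 if ops else 0.0,
    }


def sample(device="nvme1n1", seconds=10):
    """Take two snapshots `seconds` apart and profile the I/O in between."""
    # "nvme1n1" is an assumed placeholder, not taken from the report.
    with open("/proc/diskstats") as f:
        before = parse_diskstats(f.read(), device)
    time.sleep(seconds)
    with open("/proc/diskstats") as f:
        after = parse_diskstats(f.read(), device)
    return io_profile(before, after, seconds)
```

Calling sample() while MerkleExecute is running returns the three figures for that window. If the average request size comes out at only a few KiB while IOPS sit near the provisioned limit, that would be consistent with small random page accesses being the bottleneck rather than bandwidth, though it does not by itself show that the database file is fragmented.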

Steps to reproduce

Download the base-reth archive
make maxperf-op
exec op-reth node --chain=base \
  --rollup.sequencer-http https://mainnet-sequencer.base.org \
  --http --http.port 8545 --ws --ws.port 8546 \
  --http.api=web3,debug,eth,net,txpool \
  --ws.api=web3,debug,eth,net,txpool \
  --metrics=127.0.0.1:9001 \
  --ws.origins="*" \
  --http.corsdomain="*" \
  --rollup.discovery.v4 \
  --engine.experimental \
  --authrpc.jwtsecret ${HOME}/jwt.hex \
  --datadir ${HOME}/oprethdata
Wait around 5-ish hours and see that it keeps looping through the pipeline and never catches up. The longest step, regardless of the number of blocks, seems to be MerkleExecute, which always takes around 2-3 hours.

Node logs

No response

Platform(s)

Linux (x86)

What version/commit are you on?

v1.0.8

What database version are you on?

2

Which chain / network are you on?

base mainnet

What type of node are you running?

Archive (default)

What prune config do you use, if any?

n/a

If you've built Reth from source, provide the full command you used

make maxperf-op

Code of Conduct

[X] I agree to follow the Code of Conduct

DaveWK commented 13 hours ago

for the record, without --engine.experimental the performance is even worse

mattsse commented 12 hours ago

very odd, same as https://github.com/paradigmxyz/reth/issues/11306

we haven't tried to reproduce this from the snapshot yet, but resynced base entirely on similar infrastructure as yours without any issues. I wonder if this has anything to do with the most recent snapshot itself, will check.

resyncing the base archive from scratch takes ~48hrs, so for now I'd recommend doing that
