paradigmxyz / reth

Modular, contributor-friendly and blazing-fast implementation of the Ethereum protocol, in Rust
https://reth.rs/
Apache License 2.0

Repeated Out of Memory Kills during MerkleExecute Stage #9074

Open 0xMafiat opened 2 weeks ago

0xMafiat commented 2 weeks ago

Describe the bug

I'm experiencing an issue with op-reth where it consistently reaches an out-of-memory (OOM) condition during the MerkleExecute stage (8/12) of synchronization. The sync starts normally but fails repeatedly at around 40% progress in this stage, with the system logging multiple OOM kills.

Environment:

Behavior Observed:

Attempted Solutions:

I increased log verbosity to debug further and found frequent "Next hook is not ready hook='StaticFile'" messages, which may point to problems with file access or processing efficiency.

Questions/Requests:

I'm seeking any assistance or suggestions on configuration adjustments that could stabilize memory usage and allow successful node synchronization.

Steps to reproduce

Steps to reproduce the behavior:

  1. Set up the node using the provided Docker configuration.
  2. Start the node synchronization.
  3. Observe that the process progresses normally until it reaches the MerkleExecute stage.
  4. At around 40% progress in the MerkleExecute stage, the node experiences OOM kills and restarts the synchronization process.

Docker Compose Configuration:

services:
  execution_client:
    image: ghcr.io/paradigmxyz/op-reth:v1.0.0
    container_name: base_reth
    restart: always
    command: >
        node
        --full
        --chain base
        --datadir /data
        --rollup.sequencer-http https://mainnet-sequencer.base.org
        --rollup.disable-tx-pool-gossip
        --port 30305
        --discovery.port 30305
        --enable-discv5-discovery
        --discovery.v5.addr 0.0.0.0
        --discovery.v5.port 30306
        --discovery.v5.port.ipv6 30306
        --http
        --http.addr 0.0.0.0
        --http.port 8547
        --http.corsdomain "*"
        --http.api all
        --ws
        --ws.addr 0.0.0.0
        --ws.port 8548
        --ws.origins "*"
        --ws.api all
        --authrpc.jwtsecret /jwt/jwt.hex
        --authrpc.addr 0.0.0.0
        --authrpc.port 9551
    ports:
      - "8547:8547/tcp"
      - "8548:8548/tcp"
      - "30305:30305/tcp"
      - "30305:30305/udp"
      - "30306:30306/tcp"
      - "30306:30306/udp"
    volumes:
      - base_reth_data:/data
      - base_reth_jwt:/jwt

Node logs

Logs until process is killed:

2024-06-24T04:32:51.226052Z  INFO Starting reth version="0.2.0-beta.9 (7b435e0)"
2024-06-24T04:32:51.227215Z  INFO Opening database path="/data/db"
2024-06-24T04:32:51.252121Z  INFO Configuration loaded path="/data/reth.toml"
2024-06-24T04:32:51.274274Z  INFO Verifying storage consistency.
2024-06-24T04:32:51.293456Z  INFO Database opened
2024-06-24T04:32:51.295294Z  INFO 
Pre-merge hard forks (block based):
- Frontier                         @0
- Homestead                        @0
- Tangerine                        @0
- SpuriousDragon                   @0
- Byzantium                        @0
- Constantinople                   @0
- Petersburg                       @0
- Istanbul                         @0
- MuirGlacier                      @0
- Berlin                           @0
- London                           @0
- ArrowGlacier                     @0
- GrayGlacier                      @0
- Bedrock                          @0
Merge hard forks:
- Paris                            @0 (network is known to be merged)
Post-merge hard forks (timestamp based):
- Regolith                         @0
- Shanghai                         @1704992401
- Canyon                           @1704992401
- Cancun                           @1710374401
- Ecotone                          @1710374401
- Fjord                            @1720627201
2024-06-24T04:32:51.377867Z  INFO Transaction pool initialized
2024-06-24T04:32:51.378472Z  INFO Connecting to P2P network
2024-06-24T04:32:51.379274Z  INFO Loading saved peers file=/data/known-peers.json
2024-06-24T04:32:52.712554Z  INFO StaticFileProducer initialized
2024-06-24T04:32:52.714256Z  INFO Pruner initialized prune_config=PruneConfig { block_interval: 5, segments: PruneModes { sender_recovery: Some(Full), transaction_lookup: None, receipts: None, account_history: Some(Distance(10064)), storage_history: Some(Distance(10064)), receipts_log_filter: ReceiptsLogPruneConfig({}) } }
2024-06-24T04:32:52.715821Z  INFO Consensus engine initialized
2024-06-24T04:32:52.715956Z  INFO Engine API handler initialized
2024-06-24T04:32:52.726244Z  INFO RPC auth server started url=0.0.0.0:9551
2024-06-24T04:32:52.727403Z  INFO RPC IPC server started path=/tmp/reth.ipc
2024-06-24T04:32:52.727439Z  INFO RPC HTTP server started url=0.0.0.0:8547
2024-06-24T04:32:52.727447Z  INFO RPC WS server started url=0.0.0.0:8548
2024-06-24T04:32:52.727940Z  INFO Starting consensus engine
2024-06-24T04:32:52.730580Z  INFO Target block already reached checkpoint=16100209 target=Hash(0xa50440445fbdc309bfc4fb4f802a948b71dbfe276b0e1a4a790293dccefa9be6)
2024-06-24T04:32:52.730688Z  INFO Preparing stage pipeline_stages=1/12 stage=Headers checkpoint=16100209 target=None
2024-06-24T04:32:52.730728Z  INFO Executing stage pipeline_stages=1/12 stage=Headers checkpoint=16100209 target=None
2024-06-24T04:32:52.730738Z  INFO Finished stage pipeline_stages=1/12 stage=Headers checkpoint=16100209 target=None stage_progress=100.00%
2024-06-24T04:32:52.733692Z  INFO Preparing stage pipeline_stages=2/12 stage=Bodies checkpoint=16100209 target=16100209
2024-06-24T04:32:52.733746Z  INFO Executing stage pipeline_stages=2/12 stage=Bodies checkpoint=16100209 target=16100209
2024-06-24T04:32:52.733848Z  INFO Finished stage pipeline_stages=2/12 stage=Bodies checkpoint=16100209 target=16100209 stage_progress=100.00%
2024-06-24T04:32:52.739741Z  INFO Preparing stage pipeline_stages=3/12 stage=SenderRecovery checkpoint=16100209 target=16100209
2024-06-24T04:32:52.739786Z  INFO Executing stage pipeline_stages=3/12 stage=SenderRecovery checkpoint=16100209 target=16100209
2024-06-24T04:32:52.739798Z  INFO Finished stage pipeline_stages=3/12 stage=SenderRecovery checkpoint=16100209 target=16100209 stage_progress=100.00%
2024-06-24T04:32:52.747620Z  INFO Preparing stage pipeline_stages=4/12 stage=Execution checkpoint=16100209 target=16100209
2024-06-24T04:32:52.747672Z  INFO Executing stage pipeline_stages=4/12 stage=Execution checkpoint=16100209 target=16100209
2024-06-24T04:32:52.747795Z  INFO Finished stage pipeline_stages=4/12 stage=Execution checkpoint=16100209 target=16100209 stage_progress=100.00%
2024-06-24T04:32:52.750320Z  INFO Stage is always skipped
2024-06-24T04:32:52.750336Z  INFO Preparing stage pipeline_stages=5/12 stage=MerkleUnwind checkpoint=16100209 target=16100209
2024-06-24T04:32:52.750358Z  INFO Executing stage pipeline_stages=5/12 stage=MerkleUnwind checkpoint=16100209 target=16100209
2024-06-24T04:32:52.750366Z  INFO Finished stage pipeline_stages=5/12 stage=MerkleUnwind checkpoint=16100209 target=16100209
2024-06-24T04:32:52.752756Z  INFO Preparing stage pipeline_stages=6/12 stage=AccountHashing checkpoint=16100209 target=16100209
2024-06-24T04:32:52.752824Z  INFO Executing stage pipeline_stages=6/12 stage=AccountHashing checkpoint=16100209 target=16100209
2024-06-24T04:32:52.752836Z  INFO Finished stage pipeline_stages=6/12 stage=AccountHashing checkpoint=16100209 target=16100209 stage_progress=100.00%
2024-06-24T04:32:52.755615Z  INFO Preparing stage pipeline_stages=7/12 stage=StorageHashing checkpoint=16100209 target=16100209
2024-06-24T04:32:52.755649Z  INFO Executing stage pipeline_stages=7/12 stage=StorageHashing checkpoint=16100209 target=16100209
2024-06-24T04:32:52.755773Z  INFO Finished stage pipeline_stages=7/12 stage=StorageHashing checkpoint=16100209 target=16100209 stage_progress=100.00%
2024-06-24T04:32:52.758195Z  INFO Preparing stage pipeline_stages=8/12 stage=MerkleExecute checkpoint=0 target=16100209
2024-06-24T04:32:52.758225Z  INFO Executing stage pipeline_stages=8/12 stage=MerkleExecute checkpoint=0 target=16100209
2024-06-24T04:32:54.276523Z  INFO Received forkchoice updated message when syncing head_block_hash=0x8f1207f62a78c35efa17ccfbd32d71b1138801ed25dfbeb1b03db4ec3ea6cde8 safe_block_hash=0x0000000000000000000000000000000000000000000000000000000000000000 finalized_block_hash=0x0000000000000000000000000000000000000000000000000000000000000000
2024-06-24T04:32:55.717472Z  INFO Status connected_peers=0 freelist=17 stage=MerkleExecute checkpoint=0 target=16100209
2024-06-24T04:33:20.717577Z  INFO Status connected_peers=3 freelist=17 stage=MerkleExecute checkpoint=0 target=16100209
...
2024-06-24T04:49:35.717121Z  INFO Status connected_peers=16 freelist=17 stage=MerkleExecute checkpoint=0 target=16100209
2024-06-24T04:50:00.717121Z  INFO Status connected_peers=17 freelist=17 stage=MerkleExecute checkpoint=0 target=16100209
2024-06-24T04:50:19.718302Z  INFO Starting reth version="0.2.0-beta.9 (7b435e0)"
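
The excerpt above shows one full crash cycle: a "Starting reth" line, a run of Status lines stuck at checkpoint=0, then another "Starting reth" line. A minimal sketch for measuring how long each cycle lasts, assuming the log format shown above (an RFC 3339 timestamp as the first field). The function name `restart_intervals` is hypothetical and not part of reth.

```python
from datetime import datetime


def restart_intervals(log_lines):
    """Return the seconds elapsed between consecutive 'Starting reth' lines."""
    starts = []
    for line in log_lines:
        if "Starting reth" in line:
            # First whitespace-separated field, e.g. 2024-06-24T04:32:51.226052Z
            stamp = line.split()[0]
            starts.append(datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ"))
    return [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]


# Three lines copied from the node log in this report.
sample = [
    '2024-06-24T04:32:51.226052Z  INFO Starting reth version="0.2.0-beta.9 (7b435e0)"',
    "2024-06-24T04:50:00.717121Z  INFO Status connected_peers=17 freelist=17 "
    "stage=MerkleExecute checkpoint=0 target=16100209",
    '2024-06-24T04:50:19.718302Z  INFO Starting reth version="0.2.0-beta.9 (7b435e0)"',
]

for seconds in restart_intervals(sample):
    print(f"restart after {seconds / 60:.1f} minutes")
```

For the two start lines quoted here, the gap is a little under 17.5 minutes, which is in the same range as the roughly 16-minute spacing between the two OOM kills in the dmesg output.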

dmesg logs:

[Mon Jun 24 16:57:28 2024] Out of memory: Killed process 145001 (op-reth) total-vm:4332918936kB, anon-rss:15685612kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:68624kB oom_score_adj:0
[Mon Jun 24 17:13:32 2024] Out of memory: Killed process 145702 (op-reth) total-vm:4333045852kB, anon-rss:15653920kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:69880kB oom_score_adj:0
...
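
The kernel reports these sizes in kB, which makes them hard to compare with the host's RAM at a glance. A small sketch that pulls the fields out of an OOM-killer line and converts them to GiB, assuming the line format shown above. The regex and the name `parse_oom_line` are illustrative, not part of any tool mentioned in this report.

```python
import re

# Matches the fields of interest in a kernel "Out of memory: Killed process" line.
OOM_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)

KIB_PER_GIB = 1024 * 1024  # the kernel's "kB" here are units of 1024 bytes


def parse_oom_line(line):
    """Return pid, process name and sizes in GiB, or None if the line does not match."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    return {
        "pid": int(match["pid"]),
        "name": match["name"],
        "total_vm_gib": int(match["total_vm"]) / KIB_PER_GIB,
        "anon_rss_gib": int(match["anon_rss"]) / KIB_PER_GIB,
    }


# First dmesg line from this report.
line = (
    "[Mon Jun 24 16:57:28 2024] Out of memory: Killed process 145001 (op-reth) "
    "total-vm:4332918936kB, anon-rss:15685612kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:0 pgtables:68624kB oom_score_adj:0"
)
info = parse_oom_line(line)
print(f"{info['name']} pid={info['pid']} anon-rss={info['anon_rss_gib']:.2f} GiB")
```

For the first kill, anon-rss works out to just under 15 GiB of anonymous memory held by op-reth at the moment it was killed. The much larger total-vm figure is virtual address space, not resident memory.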

Logs with debug mode:

2024-06-24T20:45:40.750248Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-06-24T20:45:40.766823Z  INFO reth::cli: Status connected_peers=21 freelist=17 stage=MerkleExecute checkpoint=0 target=16100209
2024-06-24T20:45:42.103075Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:42.106388Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:42.106697Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-06-24T20:45:43.003980Z DEBUG net: Session established remote_addr=3.10.19.215:30316 client_version=Nodecrawler/v1.14.0-unstable/linux-amd64/go1.22.2 peer_id=0xdec4910b03c1f6069b41ec38d4cb2cdfe607a27244455bcd5764b62ab6dec9dbfb925bdf5a609a274d3940489605b05901db015b5bea43368cdbc12acca647b9 total_active=22 kind=outgoing peer_enode=enode://dec4910b03c1f6069b41ec38d4cb2cdfe607a27244455bcd5764b62ab6dec9dbfb925bdf5a609a274d3940489605b05901db015b5bea43368cdbc12acca647b9@3.10.19.215:30316
2024-06-24T20:45:43.257297Z DEBUG net::session: failed to receive message err=disconnected remote_peer_id=0xdec4910b03c1f6069b41ec38d4cb2cdfe607a27244455bcd5764b62ab6dec9dbfb925bdf5a609a274d3940489605b05901db015b5bea43368cdbc12acca647b9
2024-06-24T20:45:43.387189Z DEBUG net: Session established remote_addr=3.10.19.215:30318 client_version=Nodecrawler/v1.14.0-unstable/linux-amd64/go1.22.2 peer_id=0x1cf7db662897dcd2af946fdd47e5b6d8273846e6a6bef4f1f3ec1f2b38c77c6a2382fb9409e43681c15cc043298d9476f2faa14c1adb805c15559e616ff4c361 total_active=22 kind=outgoing peer_enode=enode://1cf7db662897dcd2af946fdd47e5b6d8273846e6a6bef4f1f3ec1f2b38c77c6a2382fb9409e43681c15cc043298d9476f2faa14c1adb805c15559e616ff4c361@3.10.19.215:30318
2024-06-24T20:45:43.484507Z DEBUG net: Session established remote_addr=3.10.19.215:30306 client_version=Nodecrawler/v1.14.0-unstable/linux-amd64/go1.22.2 peer_id=0x130bce4e70347c182fee96d2e09406af534db976499b2416f308fcd040814f6c6fe77e7afdec071f8a4b154976991773a34c1dcd48efe62b5f9804517ebd3b39 total_active=23 kind=outgoing peer_enode=enode://130bce4e70347c182fee96d2e09406af534db976499b2416f308fcd040814f6c6fe77e7afdec071f8a4b154976991773a34c1dcd48efe62b5f9804517ebd3b39@3.10.19.215:30306
2024-06-24T20:45:43.640366Z DEBUG net::session: failed to receive message err=disconnected remote_peer_id=0x1cf7db662897dcd2af946fdd47e5b6d8273846e6a6bef4f1f3ec1f2b38c77c6a2382fb9409e43681c15cc043298d9476f2faa14c1adb805c15559e616ff4c361
2024-06-24T20:45:43.728681Z DEBUG net::session: failed to receive message err=disconnected remote_peer_id=0x130bce4e70347c182fee96d2e09406af534db976499b2416f308fcd040814f6c6fe77e7afdec071f8a4b154976991773a34c1dcd48efe62b5f9804517ebd3b39
2024-06-24T20:45:43.769540Z DEBUG net: Session established remote_addr=49.13.11.197:30310 client_version=Nodecrawler/v1.14.0-unstable/linux-arm64/go1.22.3 peer_id=0x43bc2ef1a9df6506e064592d6755d74675dde3cf98b7f0a3878b468e530e2d7abfe728bf7a7087c1bb475bfa742511b4a5e5a22c1ac2bf401e0821b562fbe1ae total_active=22 kind=outgoing peer_enode=enode://43bc2ef1a9df6506e064592d6755d74675dde3cf98b7f0a3878b468e530e2d7abfe728bf7a7087c1bb475bfa742511b4a5e5a22c1ac2bf401e0821b562fbe1ae@49.13.11.197:30310
2024-06-24T20:45:44.030196Z DEBUG net::session: failed to receive message err=disconnected remote_peer_id=0x43bc2ef1a9df6506e064592d6755d74675dde3cf98b7f0a3878b468e530e2d7abfe728bf7a7087c1bb475bfa742511b4a5e5a22c1ac2bf401e0821b562fbe1ae
2024-06-24T20:45:44.479127Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:44.482399Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:44.482633Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-06-24T20:45:46.601091Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:46.603673Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:46.603847Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-06-24T20:45:47.244385Z DEBUG net::session: pending session timed out remote_addr=34.34.14.134:9003 direction=Outgoing(0xb20ee335f0ac45a4cd07199bc3f646c1a43bf429636d88b016ca79c8c7732b6c97097c7d049e975f5447daa007d464c6f8fcb7e3836f3d38d651b2f7b44a3b22)
2024-06-24T20:45:48.853447Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:48.857155Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:48.857460Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-06-24T20:45:50.600960Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:50.603859Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:50.604063Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-06-24T20:45:52.883919Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:52.887247Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:52.887454Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-06-24T20:45:54.336882Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:54.340350Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:54.340624Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-06-24T20:45:56.501678Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:56.505159Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:56.505498Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-06-24T20:45:57.244234Z DEBUG net::session: pending session timed out remote_addr=168.119.88.180:9222 direction=Outgoing(0x20f4d127b8cd2570205596bc34c004f1e26ea4c750e02d14a48d6da98760a84f16ce9dab7781bdf57e6d50b0f7428f68e4a25ba46a9d94ae243c3f6da230a24a)
2024-06-24T20:45:58.209904Z DEBUG net: Session established remote_addr=3.10.19.215:30309 client_version=Nodecrawler/v1.14.0-unstable/linux-amd64/go1.22.2 peer_id=0xdf7ac5ee488b74924bdb861350f174764014ded895332e61d7d8f7672d5b8421e73549bb9b71f0ec2509d179eceac9f5673cb3f69bc5ae2008da3633b17b22bd total_active=22 kind=outgoing peer_enode=enode://df7ac5ee488b74924bdb861350f174764014ded895332e61d7d8f7672d5b8421e73549bb9b71f0ec2509d179eceac9f5673cb3f69bc5ae2008da3633b17b22bd@3.10.19.215:30309
2024-06-24T20:45:58.477767Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:58.480603Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:45:58.480826Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-06-24T20:45:58.531374Z DEBUG net::session: failed to receive message err=disconnected remote_peer_id=0xdf7ac5ee488b74924bdb861350f174764014ded895332e61d7d8f7672d5b8421e73549bb9b71f0ec2509d179eceac9f5673cb3f69bc5ae2008da3633b17b22bd
2024-06-24T20:46:00.247808Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:46:00.250745Z DEBUG jsonrpsee-server: Accepting new connection 1/500
2024-06-24T20:46:00.250952Z DEBUG consensus::engine::hooks: Next hook is not ready hook="StaticFile"
2024-06-24T20:46:00.415017Z DEBUG net: Session established remote_addr=49.13.11.197:30308 client_version=Nodecrawler/v1.14.0-unstable/linux-arm64/go1.22.3 peer_id=0xf930e4054447c5062e49e1fef45fc10e0a9fbe2d8d150a3ef2008146c93a338f08abaa12fd93658f1cb60e7726d51c48aa60e3355c301416d89517bdd44e2ae4 total_active=22 kind=outgoing peer_enode=enode://f930e4054447c5062e49e1fef45fc10e0a9fbe2d8d150a3ef2008146c93a338f08abaa12fd93658f1cb60e7726d51c48aa60e3355c301416d89517bdd44e2ae4@49.13.11.197:30308
2024-06-24T20:46:00.521652Z DEBUG net: Session established remote_addr=3.10.19.215:30310 client_version=Nodecrawler/v1.14.0-unstable/linux-amd64/go1.22.2 peer_id=0x6b5af7bc909b209c2a3be9bbb5a11bb49aa833dd62e8f153c38c0ccea5e5090c5604a99f7c92b072792357f5ad803ea62ee69115378417281a3515e7b5543936 total_active=23 kind=outgoing peer_enode=enode://6b5af7bc909b209c2a3be9bbb5a11bb49aa833dd62e8f153c38c0ccea5e5090c5604a99f7c92b072792357f5ad803ea62ee69115378417281a3515e7b5543936@3.10.19.215:30310
2024-06-24T20:46:04.138963Z  INFO reth::cli: Starting reth version="0.2.0-beta.9 (7b435e0)"
2024-06-24T20:46:04.140154Z  INFO reth::cli: Opening database path="/data/db"


Platform(s)

Linux (x86)

What version/commit are you on?

Unable to retrieve directly due to Docker setup; using image ghcr.io/paradigmxyz/op-reth:v1.0.0 as specified in the Docker configuration.

What database version are you on?

I am using the `op-reth` Docker image tagged as 'latest' at the time of setup. The database version is inferred to align with this image version, as no specific database version command output is accessible within the Docker environment.

Which chain / network are you on?

Running on the 'base' network as specified in the Docker compose command line arguments.

What type of node are you running?

Full via --full flag

What prune config do you use, if any?

_No response_

If you've built Reth from source, provide the full command you used

_No response_

Code of Conduct

- [X] I agree to follow the Code of Conduct

Rjected commented 1 week ago

Can you try running with --metrics and Grafana, and checking the "jemalloc memory" section?
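
If setting up Grafana is not convenient, the same numbers can be read directly from the Prometheus endpoint that --metrics exposes. A rough sketch, with several assumptions: the node was started with a metrics address such as 127.0.0.1:9001 (that address is an example, not taken from this report), the endpoint returns the plain Prometheus text format without per-sample timestamps, and the allocator metrics have "jemalloc" somewhere in their name. The metric names in the sample text are illustrative only.

```python
from urllib.request import urlopen


def jemalloc_samples(metrics_text):
    """Pick out samples whose metric name contains 'jemalloc'.

    Assumes each sample line is 'name value' or 'name{labels} value'.
    """
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        if "jemalloc" in name:
            samples[name] = float(value)
    return samples


def fetch_jemalloc(url="http://127.0.0.1:9001/"):
    """Fetch the metrics page and filter it. The URL is an assumed example."""
    with urlopen(url, timeout=5) as resp:
        return jemalloc_samples(resp.read().decode("utf-8"))


# Illustrative input only; the real metric names may differ.
example_text = """\
# TYPE example_jemalloc_resident gauge
example_jemalloc_resident 15000000000
example_jemalloc_allocated 14000000000
example_other_metric 42
"""

for name, value in jemalloc_samples(example_text).items():
    print(f"{name}: {value / 2**30:.2f} GiB")
```

Calling `fetch_jemalloc()` every few seconds while MerkleExecute runs would show whether the allocator's resident memory climbs steadily toward the point where the kernel kills the process.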