testinprod-io / op-erigon

Optimism implementation on the efficiency frontier
https://op-erigon.testinprod.io
GNU Lesser General Public License v3.0
87 stars 15 forks source link

Impossible to Spin-Up a BASE/OP Node - Execution service is busy, cannot assemble blocks & wrong head block #170

Closed andreclaro closed 5 months ago

andreclaro commented 6 months ago

Summary

Update (2024-05-22): We are also seeing these issues in nodes syncing from genesis as well.

In the past few days, I have been trying with the support an op-erigon developer to spin-up nodes for OP and Base mainnet from snapshots they provide (https://snapshot.testinprod.io/). However the nodes are unable to quickly sync, due to very slow sync, to the point where the node falls further behind the latest block.

The warnings and errors message below are also observed on other nodes syncing from genesis.

System information

Versions:

OS & Version:

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:    22.04
Codename:   jammy

Erigon Command (with flags/config):

    --chain op-mainnet \
    --datadir /var/lib/optimism/data \
    --snapshots=false \
    --torrent.port 42069 \
    --torrent.download.rate 1000mb \
    --port 30303 \
    --http \
    --http.addr 0.0.0.0 \
    --http.port 8545 \
    --http.vhosts "*" \
    --http.api=web3,debug,eth,net,engine,erigon \
    --ws \
    --ws.port 8546 \
    --authrpc.addr 127.0.0.1 \
    --authrpc.vhosts="*" \
    --authrpc.port 8551 \
    --authrpc.jwtsecret /var/lib/optimism/jwt.txt \
    --rollup.sequencerhttp https://mainnet-sequencer.optimism.io \
    --db.size.limit 8TB \
    --txpool.gossip.disable=true \
    --nodiscover \
    --p2p.allowed-ports=46303,46304,46305 \
    --p2p.protocol=66,67,68 \
    --metrics \
    --metrics.addr 0.0.0.0 \
    --metrics.port=6060 \
    --verbosity 3 \
    --log.dir.disable

Consensus Layer: op-node

Consensus Layer Command (with flags/config):

    --l1=http://IP_ADDRESS:8545 \
    --l1.beacon=http://IP_ADDRESS:5052 \
    --l2=http://127.0.0.1:8551 \
    --p2p.disable \
    --rpc.addr=127.0.0.1 \
    --rpc.port=9545 \
    --l2.jwt-secret=/var/lib/optimism/jwt.txt \
    --network=op-mainnet \
    --p2p.peerstore.path=/var/lib/optimism/data/opnode_peerstore_db \
    --p2p.priv.path=/var/lib/optimism/data/opnode_p2p_priv.txt \
    --p2p.discovery.path=/var/lib/optimism/data/opnode_discovery_db \
    --rollup.load-protocol-versions=true \
    --rollup.halt=major

Chain/Network: Optimism Mainnet, Sepolia, Base Mainnet and Sepolia

Expected behaviour

Syncing fast without errors.

Actual behaviour

Getting errors on op-erigon and op-node.

Looking at the source code, logs and OP stack specification here is what it is going on:

This issue repeats continuously, causing the synchronization from the snapshot to be very very slow, to the point where the node falls further behind the latest block.

As recommended by the op-erigon team, different settings were tried (disable snapshots and enable snap.keepblocks), and the same software versions used on the node where the snapshot was taken (op_erigon v2.55.1-0.4.3 and op-node v1.7.0) were used, but without any improvement observed.

Steps to reproduce the behaviour

The quickest way to reproduce this issue is to use one of the latest snapshots available at https://snapshot.testinprod.io and try to spin-up a new node using op-erigon and op-node

Logs

OP-Erigon:

May 19 12:21:21 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:21+0000 lvl=warn msg="Failed to share forkchoice-updated signal" state="&{HeadBlockHash:0xa1d4ccf32c6d2eb88f9e20761600312198176995a6a538af96d533399ee01cb4 SafeBlockHash:0x2e917025bf4e285e18db4ae084ecf5613a5d3abf57fd538d246b2e13460b8d07 FinalizedBlockHash:0x2e917025bf4e285e18db4ae084ecf5613a5d3abf57fd538d246b2e13460b8d07}" err="forkChoiceUpdated timeout"
May 19 12:21:21 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:21+0000 lvl=warn msg="Derivation process temporary error" attempts=1 err="engine stage failed: temp: temporarily cannot insert new safe block: failed to complete building on top of L2 chain 0xbce4dfb53b04b1c369f0b52f00c8d3be128eefbb40f96606c34c8c71e7ec9e73:120231965, id: 0x0000000000000098, error (1): failed to make the new L2 block canonical via forkchoice: unrecognized rpc error: forkChoiceUpdated timeout"
May 19 12:21:21 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:21+0000 lvl=info msg="Optimistically queueing unsafe L2 execution payload" id=0x22a33db6987a8e2df989e7a67194728fc06095d1f0a7a2e9b3e554225d52cdcc:120261249
May 19 12:21:21 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:21+0000 lvl=info msg="Optimistically queueing unsafe L2 execution payload" id=0x953a590945fba914f7df9ead13e108f831857e6b4e38ea623e0f14f99207bf59:120261250
May 19 12:21:21 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:21+0000 lvl=info msg="Optimistically queueing unsafe L2 execution payload" id=0xeeacb3304b57e6f5a9b4125c27f34c1bce939846921dd86f06c8f0be0d2cf3c1:120261251
May 19 12:21:21 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:21+0000 lvl=info msg="Optimistically queueing unsafe L2 execution payload" id=0xf790064bfe97e6817bb765261d10d4f88738421373525396699ffe014fd1d24c:120261252
May 19 12:21:22 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:22+0000 lvl=info msg="attempting connection" peer=16Uiu2HAkyVtU5nLwvJBin5fGE1gMUwaHu85JPDoSJEpXqZdxDpEh
May 19 12:21:22 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:22+0000 lvl=info msg="attempting connection" peer=16Uiu2HAmHMYramwnf5zoHPnZiNvnevGVF4d2HWNcBFG66qTChcPv
May 19 12:21:22 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:22+0000 lvl=info msg="attempting connection" peer=16Uiu2HAm4GaUgm3B8pXfSuaBzPVNAubBirKzHaX6KMdoaVS6Xaqb
May 19 12:21:22 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:22+0000 lvl=info msg="connected to peer" peer=16Uiu2HAmHMYramwnf5zoHPnZiNvnevGVF4d2HWNcBFG66qTChcPv addr=/ip4/173.231.40.194/tcp/31367
May 19 12:21:22 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:22+0000 lvl=info msg="Starting P2P sync client event loop" peer=16Uiu2HAmHMYramwnf5zoHPnZiNvnevGVF4d2HWNcBFG66qTChcPv
May 19 12:21:22 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:22+0000 lvl=info msg="attempting connection" peer=16Uiu2HAmFgqH7o7fX9DdgndzhdHvgnpj6dXW1cExjNA6c7gqE8Lq
May 19 12:21:22 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:22+0000 lvl=info msg="attempting connection" peer=16Uiu2HAm2opx9yxr2GN4XNoRW7BN35CczFgpKKseX1NLEYWbok8L
May 19 12:21:23 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:23+0000 lvl=info msg="Received signed execution payload from p2p" id=0x0faff974e237d06c2b776180ceb1c2fef675449052f286785b1cacd018e9d397:120261253 peer=16Uiu2HAm3HuLhyVEbwLts1LLinY9wxjCKMrfK1mcmKir1LcbC3Sn
May 19 12:21:23 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:23+0000 lvl=info msg="Optimistically queueing unsafe L2 execution payload" id=0x0faff974e237d06c2b776180ceb1c2fef675449052f286785b1cacd018e9d397:120261253
May 19 12:21:23 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:23+0000 lvl=warn msg="did not finish previous block building, starting new building now" prev_onto=0xbce4dfb53b04b1c369f0b52f00c8d3be128eefbb40f96606c34c8c71e7ec9e73:120231965 prev_payload_id=0x0000000000000098 new_onto=0xbce4dfb53b04b1c369f0b52f00c8d3be128eefbb40f96606c34c8c71e7ec9e73:120231965
May 19 12:21:23 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:23+0000 lvl=warn msg="Failed to share forkchoice-updated signal" state="&{HeadBlockHash:0xbce4dfb53b04b1c369f0b52f00c8d3be128eefbb40f96606c34c8c71e7ec9e73 SafeBlockHash:0x2e917025bf4e285e18db4ae084ecf5613a5d3abf57fd538d246b2e13460b8d07 FinalizedBlockHash:0x2e917025bf4e285e18db4ae084ecf5613a5d3abf57fd538d246b2e13460b8d07}" err="[ForkChoiceUpdated]: execution service is busy, cannot assemble blocks"
May 19 12:21:23 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:23+0000 lvl=warn msg="Derivation process temporary error" attempts=2 err="engine stage failed: temp: temporarily cannot insert new safe block: failed to create new block via forkchoice: unrecognized rpc error: [ForkChoiceUpdated]: execution service is busy, cannot assemble blocks"
May 19 12:21:23 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:23+0000 lvl=info msg="connected to peer" peer=16Uiu2HAmFgqH7o7fX9DdgndzhdHvgnpj6dXW1cExjNA6c7gqE8Lq addr=/ip4/15.235.216.53/tcp/9222
May 19 12:21:23 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:23+0000 lvl=info msg="Starting P2P sync client event loop" peer=16Uiu2HAmFgqH7o7fX9DdgndzhdHvgnpj6dXW1cExjNA6c7gqE8Lq
May 19 12:21:24 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:24+0000 lvl=info msg="attempting connection" peer=16Uiu2HAmKDbQcesqGYpJhV1BuXg73X9XjkTZKV2XEYL8zRaYHwEi
May 19 12:21:25 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:25+0000 lvl=info msg="Received signed execution payload from p2p" id=0xd3bbec2ec8bf1ee02c5b1bec81147402df864a108a8b5fec7f8c9e39c8f1c042:120261254 peer=16Uiu2HAm3HuLhyVEbwLts1LLinY9wxjCKMrfK1mcmKir1LcbC3Sn
May 19 12:21:25 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:25+0000 lvl=info msg="Optimistically queueing unsafe L2 execution payload" id=0xd3bbec2ec8bf1ee02c5b1bec81147402df864a108a8b5fec7f8c9e39c8f1c042:120261254
May 19 12:21:27 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:27+0000 lvl=info msg="Received signed execution payload from p2p" id=0x80f8ca7b9d87378397f17633035a0818b1d30899ae97f6537a6abb6ae6fe1bbf:120261255 peer=16Uiu2HAm3HuLhyVEbwLts1LLinY9wxjCKMrfK1mcmKir1LcbC3Sn
May 19 12:21:27 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:27+0000 lvl=info msg="Optimistically queueing unsafe L2 execution payload" id=0x80f8ca7b9d87378397f17633035a0818b1d30899ae97f6537a6abb6ae6fe1bbf:120261255
May 19 12:21:27 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:27+0000 lvl=warn msg="did not finish previous block building, starting new building now" prev_onto=0xbce4dfb53b04b1c369f0b52f00c8d3be128eefbb40f96606c34c8c71e7ec9e73:120231965 prev_payload_id=0x0000000000000098 new_onto=0xbce4dfb53b04b1c369f0b52f00c8d3be128eefbb40f96606c34c8c71e7ec9e73:120231965
May 19 12:21:27 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:27+0000 lvl=info msg="inserted block" hash=0xa1d4ccf32c6d2eb88f9e20761600312198176995a6a538af96d533399ee01cb4 number=120231966 state_root=0x5730504d51ec32f2a022b31e51864c1d22ef8395d37d3c19c068ac0f8a6efe4a timestamp=1716062709 parent=0xbce4dfb53b04b1c369f0b52f00c8d3be128eefbb40f96606c34c8c71e7ec9e73 prev_randao=0x5b48117debe3dc38f1ce308b92d1c6420d047723b236c056820f6eefd8b33b98 fee_recipient=0x4200000000000000000000000000000000000011 txs=2 update_safe=false
May 19 12:21:27 m-optimism-02 op-node[1577]: t=2024-05-19T12:21:27+0000 lvl=info msg="Sync progress" reason="processed safe block derived from L1" l2_finalized=0x2e917025bf4e285e18db4ae084ecf5613a5d3abf57fd538d246b2e13460b8d07:120231840 l2_safe=0x2e917025bf4e285e18db4ae084ecf5613a5d3abf57fd538d246b2e13460b8d07:120231840 l2_pending_safe=0xa1d4ccf32c6d2eb88f9e20761600312198176995a6a538af96d533399ee01cb4:120231966 l2_unsafe=0xa1d4ccf32c6d2eb88f9e20761600312198176995a6a538af96d533399ee01cb4:120231966 l2_time=1716062709 l1_derived=0x54c79e434318300856f68eae64e3da898b2c3c282d655e8e00b9f579004d87d0:19899128

OP-Node:

ay 19 12:21:20 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:20.341] [updateForkchoice] Fork choice update: flushing in-memory state (built by previous newPayload)
May 19 12:21:21 m-optimism-02 erigon[1576]: [WARN] [05-19|12:21:21.341] [rpc] served                             conn=127.0.0.1:37458 method=engine_forkchoiceUpdatedV3 reqid=3166 t=1.000392006s err="forkChoiceUpdated timeout"
May 19 12:21:23 m-optimism-02 erigon[1576]: [WARN] [05-19|12:21:23.555] [rpc] served                             conn=127.0.0.1:37458 method=engine_forkchoiceUpdatedV3 reqid=3167 t=180.32µs err="[ForkChoiceUpdated]: execution service is busy, cannot assemble blocks"
May 19 12:21:23 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:23.744] RPC Daemon notified of new headers       from=120231965 to=120231966 hash=0xa1d4ccf32c6d2eb88f9e20761600312198176995a6a538af96d533399ee01cb4 header sending=19.11µs log sending=220ns
May 19 12:21:23 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:23.744] head updated                             hash=0xa1d4ccf32c6d2eb88f9e20761600312198176995a6a538af96d533399ee01cb4 number=120231966
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.803] [ForkChoiceUpdated] BlockBuilder added   payload=153
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.804] Building block...
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.804] Received GetPayloadV3                    payloadId=153
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.806] stage running - force txs, with params   txs=2 notxpool=true
May 19 12:21:27 m-optimism-02 erigon[1576]: [WARN] [05-19|12:21:27.806] wrong head block                         current=0xa1d4ccf32c6d2eb88f9e20761600312198176995a6a538af96d533399ee01cb4 requested=0xbce4dfb53b04b1c369f0b52f00c8d3be128eefbb40f96606c34c8c71e7ec9e73 executionAt=120231966
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.806] [5/7 HashState] Unwinding started        from=120231966 to=120231965 storage=false codes=true
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.806] [5/7 HashState] Unwinding started        from=120231966 to=120231965 storage=false codes=false
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.807] [5/7 HashState] Unwinding started        from=120231966 to=120231965 storage=true codes=false
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.808] [6/7 IntermediateHashes] Unwinding       from=120231966 to=120231965 csbucket=AccountChangeSet
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.809] [6/7 IntermediateHashes] Unwinding       from=120231966 to=120231965 csbucket=StorageChangeSet
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.853] [6/7 IntermediateHashes] Trie root       hash=0x13aa0422d4c4e747a322e4cc0ea236a9717093a7d5b62fbd1c18d307042aa605
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.857] [4/7 MiningExecution] Unwind Mining Execution from=120231966 to=120231965
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.858] stage running - force txs, with params   txs=2 notxpool=true
May 19 12:21:27 m-optimism-02 erigon[1576]: [WARN] [05-19|12:21:27.858] wrong head block                         current=0xa1d4ccf32c6d2eb88f9e20761600312198176995a6a538af96d533399ee01cb4 requested=0xbce4dfb53b04b1c369f0b52f00c8d3be128eefbb40f96606c34c8c71e7ec9e73 executionAt=120231966
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.858] [2/7 MiningCreateBlock] Start mine       block=120231966 baseFee=3995240 gasLimit=30000000
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.914] Built block                              hash=0xa1d4ccf32c6d2eb88f9e20761600312198176995a6a538af96d533399ee01cb4 height=120231966 txs=2 gas used %=56.749 time=110.420412ms
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.918] [ForkChoiceUpdated] BlockBuilder added   payload=154
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.918] Building block...
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.918] Received GetPayloadV3                    payloadId=154
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.920] stage running - force txs, with params   txs=7 notxpool=true
May 19 12:21:27 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:27.920] [2/7 MiningCreateBlock] Start mine       block=120231967 baseFee=4033672 gasLimit=30000000
May 19 12:21:28 m-optimism-02 erigon[1576]: [INFO] [05-19|12:21:28.108] Built block                              hash=0x36c6644774dba9358c9756e326f5120eed84cf20b628d6d0049abaf261bfd61c height=120231967 txs=7 gas used %=2.954 time=189.83558ms

Source code related to the logs messages

andreclaro commented 6 months ago

Update: We are seeing these issues in nodes syncing from genesis as well.

harshsingh-cs commented 5 months ago

We are facing a similar issue op-node

     - command:
        - /app/op-node
        - --network=base-mainnet
        - --l1=http://
        - --l1.beacon=http://
        - --l2=ws://localhost:8551
        - --l2.jwt-secret=/home/node/jwt-secret.txt
        - --rpc.addr=0.0.0.0
        - --rpc.port=9545
        - --p2p.priv.path=/home/node/p2p-private-key.txt
        - --p2p.peerstore.path=memory
        - --p2p.discovery.path=memory
        - --metrics.enabled
        - --l1.trustrpc=true

op-erigon

    - command:
        - erigon
        - --datadir=/home/node/data
        - --db.pagesize=4096
        - --ethash.dagdir=/home/node/data/dag
        - --db.size.limit=8TB
        - --chain=base-mainnet
        - --networkid=8453
        - --authrpc.jwtsecret=/home/node/jwt-secret.txt
        - --authrpc.vhosts=*
        - --http
        - --http.addr=0.0.0.0
        - --http.corsdomain=null
        - --http.vhosts=*
        - --http.api=eth,erigon,net,web3,debug,trace
        - --ws
        - --rpc.batch.limit=1000
        - --rpc.returndata.limit=150000
        - --rollup.sequencerhttp=https://mainnet-sequencer.base.org
        - --rollup.historicalrpc=
        - --rollup.historicalrpctimeout=5s
        - --rpc.txfeecap=100
        - --maxpeers=100
        - --nodiscover
        - --identity=
        - --rollup.disabletxpoolgossip=true
        - --metrics
        - --metrics.addr=0.0.0.0
        - --healthcheck
        - --private.api.addr=0.0.0.0:9090
        - --private.api.ratelimit=63744
        - --db.size.limit=8TB
        - --rpc.batch.concurrency=4
        - --db.read.concurrency=4096
        - --rpc.batch.limit=200
        - --rpc.returndata.limit=200000

Erigon Logs

[INFO] [06-03|17:19:45.686] [updateForkchoice] Fork choice update: flushing in-memory state (built by previous newPayload)
[WARN] [06-03|17:19:46.686] [rpc] served                             method=engine_forkchoiceUpdatedV3 reqid=2901 t=1.000723139s err="forkChoiceUpdated timeout"
[WARN] [06-03|17:19:48.792] [rpc] served                             method=engine_forkchoiceUpdatedV3 reqid=2902 t=1.623159ms err="cannot update forkchoice. execution service is busy"
[INFO] [06-03|17:19:49.956] RPC Daemon notified of new headers       from=15182419 to=15182420 amount=1 hash=0xf8b051f57016879065ca3d445a71a148aba80881543e30b1e107f06e4ecb68e0 header sending=57.651µs log sending=460ns
[INFO] [06-03|17:19:49.956] head updated                             hash=0xf8b051f57016879065ca3d445a71a148aba80881543e30b1e107f06e4ecb68e0 number=15182420
[INFO] [06-03|17:19:52.970] Optimism ForkChoice is choosing to unwind to a previously canonical block blockHash=0x31e63522b043c6b753920ff39fc9e83db3cf568c18f81f5083c23b54226c38d2 blockNumber=15182419 headHash=0xf8b051f57016879065ca3d445a71a148aba80881543e30b1e107f06e4ecb68e0
[WARN] [06-03|17:19:53.970] [rpc] served                             method=engine_forkchoiceUpdatedV3 reqid=2903 t=1.001794509s err="forkChoiceUpdated timeout"
[INFO] [06-03|17:19:54.413] [5/12 HashState] Unwinding started       from=15182420 to=15182419 storage=false codes=true
[INFO] [06-03|17:19:54.415] [5/12 HashState] Unwinding started       from=15182420 to=15182419 storage=false codes=false
[INFO] [06-03|17:19:54.417] [5/12 HashState] Unwinding started       from=15182420 to=15182419 storage=true codes=false
[INFO] [06-03|17:19:54.429] [6/12 IntermediateHashes] Unwinding      from=15182420 to=15182419 csbucket=AccountChangeSet
[INFO] [06-03|17:19:54.430] [6/12 IntermediateHashes] Unwinding      from=15182420 to=15182419 csbucket=StorageChangeSet
[INFO] [06-03|17:19:54.473] [6/12 IntermediateHashes] Trie root      hash=0xee86369693adf23ac3a5c0735ed913f4b6e40ae8c213c06545bb47b18597a480
[INFO] [06-03|17:19:54.492] [4/12 Execution] Unwind Execution        from=15182420 to=15182419
[INFO] [06-03|17:19:54.574] RPC Daemon notified of new headers       from=15181395 to=15182419 amount=1024 hash=0x31e63522b043c6b753920ff39fc9e83db3cf568c18f81f5083c23b54226c38d2 header sending=5.072838ms log sending=590ns
[INFO] [06-03|17:19:54.574] head updated                             hash=0x31e63522b043c6b753920ff39fc9e83db3cf568c18f81f5083c23b54226c38d2 number=15182419
[INFO] [06-03|17:20:02.200] [ForkChoiceUpdated] duplicate build request
[INFO] [06-03|17:20:02.201] Received GetPayloadV3                    payloadId=1
[INFO] [06-03|17:20:02.218] [NewPayload] Handling new payload        height=15182420 hash=0xf8b051f57016879065ca3d445a71a148aba80881543e30b1e107f06e4ecb68e0
[INFO] [06-03|17:20:02.421] [4/12 Execution] Completed on            block=15182420
[INFO] [06-03|17:20:02.601] RPC Daemon notified of new headers       from=15182419 to=15182420 amount=1 hash=0xf8b051f57016879065ca3d445a71a148aba80881543e30b1e107f06e4ecb68e0 header sending=50.891µs log sending=940ns
[INFO] [06-03|17:20:02.602] head updated                             hash=0xf8b051f57016879065ca3d445a71a148aba80881543e30b1e107f06e4ecb68e0 number=15182420
[WARN] [06-03|17:20:03.221] [rpc] served                             method=engine_forkchoiceUpdatedV3 reqid=2907 t=1.00134398s err="forkChoiceUpdated timeout"

op-node Logs

t=2024-06-03T17:19:48+0000 lvl=warn msg="Derivation process temporary error" attempts=2 err="engine stage failed: temp: temporarily cannot insert new safe block: failed to create new block via forkchoice: unrecognized rpc error: cannot update forkchoice. execution service is busy"
t=2024-06-03T17:20:01+0000 lvl=info msg="Optimistically queueing unsafe L2 execution payload" id=0x355ea51a58846846a3a069569bbb3cb121e2b60f0a777d196a5199e4a08a0c6b:15322927
t=2024-06-03T17:20:02+0000 lvl=warn msg="did not finish previous block building, starting new building now" prev_onto=0x31e63522b043c6b753920ff39fc9e83db3cf568c18f81f5083c23b54226c38d2:15182419 prev_payload_id=0x0000000000000001 new_onto=0x31e63522b043c6b753920ff39fc9e83db3cf568c18f81f5083c23b54226c38d2:15182419
mininny commented 5 months ago

@andreclaro @harshsingh-cs Thanks for reporting this issue so thoroughly guys. I'm trying to reproduce this issue on my own node with the same configurations, but it's still being synced to the chain head.

I'll be able to debug more once I have nodes with this same issue, but for now, I'm looking over the codebase only. I did find some potential bug in the codebase which might be related to this, but I can't make any guarantees that this will fix this issue.

Do you guys have any spare nodes that you can play around with? Can you check out feature/mininny/synchronous-fcu branch and build the op-erigon node and see if the issue gets fixed by any chance?

If you don't have any development nodes, please give me a few days to sync and reproduce this issue on my end 😄

andreclaro commented 5 months ago

@mininny, I can give it a try on Optimism mainnet but I have this other issue https://github.com/testinprod-io/op-erigon/issues/171

harshsingh-cs commented 5 months ago

@mininny I tried the branch but it did not help

op-erigon logs

97ms
[INFO] [06-07|10:39:48.094] [NewPayload] Handling new payload        height=15182421 hash=0xe722b1b3e59325df589e75336b06f88e82626e97f5865d13e816b68f2c5ca73e
[INFO] [06-07|10:39:48.259] [updateForkchoice] Fork choice update: flushing in-memory state (built by previous newPayload)
[INFO] [06-07|10:39:48.463] RPC Daemon notified of new headers       from=15182420 to=15182421 amount=1 hash=0xe722b1b3e59325df589e75336b06f88e82626e97f5865d13e816b68f2c5ca73e header sending=47.642µs log sending=730ns
[INFO] [06-07|10:39:48.463] head updated                             hash=0xe722b1b3e59325df589e75336b06f88e82626e97f5865d13e816b68f2c5ca73e number=15182421
[WARN] [06-07|10:39:49.258] [rpc] served                             method=engine_forkchoiceUpdatedV3 reqid=2903 t=1.000708594s err="forkChoiceUpdated timeout"
[INFO] [06-07|10:39:51.299] Optimism ForkChoice is choosing to unwind to a previously canonical block blockHash=0xf8b051f57016879065ca3d445a71a148aba80881543e30b1e107f06e4ecb68e0 blockNumber=15182420 headHash=0xe722b1b3e59325df589e75336b06f88e82626e97f5865d13e816b68f2c5ca73e
[INFO] [06-07|10:39:51.363] [5/12 HashState] Unwinding started       from=15182421 to=15182420 storage=false codes=true
[INFO] [06-07|10:39:51.364] [5/12 HashState] Unwinding started       from=15182421 to=15182420 storage=false codes=false
[INFO] [06-07|10:39:51.367] [5/12 HashState] Unwinding started       from=15182421 to=15182420 storage=true codes=false
[INFO] [06-07|10:39:51.376] [6/12 IntermediateHashes] Unwinding      from=15182421 to=15182420 csbucket=AccountChangeSet
[INFO] [06-07|10:39:51.377] [6/12 IntermediateHashes] Unwinding      from=15182421 to=15182420 csbucket=StorageChangeSet
[INFO] [06-07|10:39:51.426] [6/12 IntermediateHashes] Trie root      hash=0x4bac6f4a92a8b92902ccc6bc4136349789dcee071de7be00e6e0a675b02e10b0
[INFO] [06-07|10:39:51.445] [4/12 Execution] Unwind Execution        from=15182421 to=15182420
[INFO] [06-07|10:39:51.579] RPC Daemon notified of new headers       from=15181396 to=15182420 amount=1024 hash=0xf8b051f57016879065ca3d445a71a148aba80881543e30b1e107f06e4ecb68e0 header sending=2.127271ms log sending=560ns
[INFO] [06-07|10:39:51.579] head updated                             hash=0xf8b051f57016879065ca3d445a71a148aba80881543e30b1e107f06e4ecb68e0 number=15182420
[WARN] [06-07|10:39:52.299] [rpc] served                             method=engine_forkchoiceUpdatedV3 reqid=2904 t=1.001689254s err="forkChoiceUpdated timeout"
[INFO] [06-07|10:39:56.456] [ForkChoiceUpdated] duplicate build request
[INFO] [06-07|10:39:56.457] Received GetPayloadV3                    payloadId=1
[INFO] [06-07|10:39:56.468] [NewPayload] Handling new payload        height=15182421 hash=0xe722b1b3e59325df589e75336b06f88e82626e97f5865d13e816b68f2c5ca73e
[INFO] [06-07|10:39:56.554] [4/12 Execution] Completed on            block=15182421
[INFO] [06-07|10:39:56.748] RPC Daemon notified of new headers       from=15182420 to=15182421 amount=1 hash=0xe722b1b3e59325df589e75336b06f88e82626e97f5865d13e816b68f2c5ca73e header sending=44.282µs log sending=500ns
[INFO] [06-07|10:39:56.748] head updated                             hash=0xe722b1b3e59325df589e75336b06f88e82626e97f5865d13e816b68f2c5ca73e number=15182421
[WARN] [06-07|10:39:57.470] [rpc] served                             method=engine_forkchoiceUpdatedV3 reqid=2908 t=1.000588474s err="forkChoiceUpdated timeout"
[INFO] [06-07|10:40:05.653] Optimism ForkChoice is choosing to unwind to a previously canonical block blockHash=0xf8b051f57016879065ca3d445a71a148aba80881543e30b1e107f06e4ecb68e0 blockNumber=15182420 headHash=0xe722b1b3e59325df589e75336b06f88e82626e97f5865d13e816b68f2c5ca73e
[INFO] [06-07|10:40:05.728] [5/12 HashState] Unwinding started       from=15182421 to=15182420 storage=false codes=true
[INFO] [06-07|10:40:05.729] [5/12 HashState] Unwinding started       from=15182421 to=15182420 storage=false codes=false
[INFO] [06-07|10:40:05.731] [5/12 HashState] Unwinding started       from=15182421 to=15182420 storage=true codes=false
[INFO] [06-07|10:40:05.740] [6/12 IntermediateHashes] Unwinding      from=15182421 to=15182420 csbucket=AccountChangeSet
[INFO] [06-07|10:40:05.741] [6/12 IntermediateHashes] Unwinding      from=15182421 to=15182420 csbucket=StorageChangeSet
[INFO] [06-07|10:40:05.799] [6/12 IntermediateHashes] Trie root      hash=0x4bac6f4a92a8b92902ccc6bc4136349789dcee071de7be00e6e0a675b02e10b0
[INFO] [06-07|10:40:05.822] [4/12 Execution] Unwind Execution        from=15182421 to=15182420
[INFO] [06-07|10:40:05.940] RPC Daemon notified of new headers       from=15181396 to=15182420 amount=1024 hash=0xf8b051f57016879065ca3d445a71a148aba80881543e30b1e107f06e4ecb68e0 header sending=2.051311ms log sending=450ns
[INFO] [06-07|10:40:05.940] head updated                             hash=0xf8b051f57016879065ca3d445a71a148aba80881543e30b1e107f06e4ecb68e0 number=15182420
[WARN] [06-07|10:40:06.654] [rpc] served                             method=engine_forkchoiceUpdatedV3 reqid=2909 t=1.001553327s err="forkChoiceUpdated timeout"
[INFO] [06-07|10:40:15.525] [p2p] GoodPeers
[INFO] [06-07|10:40:15.716] [mem] memory stats                       Rss=840.2MB Size=0B Pss=840.2MB SharedClean=4.0KB SharedDirty=0B PrivateClean=571.6MB PrivateDirty=268.6MB Referenced=839.6MB Anonymous=269.1MB Swap=0B alloc=114.6MB sys=227.7MB
[INFO] [06-07|10:40:16.731] [ForkChoiceUpdated] duplicate build request
[INFO] [06-07|10:40:16.732] Received GetPayloadV3                    payloadId=1
[INFO] [06-07|10:40:16.744] [NewPayload] Handling new payload        height=15182421 hash=0xe722b1b3e59325df589e75336b06f88e82626e97f5865d13e816b68f2c5ca73e
[INFO] [06-07|10:40:16.834] [4/12 Execution] Completed on            block=15182421
[INFO] [06-07|10:40:17.049] RPC Daemon notified of new headers       from=15182420 to=15182421 amount=1 hash=0xe722b1b3e59325df589e75336b06f88e82626e97f5865d13e816b68f2c5ca73e header sending=47.62µs log sending=460ns
[INFO] [06-07|10:40:17.049] head updated                             hash=0xe722b1b3e59325df589e75336b06f88e82626e97f5865d13e816b68f2c5ca73e number=15182421
[WARN] [06-07|10:40:17.746] [rpc] served                             method=engine_forkchoiceUpdatedV3 reqid=2913 t=1.00033198s err="forkChoiceUpdated timeout"
[INFO] [06-07|10:40:22.375] [txpool] stat                            pending=0 baseFee=0 queued=0 alloc=138.7MB sys=227.7MB

op-node logs

=2024-06-07T10:41:23+0000 lvl=warn msg="did not finish previous block building, starting new building now" prev_onto=0xf8b051f57016879065ca3d445a71a148aba80881543e30b1e107f06e4ecb68e0:15182420 prev_payload_id=0x0000000000000001 new_onto=0xf8b051f57016879065ca3d445a71a148aba80881543e30b1e107f06e4ecb68e0:15182420
t=2024-06-07T10:41:24+0000 lvl=warn msg="Failed to share forkchoice-updated signal" state="&{HeadBlockHash:0xe722b1b3e59325df589e75336b06f88e82626e97f5865d13e816b68f2c5ca73e SafeBlockHash:0x39a28f7e1f8ca4d99fe84c90f833f776f9fe9fa4c4378ce47f6790ac3ed6f9e8 FinalizedBlockHash:0x9a156601b3cbb5e9aaa648e91727359df008cbd33409ad754f13f7de13ed6a4b}" err="forkChoiceUpdated timeout"
t=2024-06-07T10:41:24+0000 lvl=warn msg="Derivation process temporary error" attempts=11 err="engine stage failed: temp: temporarily cannot insert new safe block: failed to complete building on top of L2 chain 0xf8b051f57016879065ca3d445a71a148aba80881543e30b1e107f06e4ecb68e0:15182420, id: 0x0000000000000001, error (1): failed to make the new L2 block canonical via forkchoice: unrecognized rpc error: forkChoiceUpdated timeout"

By the way I tried this on a node which was already broken

ImTei commented 5 months ago

@andreclaro @harshsingh-cs We have increased the FCU timeout of op-erigon to the same value of op-node's FCU timeout. Can you try the latest release? https://github.com/testinprod-io/op-erigon/releases/tag/v2.60.0-0.6.2

harshsingh-cs commented 5 months ago

It helped, the node which was not syncing is now syncing Thanks @ImTei for you help !!

ncavedale-xlabs commented 5 months ago

Hi! We tried upgrading to v2.60.0-0.6.2 but now we're getting more of these errors in op-node:

Jun 10 13:34:15 m-exp-optimism-01 op-node[2854036]: t=2024-06-10T13:34:15+0000 lvl=warn msg="Derivation process temporary error" attempts=16 err="engine stage failed: temp: temporarily cannot insert new safe block: failed to complete building on top of L2 chain 0x4b789cbc82e48a3ca3764b0996dce40746fc85dada9c1cd20eb7edb3cada77c4:121213540, id: 0x0000000000000050, error (1): failed to make the new L2 block canonical via forkchoice: unrecognized rpc error: forkChoiceUpdated timeout"
Jun 10 13:34:15 m-exp-optimism-01 op-node[2854036]: t=2024-06-10T13:34:15+0000 lvl=warn msg="L1 head signal indicates a possible L1 re-org" old_l1_head=0xed7c004a35533f38674f97094f67e8a49a5e040b472ac8d1eff5953a8d39bbd2:20061755 new_l1_head_parent=0xdf7e465da66cca04a29165ca9a030825f878185ac598668c6b0f75b5b028dfc8 new_l1_head=0x2f0668227ad522424676355947956bed55859a2e51f5d1de82bf2ac5fa5a1e12:20061757
Jun 10 13:34:25 m-exp-optimism-01 op-node[2854036]: t=2024-06-10T13:34:25+0000 lvl=warn msg="did not finish previous block building, starting new building now" prev_onto=0x4b789cbc82e48a3ca3764b0996dce40746fc85dada9c1cd20eb7edb3cada77c4:121213540 prev_payload_id=0x0000000000000050 new_onto=0x4b789cbc82e48a3ca3764b0996dce40746fc85dada9c1cd20eb7edb3cada77c4:121213540
Jun 10 13:34:27 m-exp-optimism-01 op-node[2854036]: t=2024-06-10T13:34:27+0000 lvl=warn msg="Failed to share forkchoice-updated signal" state="&{HeadBlockHash:0x95d21089cb18ee5cf5e24e54b156d61733ac48713cc568f1f8eeb3d12c379d88 SafeBlockHash:0xc97983cb4d924edd3ebc23a68c59844ccab3039081f0e03d4e5ac82ce9e38a2b FinalizedBlockHash:0x34ca2f4cc96fbf5ce4c53592cd859c4277199b92d29d69bc976044d9eb231daa}" err="forkChoiceUpdated timeout"
Jun 10 13:34:27 m-exp-optimism-01 op-node[2854036]: t=2024-06-10T13:34:27+0000 lvl=warn msg="Derivation process temporary error" attempts=17 err="engine stage failed: temp: temporarily cannot insert new safe block: failed to complete building on top of L2 chain 0x4b789cbc82e48a3ca3764b0996dce40746fc85dada9c1cd20eb7edb3cada77c4:121213540, id: 0x0000000000000050, error (1): failed to make the new L2 block canonical via forkchoice: unrecognized rpc error: forkChoiceUpdated timeout"
Jun 10 13:34:37 m-exp-optimism-01 op-node[2854036]: t=2024-06-10T13:34:37+0000 lvl=warn msg="did not finish previous block building, starting new building now" prev_onto=0x4b789cbc82e48a3ca3764b0996dce40746fc85dada9c1cd20eb7edb3cada77c4:121213540 prev_payload_id=0x0000000000000050 new_onto=0x4b789cbc82e48a3ca3764b0996dce40746fc85dada9c1cd20eb7edb3cada77c4:121213540
Jun 10 13:34:39 m-exp-optimism-01 op-node[2854036]: t=2024-06-10T13:34:39+0000 lvl=warn msg="Failed to share forkchoice-updated signal" state="&{HeadBlockHash:0x95d21089cb18ee5cf5e24e54b156d61733ac48713cc568f1f8eeb3d12c379d88 SafeBlockHash:0xc97983cb4d924edd3ebc23a68c59844ccab3039081f0e03d4e5ac82ce9e38a2b FinalizedBlockHash:0x34ca2f4cc96fbf5ce4c53592cd859c4277199b92d29d69bc976044d9eb231daa}" err="forkChoiceUpdated timeout"
Jun 10 13:34:39 m-exp-optimism-01 op-node[2854036]: t=2024-06-10T13:34:39+0000 lvl=warn msg="Derivation process temporary error" attempts=18 err="engine stage failed: temp: temporarily cannot insert new safe block: failed to complete building on top of L2 chain 0x4b789cbc82e48a3ca3764b0996dce40746fc85dada9c1cd20eb7edb3cada77c4:121213540, id: 0x0000000000000050, error (1): failed to make the new L2 block canonical via forkchoice: unrecognized rpc error: forkChoiceUpdated timeout"

These are the corresponding logs for op-erigon:

Jun 10 13:36:45 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:45.418] [4/12 Execution] Unwind Execution        from=121213541 to=121213540
Jun 10 13:36:45 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:45.456] RPC Daemon notified of new headers       from=121212516 to=121213540 amount=1024 hash=0x4b789cbc82e48a3ca3764b0996dce40746fc85dada9c1cd20eb7edb3cada77c4 header sending=903.172µs log sending=440ns
Jun 10 13:36:45 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:45.457] head updated                             hash=0x4b789cbc82e48a3ca3764b0996dce40746fc85dada9c1cd20eb7edb3cada77c4 number=121213540
Jun 10 13:36:46 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:46.046] [ForkChoiceUpdated] duplicate build request
Jun 10 13:36:46 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:46.047] Received GetPayloadV3                    payloadId=80
Jun 10 13:36:46 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:46.049] [NewPayload] Handling new payload        height=121213541 hash=0x95d21089cb18ee5cf5e24e54b156d61733ac48713cc568f1f8eeb3d12c379d88
Jun 10 13:36:46 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:46.069] [4/12 Execution] Completed on            block=121213541
Jun 10 13:36:46 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:46.175] RPC Daemon notified of new headers       from=121213540 to=121213541 amount=1 hash=0x95d21089cb18ee5cf5e24e54b156d61733ac48713cc568f1f8eeb3d12c379d88 header sending=20.731µs log sending=250ns
Jun 10 13:36:46 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:46.175] head updated                             hash=0x95d21089cb18ee5cf5e24e54b156d61733ac48713cc568f1f8eeb3d12c379d88 number=121213541
Jun 10 13:36:47 m-exp-optimism-01 erigon[2853887]: [WARN] [06-10|13:36:47.051] [rpc] served                             conn=127.0.0.1:60396 method=engine_forkchoiceUpdatedV3 reqid=3042 t=1.000694507s err="forkChoiceUpdated timeout"
Jun 10 13:36:55 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:55.274] Optimism ForkChoice is choosing to unwind to a previously canonical block blockHash=0x4b789cbc82e48a3ca3764b0996dce40746fc85dada9c1cd20eb7edb3cada77c4 blockNumber=121213540 headHash=0x95d21089cb18ee5cf5e24e54b156d61733ac48713cc568f1f8eeb3d12c379d88
Jun 10 13:36:55 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:55.284] [5/12 HashState] Unwinding started       from=121213541 to=121213540 storage=false codes=true
Jun 10 13:36:55 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:55.284] [5/12 HashState] Unwinding started       from=121213541 to=121213540 storage=false codes=false
Jun 10 13:36:55 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:55.284] [5/12 HashState] Unwinding started       from=121213541 to=121213540 storage=true codes=false
Jun 10 13:36:55 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:55.289] [6/12 IntermediateHashes] Unwinding      from=121213541 to=121213540 csbucket=AccountChangeSet
Jun 10 13:36:55 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:55.289] [6/12 IntermediateHashes] Unwinding      from=121213541 to=121213540 csbucket=StorageChangeSet
Jun 10 13:36:55 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:55.319] [6/12 IntermediateHashes] Trie root      hash=0xbd50e044154d20d6be535cdb4458f1c951f5e9142f6dd9d46da5eef46c5d254e
Jun 10 13:36:55 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:55.337] [4/12 Execution] Unwind Execution        from=121213541 to=121213540
Jun 10 13:36:55 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:55.375] RPC Daemon notified of new headers       from=121212516 to=121213540 amount=1024 hash=0x4b789cbc82e48a3ca3764b0996dce40746fc85dada9c1cd20eb7edb3cada77c4 header sending=894.272µs log sending=500ns
Jun 10 13:36:55 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:55.375] head updated                             hash=0x4b789cbc82e48a3ca3764b0996dce40746fc85dada9c1cd20eb7edb3cada77c4 number=121213540
Jun 10 13:36:55 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:55.957] [ForkChoiceUpdated] duplicate build request
Jun 10 13:36:55 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:55.958] Received GetPayloadV3                    payloadId=80
Jun 10 13:36:55 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:55.960] [NewPayload] Handling new payload        height=121213541 hash=0x95d21089cb18ee5cf5e24e54b156d61733ac48713cc568f1f8eeb3d12c379d88
Jun 10 13:36:55 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:55.980] [4/12 Execution] Completed on            block=121213541
Jun 10 13:36:56 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:56.084] RPC Daemon notified of new headers       from=121213540 to=121213541 amount=1 hash=0x95d21089cb18ee5cf5e24e54b156d61733ac48713cc568f1f8eeb3d12c379d88 header sending=18.23µs log sending=231ns
Jun 10 13:36:56 m-exp-optimism-01 erigon[2853887]: [INFO] [06-10|13:36:56.084] head updated                             hash=0x95d21089cb18ee5cf5e24e54b156d61733ac48713cc568f1f8eeb3d12c379d88 number=121213541
ImTei commented 5 months ago

@ncavedale-xlabs Are you using our official docker image? We increased the FCU timeout from 1s to 5s, but your logs said t=1.000694507s err="forkChoiceUpdated timeout". Can you check if the right version is running?

ncavedale-xlabs commented 5 months ago

My bad, I was still using the old binary (we're compiling from source). The errors are gone now 😄 , but I'll keep monitoring and let you know if I see any odd behaviour. Thanks!

cshintov commented 1 month ago

I'm facing the same issue even with the updated fcu timeout of 5!

[WARN] [10-18|06:54:35.014] [rpc] served                             conn=127.0.0.1:42218 method=engine_forkchoiceUpdatedV3 reqid=127472 t=5.000348083s err="forkChoiceUpdated timeout"
[WARN] [10-18|06:54:35.015] [rpc] served                             conn=127.0.0.1:47858 method=engine_forkchoiceUpdatedV3 reqid=127473 t=265.816µs err="cannot update forkchoice. execution service is busy"
[WARN] [10-18|06:54:35.041] [rpc] served                             conn=127.0.0.1:47858 method=engine_forkchoiceUpdatedV3 reqid=127475 t=1.023863ms err="cannot update forkchoice. execution service is busy"

Running:

      node: us-docker.pkg.dev/oplabs-tools-artifacts/images/op-node:v1.9.3
      operigon: testinprod/op-erigon:2.60.8-0.7.2

What might be the root cause of this delay? My hardware is fine, running on nvme disks with ample IOPS! No CPU/RAM pressure as well.

mininny commented 1 month ago

@cshintov Could you share more logs around the errored line? Setting console.log.verbosity=4 would help as well

cshintov commented 1 month ago

I've increased it from 5s to 30s and I had added a few more debug logs.. the problem seems to be it's always in execution.

[WARN] [10-22|07:56:10.618] [yyy] InsertBlocksAndWait: execution status busy while inserting blocks
[INFO] [10-22|07:56:10.719] [yyy] InsertBlocksAndWait                response=result:Busy err=nil
[WARN] [10-22|07:56:10.719] [yyy] InsertBlocksAndWait: execution status busy while inserting blocks
[INFO] [10-22|07:56:10.820] [yyy] InsertBlocksAndWait                response=result:Busy err=nil
[WARN] [10-22|07:56:10.820] [yyy] InsertBlocksAndWait: execution status busy while inserting blocks
[WARN] [10-22|07:56:10.899] [EngineBlockDownloader] Could not insert headers and bodies err="context deadline exceeded"
[INFO] [10-22|07:56:11.701] [EngineBlockDownloader] Downloading PoS headers... hash=0xb30d8aea84b40d5bd2bed3a7a142f9e404fa17be826519986af7f7f51e9e24c9 requestId=0
[INFO] [10-22|07:56:17.130] [11/12 TxLookup] ETL [2/2] Loading       into=BlockTransactionLookup current_prefix=bf08b135

I'm trying to run both op-mainnet and base-mainnet on the same machine as containers.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.62    0.00    2.55    2.14    0.00   94.70

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
loop0            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop1            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop10           0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop11           0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop12           0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop2            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop3            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop4            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop5            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop6            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop7            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop8            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop9            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
md2             17.60      0.30     0.00   0.00    0.27    17.18  217.80      0.85     0.00   0.00    0.09     4.00    0.00      0.00     0.00   0.00    0.00     0.00    0.02   4.56
nvme0n1       1016.60     63.53     0.00   0.00    0.27    63.99    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.27  84.56
nvme1n1       1053.20     65.82     0.00   0.00    0.26    63.99    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.28  84.96
nvme2n1       1043.80     65.22     0.00   0.00    0.26    63.98    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.28  84.56
nvme3n1       1126.40     70.38     0.00   0.00    0.26    63.99    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.29  85.04
nvme4n1       1087.00     67.92     0.00   0.00    0.26    63.99    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.29  84.48
nvme5n1       1012.40     63.26     0.00   0.00    0.27    63.98    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.27  83.52
sda              7.00      0.12     0.00   0.00    0.26    17.14  196.80      0.86    73.40  27.17    0.10     4.46    0.00      0.00     0.00   0.00    0.00     0.00    0.03   4.80
sdb             10.60      0.18     0.00   0.00    0.26    17.21  196.40      0.86    73.40  27.21    0.10     4.46    0.00      0.00     0.00   0.00    0.00     0.00    0.03   4.96

It's read heavy and utilization is around 90% always!

mininny commented 1 month ago

@cshintov Could you please clarify what you mean by increased it from 5s to 30s? Also please create another github issue so that we can discuss in a new thread.

If the disk utilization is high, then it might mean that the disk does not have enough capacity to handle both op-mainnet and base-mainnet simultaneously. Also, please provide more logs, preferably during 30-60 sec timespan.

cshintov commented 1 month ago

https://github.com/testinprod-io/op-erigon/blob/7c31b71e9e5dfd3433dcd8b97759cec0985d2932/turbo/execution/eth1/eth1_chain_reader.go/chain_reader.go#L36

FCU timeout increased from 1s to 5s earlier. For me it was timing out at 5s. So I increased it to 30s to solve it. That solved this issue.

But I'm still seeing timeouts in other parts. Will share the details.