status-im / nimbus-eth1

Nimbus: an Ethereum Execution Client for Resource-Restricted Devices
https://status-im.github.io/nimbus-eth1/
Apache License 2.0
585 stars 119 forks source link

Long running sync process blocks or shuts down RPC/engine API connection after communicating with CL #2816

Open mjfh opened 2 weeks ago

mjfh commented 2 weeks ago

Observation:

An EL client on Mainnet was started as

     nimbus_execution_client
       --data-dir:./data
       --tcp-port:30325
       --metrics-port=9099
       --metrics-address=172.16.210.1
       --metrics
       --engine-api=true
       --engine-api-port=8551
       --engine-api-ws=true
       --jwt-secret=/status/tmp/jwtsecret
       --network=mainnet
       --log-level:TRACE

at 2024-10-31 12:51:06.497. On the same machine a CL beacon node has been run as

    nimbus_beacon_node
      --network=mainnet
      --data-dir=build/data/shared_mainnet_0
      --tcp-port=9000
      --udp-port=9000
      --rest
      --rest-port=5052
      --metrics
      --web3-url=http://127.0.0.1:8551
      --jwt-secret=/status/tmp/jwtsecret

The EL client recieved Forkchoice updates which invoked the beacon sync logic. These updates come with log messages like

    DBG 2024-10-31 12:54:04.803 Received JSON-RPC request                  topics="JSONRPC-HTTP-SERVER" file=httpserver.nim:52 address=127.0.0.1:52670 len=342
    DBG 2024-10-31 12:54:04.803 Processing JSON-RPC request                file=router.nim:135 id=104 name=engine_forkchoiceUpdatedV3
    INF 2024-10-31 12:54:04.803 Forkchoice requested sync to new head      file=api_forkchoice.nim:107 number=21085479 hash=6ffc9828c615
    DBG 2024-10-31 12:54:04.803 Returning JSON-RPC response                file=router.nim:137 id=104 name=engine_forkchoiceUpdatedV3 len=38
    DBG 2024-10-31 12:54:04.803 Received JSON-RPC request                  topics="JSONRPC-HTTP-SERVER" file=httpserver.nim:52 address=127.0.0.1:52676 len=241
    DBG 2024-10-31 12:54:04.803 Processing JSON-RPC request                file=router.nim:135 id=105 name=eth_getLogs
    DBG 2024-10-31 12:54:05.283 Returning JSON-RPC response                file=router.nim:137 id=105 name=eth_getLogs len=2

The last such message block was recorded as

    DBG 2024-10-31 21:15:52.257 Processing JSON-RPC request                file=router.nim:135 id=1067 name=engine_forkchoiceUpdatedV3
    WRN 2024-10-31 21:15:52.257 Forkchoice requested unknown head          file=api_forkchoice.nim:102 hash=d4e78b0217c7
    DBG 2024-10-31 21:15:52.257 Returning JSON-RPC response                file=router.nim:137 id=1067 name=engine_forkchoiceUpdatedV3 len=38
    DBG 2024-10-31 21:15:52.257 Internal error while processing JSON-RPC call topics="JSONRPC-HTTP-SERVER" file=httpserver.nim:84
    DBG 2024-10-31 21:15:52.257 Received JSON-RPC request                  topics="JSONRPC-HTTP-SERVER" file=httpserver.nim:52 address=127.0.0.1:58212 len=138590
    DBG 2024-10-31 21:15:52.261 Processing JSON-RPC request                file=router.nim:135 id=1059 name=engine_newPayloadV3
    TRC 2024-10-31 21:15:52.261 Engine API request received                file=api_newpayload.nim:93 meth=newPayload number=21088244 hash=0x22ed6bcaba7bbcd00cc4ca901d920199745a37eae01401a3df5b55b6381c8a86
    DBG 2024-10-31 21:15:52.261 Returning JSON-RPC response                file=router.nim:137 id=1059 name=engine_newPayloadV3 len=20
    DBG 2024-10-31 21:15:52.261 Internal error while processing JSON-RPC call topics="JSONRPC-HTTP-SERVER" file=httpserver.nim:84
    DBG 2024-10-31 21:15:52.261 Received JSON-RPC request                  topics="JSONRPC-HTTP-SERVER" file=httpserver.nim:52 address=127.0.0.1:39352 len=144292
    DBG 2024-10-31 21:15:52.261 Processing JSON-RPC request                file=router.nim:135 id=1061 name=engine_newPayloadV3
    TRC 2024-10-31 21:15:52.265 Engine API request received                file=api_newpayload.nim:93 meth=newPayload number=21088245 hash=0xb18e08cad42e05c1c07becac1fbc4cf764b967c33d75d384c75dab2de85e506d
    DBG 2024-10-31 21:15:52.265 Returning JSON-RPC response                file=router.nim:137 id=1061 name=engine_newPayloadV3 len=20
    DBG 2024-10-31 21:15:52.265 Internal error while processing JSON-RPC call topics="JSONRPC-HTTP-SERVER" file=httpserver.nim:84
    DBG 2024-10-31 21:15:52.265 Received JSON-RPC request                  topics="JSONRPC-HTTP-SERVER" file=httpserver.nim:52 address=127.0.0.1:36930 len=148497
    DBG 2024-10-31 21:15:52.265 Processing JSON-RPC request                file=router.nim:135 id=1064 name=engine_newPayloadV3
    TRC 2024-10-31 21:15:52.265 Engine API request received                file=api_newpayload.nim:93 meth=newPayload number=21088246 hash=0xd4e78b9326d90e57d18ee521acb7117dbaca21bfc1d013748602b5f6b20217c7
    DBG 2024-10-31 21:15:52.269 Returning JSON-RPC response                file=router.nim:137 id=1064 name=engine_newPayloadV3 len=20
    DBG 2024-10-31 21:15:52.269 Internal error while processing JSON-RPC call topics="JSONRPC-HTTP-SERVER" file=httpserver.nim:84
    DBG 2024-10-31 21:15:52.269 Received JSON-RPC request                  topics="JSONRPC-HTTP-SERVER" file=httpserver.nim:52 address=127.0.0.1:54714 len=108015
    DBG 2024-10-31 21:15:52.269 Processing JSON-RPC request                file=router.nim:135 id=1066 name=engine_newPayloadV3
    TRC 2024-10-31 21:15:52.269 Engine API request received                file=api_newpayload.nim:93 meth=newPayload number=21088247 hash=0xd3cbf11e754f933b3ae91f31bce5afbad86e99046d2571b1fdb7b95e29c213bc
    DBG 2024-10-31 21:15:52.269 Returning JSON-RPC response                file=router.nim:137 id=1066 name=engine_newPayloadV3 len=20
    DBG 2024-10-31 21:15:52.269 Internal error while processing JSON-RPC call topics="JSONRPC-HTTP-SERVER" file=httpserver.nim:84

After that time, no message with the text JSON or Forkchoice can be found until time stamp 2024-11-01 09:31:40.929 where this log ends.

Additional information:

The syncer has stalled after finishing the last job which stopped some hours later with the message

    TRC 2024-11-01 03:05:46.920 RunDaemon: hibernating, awaiting new sync target topics="beacon sync" file=update.nim:158 L=H H=#21085479

and has stalled since (the PR branch with exactly this message might not be available on GitHub yet.)

The CL beacon node repeatedly shows messages like

    INF 2024-11-01 09:34:14.016 No execution client connected; cannot process block payloads topics="gossip_blocks" executionPayload=[..]
    INF 2024-11-01 09:34:21.650+00:00 Database checkpointed                      topics="beacnde" dur=566ms533us207ns
    INF 2024-11-01 09:34:21.650+00:00 Slot end                                   topics="beacnde" slot=10302469 nextActionWait=n/a [..]
    INF 2024-11-01 09:34:23.000+00:00 Slot start                                 topics="beacnde" head=ea64550b:10302469 delay=557us907ns finalized=[..]
    WRN 2024-11-01 09:34:23.464+00:00 Failed to exchange configuration with the configured EL end-points topics="elman" completed=0 failed=0 timed_out=1
    INF 2024-11-01 09:34:25.010+00:00 No execution client connected; cannot process block payloads topics="gossip_blocks" executionPayload=[..]
    INF 2024-11-01 09:34:33.739+00:00 Database checkpointed                      topics="beacnde" dur=657ms387us200ns
    INF 2024-11-01 09:34:33.739+00:00 Slot end                                   topics="beacnde" slot=10302470 nextActionWait=n/a nextAttestationSlot=-1 [..]
    INF 2024-11-01 09:34:35.000+00:00 Slot start                                 topics="beacnde" head=4268cf34:10302470 delay=21us116ns finalized=[..]

The sudo netstat -anp|grep 8551 command on Debian shows something like

    tcp    101   0 127.0.0.1:8551     0.0.0.0:*         LISTEN      2598712/nimbus_exec
    tcp    639   0 127.0.0.1:8551     127.0.0.1:58578   CLOSE_WAIT  - 
    tcp  81524   0 127.0.0.1:8551     127.0.0.1:47708   ESTABLISHED -
    tcp    383   0 127.0.0.1:8551     127.0.0.1:41966   CLOSE_WAIT  -
    tcp    639   0 127.0.0.1:8551     127.0.0.1:41724   CLOSE_WAIT  -
    tcp  81524   0 127.0.0.1:8551     127.0.0.1:57700   ESTABLISHED -
    tcp  81524   0 127.0.0.1:8551     127.0.0.1:48442   ESTABLISHED -
    [..]

with one LISTEN entry, 36 CLOSE_WAIT entries, and 18 ESTABLISED entries.

The peer pool running RLPx and Discovery is still actively communicating.

This phenomeonen has been observed for while. The only successful remedy so far was restarting the EL client. Restarting rhe CL node had no effect. In the past there were massive JWT problems observed which is not the case anymore.

casehub2 commented 3 hours ago

Contact the Support page to report your request, initiate a chat with the live agent on the chat button to get more information about your request via Live Chat @mjfh