hyperledger / besu

Healing flat-db takes node offline for an extended period (48h+) #6543

Open p4fg opened 7 months ago

p4fg commented 7 months ago

Description

As a validator, I want it documented that flat database healing can take a very long time and incur downtime on mainnet, so that I can switch my beacon node over before starting the healing.

Using 24.1.1, I enabled --Xsnapsync-synchronizer-flat-db-healing-enabled=true and followed the instructions in https://github.com/hyperledger/besu/pull/5319.
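An illustrative invocation (the config-file path is a placeholder; only the experimental flag itself comes from the PR above):

besu --config-file=/etc/besu/config.toml \
     --Xsnapsync-synchronizer-flat-db-healing-enabled=true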

I called the RPC endpoint on a fully synced Besu node in order to trigger the healing (as per the instructions in the PR):

curl --location --request POST 'http://localhost:8545' \
--header 'Content-Type: application/json' \
--data-raw '{
    "jsonrpc": "2.0",
    "method": "debug_resyncWorldState",
    "params": [],
    "id": 1
}' 
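Whether the node is still in sync (or busy healing) can be verified with the standard eth_syncing method on the same endpoint; a fully synced node returns false, otherwise a sync-status object:

curl --location --request POST 'http://localhost:8545' \
--header 'Content-Type: application/json' \
--data-raw '{
    "jsonrpc": "2.0",
    "method": "eth_syncing",
    "params": [],
    "id": 1
}'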

This resulted in the node going offline/out of sync while healing, which was fine for a while. However, after 24 hours of healing it was still not done, and no ETA is presented in the logs:

Feb 08 08:01:07 validator besu[2940849]: 2024-02-08 08:01:07.057+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 337027245 world state trie nodes, Peer count: 25
Feb 08 08:02:07 validator besu[2940849]: 2024-02-08 08:02:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 337861735 world state trie nodes, Peer count: 25
Feb 08 08:03:07 validator besu[2940849]: 2024-02-08 08:03:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 338623741 world state trie nodes, Peer count: 25
Feb 08 08:04:07 validator besu[2940849]: 2024-02-08 08:04:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 339439837 world state trie nodes, Peer count: 25
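No ETA is printed, but a rough healing rate can be derived from consecutive "Healed ..." lines, which appear once per minute. A back-of-the-envelope sketch, assuming Besu runs as a systemd unit named besu (the total number of trie nodes to heal is unknown, so this gives a rate rather than a real ETA):

# count how many trie nodes were healed per minute over the last 10 minutes
journalctl -u besu --since "10 minutes ago" \
  | grep -o 'Healed [0-9]* world state trie nodes' \
  | awk '{print $2}' \
  | awk 'NR==1 {first=$1} {last=$1; n=NR} END {if (n > 1) printf "~%d trie nodes healed per minute\n", (last - first) / (n - 1)}'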

As this is now the default behaviour, it should be documented that healing can take a long time and that downtime is expected, since this will likely kick in for some users (if this is indeed the expected behaviour).

Acceptance Criteria

One or more of:

Steps to Reproduce (Bug)

  1. Use fully synced Besu 24.1.1
  2. Trigger healing via an RPC call to debug_resyncWorldState
  3. Watch the beacon node and validators go offline while healing runs for a very long time

Expected behavior:

Actual behavior:

Frequency: N/A

Logs (if a bug)

Besu

Feb 08 08:05:07 validator besu[2940849]: 2024-02-08 08:05:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 340216439 world state trie nodes, Peer count: 25
Feb 08 08:06:07 validator besu[2940849]: 2024-02-08 08:06:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 341065952 world state trie nodes, Peer count: 25
Feb 08 08:07:07 validator besu[2940849]: 2024-02-08 08:07:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 341834726 world state trie nodes, Peer count: 25
Feb 08 08:08:07 validator besu[2940849]: 2024-02-08 08:08:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 342700881 world state trie nodes, Peer count: 25
Feb 08 08:09:07 validator besu[2940849]: 2024-02-08 08:09:07.061+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 343455446 world state trie nodes, Peer count: 25

Lighthouse:

Feb 08 08:09:29 validator lighthouse[2941077]: Feb 08 08:09:29.001 WARN Head is optimistic                      execution_block_hash: 0x712a937bd66dbb74e4c02fb96730a696fdc774dbfdbae31eb81ce905a90066ef, info: chain not fully verified, block and attestation production disabled until execution engine syncs, service: slot_notifier
Feb 08 08:09:29 validator lighthouse[2941077]: Feb 08 08:09:29.001 INFO Synced                                  slot: 8379645, block: 0x6208…772f, epoch: 261863, finalized_epoch: 261861, finalized_root: 0x4c73…912b, exec_hash: 0x712a…66ef (unverified), peers: 86, service: slot_notifier
Feb 08 08:09:29 validator lighthouse[2941077]: Feb 08 08:09:29.001 WARN Syncing deposit contract block cache    est_blocks_remaining: 9971, service: slot_notifier
Feb 08 08:09:36 validator lighthouse[2941077]: Feb 08 08:09:36.611 INFO New block received                      root: 0x7f87a6bb0d18bc933224038f6c6be42dda1cd1fc7426a1daeebe945d9be41038, slot: 8379646
Feb 08 08:09:38 validator lighthouse[2941077]: Feb 08 08:09:38.188 WARN Execution endpoint is not synced        last_seen_block_unix_timestamp: 0, endpoint: http://127.0.0.1:8551/, auth=true, service: deposit_contract_rpc
Feb 08 08:09:38 validator lighthouse[2941077]: Feb 08 08:09:38.188 ERRO Error updating deposit contract cache   error: Failed to get remote head and new block ranges: EndpointError(FarBehind), retry_millis: 60000, service: deposit_contract_rpc
Feb 08 08:09:41 validator lighthouse[2941077]: Feb 08 08:09:41.000 WARN Head is optimistic                      execution_block_hash: 0xb7ba2bfdb522f9472542ecbbe1b643a3ae16b9b15be380bf76d6de2e1c3ff01d, info: chain not fully verified, block and attestation production disabled until execution engine syncs, service: slot_notifier
Feb 08 08:09:41 validator lighthouse[2941077]: Feb 08 08:09:41.000 INFO Synced                                  slot: 8379646, block: 0x7f87…1038, epoch: 261863, finalized_epoch: 261861, finalized_root: 0x4c73…912b, exec_hash: 0xb7ba…f01d (unverified), peers: 86, service: slot_notifier
Feb 08 08:09:41 validator lighthouse[2941077]: Feb 08 08:09:41.000 WARN Syncing deposit contract block cache    est_blocks_remaining: 9971, service: slot_notifier
Feb 08 08:09:48 validator lighthouse[2941077]: Feb 08 08:09:48.320 INFO New block received                      root: 0xc0109ff3a900218a7754dbda5512b71514a9050d1685b2b2e7c6c0fe4298a9e6, slot: 8379647
Feb 08 08:09:53 validator lighthouse[2941077]: Feb 08 08:09:53.000 WARN Head is optimistic                      execution_block_hash: 0x43531a956d77b857df1a49500612cbdeeaa3f338947c0c4fd5cc7b2d31fefbfa, info: chain not fully verified, block and attestation production disabled until execution engine syncs, service: slot_notifier
Feb 08 08:09:53 validator lighthouse[2941077]: Feb 08 08:09:53.001 INFO Synced                                  slot: 8379647, block: 0xc010…a9e6, epoch: 261863, finalized_epoch: 261861, finalized_root: 0x4c73…912b, exec_hash: 0x4353…fbfa (unverified), peers: 86, service: slot_notifier
Feb 08 08:09:53 validator lighthouse[2941077]: Feb 08 08:09:53.001 WARN Syncing deposit contract block cache    est_blocks_remaining: 9971, service: slot_notifier

Versions (Add all that apply)

Additional Information

matkt commented 7 months ago

Sometimes this debug_resyncWorldState call does not run correctly and triggers a full heal instead of a correct snap sync; it's an issue we need to fix. I think the best option is to resync from scratch with the flat healing flag enabled, that should be fine. It's not an issue with the flat DB healing but rather with debug_resyncWorldState.
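For completeness, a resync from scratch roughly means stopping the node, removing the existing database, and starting again with the healing flag enabled. A sketch assuming a systemd-managed node and a hypothetical data directory of /var/lib/besu (adjust paths and unit name to your setup):

sudo systemctl stop besu
# removing the database directory forces Besu to snap sync from scratch on next start
rm -rf /var/lib/besu/database
sudo systemctl start besu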