sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

Lighthouse 5.1.3 lagging behind peers #5736

Closed Aracki closed 5 months ago

Aracki commented 5 months ago

Description

Every ~7 hours we have a lag that always lasts around 30-45 minutes. During that time we notice spikes in several metrics (see the screenshots below).

During the same time the number of peers with "Advanced" status rises and the number of "Synced" peers goes down. This started happening more often in the last 10-15 days.

[metrics screenshots attached]

Also, during these spikes we see an increased beacon_processor_work_events_rx_count for certain work types.
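In case it helps with reproducing the observation: assuming the Lighthouse metrics endpoint is enabled (--metrics, default port 5054), the raw per-type counters can be inspected with something like:

    curl -s http://localhost:5054/metrics | grep beacon_processor_work_events_rx_count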

Version

sigp/lighthouse:v5.1.3 together with ethereum/client-go:v1.13.14

Present Behaviour

During the lags we have these logs:

CRIT Beacon block processing error error: ValidatorPubkeyCacheLockTimeout, service: beacon
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 07:11:43.465 WARN BlockProcessingFailure outcome: ValidatorPubkeyCacheLockTimeout, msg: unexpected condition in processing block.
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 07:52:58.710 WARN Block arrived early msg: if this happens consistently, check system clock, block_root: 0x130d0c1cdb6726349bbb897279b5ba72f12e2db7f286b06b029ead82c1b31a30, block_slot: 9020363
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 09:01:38.960 WARN Beacon chain re-org reorg_distance: 1, new_slot: 9020706, new_head: 0x5c3629d8b4abdd3aadc6b7540501191d39c4d5d0b356e23d8f72b68fe6dbaa78, previous_slot: 9020705, previous_head: 0xe40f2ad93c172960f921f66ff0d6df58e13505bacf8bebb76b9d185ebd5a66d4, service: beacon
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 10:41:32.157 WARN Did not advance head state reason: Err(HeadMissingFromSnapshotCache(0x578b5e58ccfc91e915e6952e42418bf0103a1668f0007ac1a1e08bbbf1e8ba19)), service: state_advance
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 10:41:43.839 WARN Beacon chain re-org reorg_distance: 1, new_slot: 9021206, new_head: 0x7bb5df80bb4c5bde3d969c54d0f92fdc8f2925e26892d9d8788123941976922f, previous_slot: 9021205, previous_head: 0x4898da12ad2515ab7e9c88756a7998414618d45b09b262602bd7606853c35936, service: beacon
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 11:15:44.002 WARN Did not advance head state reason: Err(HeadMissingFromSnapshotCache(0x44b3bb739c08ffba256dcd47b685aa4d7b65d1eafe503fe5909a19023a9bde85)), service: state_advance
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 13:48:56.092 WARN Did not advance head state reason: Err(HeadMissingFromSnapshotCache(0xee3522ec45b08940923a62ab7150172e08cbbf513d34d3d7d1ab5867b3688840)), service: state_advance
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 13:48:56.989 WARN Execution engine call failed error: HttpClient(url: http://127.0.0.1:8551/, kind: timeout, detail: operation timed out), service: exec
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 13:48:56.989 WARN Error whilst processing payload status error: Api { error: HttpClient(url: http://127.0.0.1:8551/, kind: timeout, detail: operation timed out) }, service: exec
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 13:49:08.001 WARN Did not advance head state reason: Err(HeadMissingFromSnapshotCache(0xee3522ec45b08940923a62ab7150172e08cbbf513d34d3d7d1ab5867b3688840)), service: state_advance
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 13:49:17.198 WARN Execution engine call failed error: HttpClient(url: http://127.0.0.1:8551/, kind: timeout, detail: operation timed out), service: exec
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 13:49:17.206 WARN Error whilst processing payload status error: Api { error: HttpClient(url: http://127.0.0.1:8551/, kind: timeout, detail: operation timed out) }, service: exec
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 13:49:20.000 WARN Did not advance head state reason: Err(HeadMissingFromSnapshotCache(0xb774a3fcf26ced28e1697ff7f95fb58dc87b77e8f839c06e60639a690790fa96)), service: state_advance
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 13:49:29.139 WARN Execution engine call failed error: HttpClient(url: http://127.0.0.1:8551/, kind: timeout, detail: operation timed out), service: exec
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 13:49:29.139 WARN Error whilst processing payload status error: Api { error: HttpClient(url: http://127.0.0.1:8551/, kind: timeout, detail: operation timed out) }, service: exec
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 13:49:30.693 WARN Execution engine call failed error: HttpClient(url: http://127.0.0.1:8551/, kind: timeout, detail: operation timed out), service: exec
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 13:49:30.693 WARN Error whilst processing payload status error: Api { error: HttpClient(url: http://127.0.0.1:8551/, kind: timeout, detail: operation timed out) }, service: exec
node-ethereum-mainnet-1 node-ethereum-mainnet-lighthouse May 07 13:49:30.708 CRIT Failed to update execution head error: ExecutionForkChoiceUpdateFailed(EngineError(Api { error: HttpClient(url: http://127.0.0.1:8551/, kind: timeout, detail: operation timed out) })), service: beacon

Expected Behaviour

No lag at all, or at least more information in the WARN/CRIT logs.

Steps to resolve

Just wait.

michaelsproul commented 5 months ago

Our upcoming release v5.2.0 (due in a few weeks) should help with this a little, as it makes several caches less likely to miss.

It looks like what happened here is:

  1. Your node was already close to capacity in terms of CPU and RAM usage.
  2. An early block at slot 9020363 triggered a timeout on the pubkey cache.
  3. The timeout on the cache caused block processing to fail, and the snapshot cache to be flushed.
  4. The cache misses created more work which caused more timeouts and failures (a bit of a feedback loop/death spiral).
  5. Eventually things recover until it happens again.

v5.2.0 will prevent the initial disruption at (3) because it no longer drops cache entries when block processing fails (for any reason, including timeouts).
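For anyone curious about the mechanism, here is a minimal, purely illustrative Rust sketch (not the actual Lighthouse code; the names and locking primitive are simplified) of the timed-lock pattern that surfaces as ValidatorPubkeyCacheLockTimeout when the node is overloaded:

    use std::sync::{Mutex, MutexGuard};
    use std::thread;
    use std::time::{Duration, Instant};

    // Hypothetical error mirroring the failure mode seen in the logs above.
    #[derive(Debug)]
    enum BlockError {
        PubkeyCacheLockTimeout,
    }

    struct PubkeyCache; // contents omitted

    // Try to take the cache lock, but give up once a deadline passes so one
    // slow consumer cannot stall block processing forever. On an overloaded
    // node this times out, block processing fails, and (pre-v5.2.0) cached
    // state is dropped, creating even more work -- the feedback loop
    // described in the steps above.
    fn lock_with_timeout(
        cache: &Mutex<PubkeyCache>,
        timeout: Duration,
    ) -> Result<MutexGuard<'_, PubkeyCache>, BlockError> {
        let deadline = Instant::now() + timeout;
        loop {
            if let Ok(guard) = cache.try_lock() {
                return Ok(guard);
            }
            if Instant::now() >= deadline {
                return Err(BlockError::PubkeyCacheLockTimeout);
            }
            thread::sleep(Duration::from_millis(10));
        }
    }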

Until 5.2 is released, you can try to make your node more resilient to slow periods using --disable-lock-timeouts and --execution-timeout-multiplier 5 (example invocation below). This won't fix the root cause though, which seems to be a slightly resource-constrained machine. What hardware are you running on?
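As a sketch of what that might look like, assuming the beacon node is started via the lighthouse bn subcommand (all other flags as in your existing setup; the engine endpoint matches the 127.0.0.1:8551 URL in your logs):

    lighthouse bn \
      --network mainnet \
      --execution-endpoint http://127.0.0.1:8551 \
      --disable-lock-timeouts \
      --execution-timeout-multiplier 5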

Aracki commented 5 months ago

Thanks for the response. We managed to fix the problem by removing the custom GOMAXPROCS setting (which was set to 2) and increasing the memory limit from 24Gi to 32Gi. The VM running geth & lighthouse has 16 CPUs in total.