Closed: Aracki closed this issue 5 months ago
Our upcoming release v5.2.0 (due in a few weeks) should help with this a little, as it makes several caches less likely to miss.
It looks like what happened here is:
v5.2.0 will prevent the initial disruption at (3) because it doesn't lose cache entries when blocks fail to be processed (for any reason, including timeouts).
Until v5.2.0 is released, you can try to make your node more resilient to slow periods using `--disable-lock-timeouts` and `--execution-timeout-multiplier 5`. This won't fix the root cause, though, which seems to be a slightly resource-constrained machine. What hardware are you running on?
Thanks for the response. We managed to fix the problem by removing the custom `GOMAXPROCS` setting (it was set to 2) and increasing the memory limit from 24Gi to 32Gi. The VM running geth & lighthouse has 16 CPUs in total.
Description
Every ~7h we see lags that always last around 30-45 min. During that time we notice spikes in:
During the same period the number of peers with "Advanced" status rises and the number with "Synced" status drops. This has started happening more often in the last 10-15 days.
Also, during these spikes we see more beacon_processor_work_events_rx_count events for these types:

Version
sigp/lighthouse:v5.1.3
together with ethereum/client-go:v1.13.14
Present Behaviour
During the lags we have these logs:
Expected Behaviour
No lag at all, or more information in WARN/CRIT logs.
Steps to resolve
Just wait.