Performance Degradation and OOM Events During API Calls to Update Validators

screwyprof commented 11 months ago

Description

Our Lighthouse instance grapples with a critical performance downturn when executing a batch job to update multiple validators in the local development environment. This degradation not only hinders Lighthouse's efficiency in fulfilling its validator duties but also, in numerous instances, triggers interventions from the OOM killer.

Version

Rust Version: 1.69.0
Production Version: v4.4.1
Dev Version: Latest unstable (commit: 051c3e84)

Present Behaviour

Lighthouse is hampered by significant performance degradation when executing a batch job to update multiple validators in the local development environment. The symptoms include:

A noticeable slowdown in performance during the batch job execution.
Frequent interventions from the OOM killer, leading to the termination of the Lighthouse process.
Modest but consistent memory consumption growth

Errors in the logs, suggesting potential issues with the logger buffer overflow:

Nov 13 21:16:43.600 ERRO slog-async: logger dropped messages due to channel overflow, count: 7
Nov 13 21:16:43.600 ERRO slog-async: logger dropped messages due to channel overflow, count: 5
Nov 13 21:16:43.600 ERRO slog-async: logger dropped messages due to channel overflow, count: 7

Attempts to capture a CPU profile pointed towards slog, indicating potential performance bottlenecks related to logging.

Expected Behaviour

Lighthouse should seamlessly update multiple validators without succumbing to notable performance degradation. The application's performance should remain optimal, and interventions from the OOM killer should be eliminated.

Steps to resolve

Efforts to address the issue involved:

CPU Profiling: Attempted CPU profiling which highlighted that a significant portion of time is allocated to async-slog, with logging consuming 1 second out of a 10-second profile. This observation suggested potential issues with async-slog and correlated with the errors in the logs, indicating a logger buffer overflow
Memory Profiling: Tried to capture heap profile with heaptrack with no luck.
Memory Monitoring: Lighthouse's memory consumption using the top command, noting modest but consistent growth.
Logger Optimisation: Increased logger buffer size and removed logger calls completely at the problematic endpoint, yet still encountered persistent OOM killer interventions.

michaelsproul commented 11 months ago

@screwyprof Which API or command are you running exactly? Is it a bunch of calls to PATCH /lighthouse/validators/:voting_pubkey, or something else?

screwyprof commented 11 months ago

Hey @michaelsproul, yes exactly that one.

screwyprof commented 11 months ago

Also, if it helps we are using web3signer definitions and I can see slog errors at the bootstrap when loading a lot of definitions.

michaelsproul commented 11 months ago

Thanks @screwyprof. Just a couple more questions to help us narrow it down and reproduce:

How many concurrent requests do you make at any given time? Do you have a thread pool of a fixed size or anything that would constrain the number of parallel requests?
How many total requests do you make per batch/period, e.g. 500 in 12s, etc.

screwyprof commented 11 months ago

Hey @michaelsproul.

We don’t do any concurrent requests, they run sequentially.
I was able to reproduce the problem by crafting a bash script which called the API endpoint to disable the same validator in a loop. The problem occurred randomly sometimes with 300 iterations total, sometimes earlier, sometimes after 500.

in our prod environment we would expect to update on average 2,500 validators with peaks at 5,000 once per epoch.

jmcruz1983 commented 9 months ago

Hey @michaelsproul any update for this matter?

michaelsproul commented 9 months ago

Hey @jmcruz1983 haven't had a chance to repro yet, we're pushing a (big) v4.6.0-rc.0 release at the moment. Will try to get some time this week

michaelsproul commented 9 months ago

I had a bit of time to try this today, spun up a VC with 500 inactive keys and spammed 1000x PATCH messages at it. No luck. Everything was fine. The highest memory consumption was 300MB.

I'll try tomorrow on a VC with active keys, and failing that, a VC with active web3signer keys (I don't think it's likely to be specific to web3signer, but it could be).

michaelsproul commented 9 months ago

I didn't manage to repro the OOM, but I can see CPU usage spiking on a validator with 1000 keys. I've started optimising the "no-op" PATCH requests, because they should be easy to catch. Am I correct in thinking that a substantial number of PATCH requests you make are redundant, i.e. should already be reflected in the VC's state? e.g. disabling a validator that is already disabled, setting the fee recipient to the same value that it's already set to.

If so, the no-op optimisations will help (I'll have a branch for you soon). If not, we'll need to do some more involved refactoring and optimising of the key cache, which is a bit of a beast.

screwyprof commented 9 months ago

Hi @michaelsproul,

Am I correct in thinking that a substantial number of PATCH requests you make are redundant, i.e. should already be reflected in the VC's state? e.g. disabling a validator that is already disabled, setting the fee recipient to the same value that it's already set to.

Yes, this could well be the case. The problem is lighthouse does not provide a GET endpoint to get the state of builder relays in batches (like teku does for example) so that we could only send the changed values.

michaelsproul commented 9 months ago

@screwyprof Great, this patch should help then: https://github.com/sigp/lighthouse/pull/5064

If you could run that for a bit on your testnet infra and let us know how it goes, that would be really helpful.

Longer term we'll also look at optimising the API further. We could also add a GET endpoint, although the load on the VC after this patch should be pretty similar to the load required to serve a get endpoint.

sigp / lighthouse