sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

Performance Degradation and OOM Events During API Calls to Update Validators #4936

Open screwyprof opened 11 months ago

screwyprof commented 11 months ago

Description

Our Lighthouse instance suffers a severe performance degradation when we run a batch job to update multiple validators in our local development environment. The degradation hinders Lighthouse in fulfilling its validator duties and, in many cases, ends with the process being killed by the OOM killer.

Version

Present Behaviour

Lighthouse suffers significant performance degradation when executing the batch job to update multiple validators in the local development environment. The symptoms include:

Attempts to capture a CPU profile pointed towards slog, indicating potential performance bottlenecks related to logging.

Expected Behaviour

Lighthouse should update multiple validators without notable performance degradation, and the process should not be killed by the OOM killer.

Steps to resolve

Efforts to address the issue involved:

michaelsproul commented 11 months ago

@screwyprof Which API or command are you running exactly? Is it a bunch of calls to PATCH /lighthouse/validators/:voting_pubkey, or something else?

screwyprof commented 11 months ago

Hey @michaelsproul, yes exactly that one.

screwyprof commented 11 months ago

Also, if it helps, we are using web3signer definitions, and I can see slog errors at bootstrap when loading a lot of definitions.

michaelsproul commented 11 months ago

Thanks @screwyprof. Just a couple more questions to help us narrow it down and reproduce:

  1. How many concurrent requests do you make at any given time? Do you have a thread pool of a fixed size or anything that would constrain the number of parallel requests?
  2. How many total requests do you make per batch/period, e.g. 500 in 12s?
screwyprof commented 11 months ago

Hey @michaelsproul.

  1. We don’t make any concurrent requests; they run sequentially.
  2. I was able to reproduce the problem with a bash script that called the API endpoint to disable the same validator in a loop (a rough sketch is below). The problem occurred at random: sometimes after about 300 iterations in total, sometimes earlier, sometimes after 500.

In our prod environment we would expect to update 2,500 validators on average, with peaks of 5,000, once per epoch.
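
Roughly, the loop looked like this (sketched here in Rust rather than the original bash; the pubkey, API address, and token path are placeholders, and this is not our exact script):

```rust
// Rough equivalent of the repro loop against PATCH /lighthouse/validators/:voting_pubkey.
// Assumptions: VC API at http://localhost:5062, API token in `api-token.txt`,
// hypothetical pubkey. Cargo deps assumed: reqwest (blocking + json features), serde_json.
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let token = fs::read_to_string("api-token.txt")?.trim().to_string();
    let pubkey = "0xabcd..."; // hypothetical voting pubkey
    let url = format!("http://localhost:5062/lighthouse/validators/{pubkey}");
    let client = reqwest::blocking::Client::new();

    // Repeatedly disable the same validator; in our environment the slowdown/OOM
    // showed up somewhere between roughly 300 and 500 iterations.
    for i in 0..500 {
        let resp = client
            .patch(url.as_str())
            .bearer_auth(&token)
            .json(&serde_json::json!({ "enabled": false }))
            .send()?;
        println!("iteration {i}: {}", resp.status());
    }
    Ok(())
}
```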

jmcruz1983 commented 9 months ago

Hey @michaelsproul any update for this matter?

michaelsproul commented 9 months ago

Hey @jmcruz1983 haven't had a chance to repro yet, we're pushing a (big) v4.6.0-rc.0 release at the moment. Will try to get some time this week

michaelsproul commented 9 months ago

I had a bit of time to try this today, spun up a VC with 500 inactive keys and spammed 1000x PATCH messages at it. No luck. Everything was fine. The highest memory consumption was 300MB.

I'll try tomorrow on a VC with active keys, and failing that, a VC with active web3signer keys (I don't think it's likely to be specific to web3signer, but it could be).

michaelsproul commented 9 months ago

I didn't manage to repro the OOM, but I can see CPU usage spiking on a validator with 1000 keys. I've started optimising the "no-op" PATCH requests, because they should be easy to catch. Am I correct in thinking that a substantial number of PATCH requests you make are redundant, i.e. should already be reflected in the VC's state? e.g. disabling a validator that is already disabled, setting the fee recipient to the same value that it's already set to.

If so, the no-op optimisations will help (I'll have a branch for you soon). If not, we'll need to do some more involved refactoring and optimising of the key cache, which is a bit of a beast.
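
To illustrate the general idea of the no-op short-circuit (a deliberately simplified sketch with hypothetical types, not the actual code in the branch):

```rust
// Simplified sketch of the "no-op PATCH" idea: compare the requested values against
// the validator's current settings and return early when nothing would change, so a
// redundant request never touches the key cache or rewrites the definitions file.
// All types here are hypothetical.

#[derive(Debug, Clone, PartialEq)]
struct ValidatorSettings {
    enabled: bool,
    gas_limit: Option<u64>,
    builder_proposals: Option<bool>,
}

/// Fields a PATCH request may carry; `None` means "leave unchanged".
#[derive(Debug, Default)]
struct PatchRequest {
    enabled: Option<bool>,
    gas_limit: Option<u64>,
    builder_proposals: Option<bool>,
}

/// Apply a PATCH, returning `true` only if anything actually changed.
fn apply_patch(current: &mut ValidatorSettings, req: &PatchRequest) -> bool {
    let mut updated = current.clone();
    if let Some(enabled) = req.enabled {
        updated.enabled = enabled;
    }
    if let Some(gas_limit) = req.gas_limit {
        updated.gas_limit = Some(gas_limit);
    }
    if let Some(bp) = req.builder_proposals {
        updated.builder_proposals = Some(bp);
    }
    if updated == *current {
        // No-op: skip the expensive persistence / key-cache work entirely.
        return false;
    }
    *current = updated;
    true
}

fn main() {
    let mut state = ValidatorSettings { enabled: false, gas_limit: None, builder_proposals: None };
    let req = PatchRequest { enabled: Some(false), ..Default::default() };
    // Disabling an already-disabled validator is detected as a no-op.
    assert!(!apply_patch(&mut state, &req));
    println!("no-op detected, nothing persisted");
}
```

The real change is more involved than this, but the intent is the same: redundant PATCH requests should return quickly instead of doing the full update work.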

screwyprof commented 9 months ago

Hi @michaelsproul,

Am I correct in thinking that a substantial number of PATCH requests you make are redundant, i.e. should already be reflected in the VC's state? e.g. disabling a validator that is already disabled, setting the fee recipient to the same value that it's already set to.

Yes, this could well be the case. The problem is that Lighthouse does not provide a GET endpoint to fetch the builder relay state in batches (as Teku does, for example), which would let us send only the changed values.

michaelsproul commented 9 months ago

@screwyprof Great, this patch should help then: https://github.com/sigp/lighthouse/pull/5064

If you could run that for a bit on your testnet infra and let us know how it goes, that would be really helpful.

Longer term we'll also look at optimising the API further. We could also add a GET endpoint, although the load on the VC after this patch should be pretty similar to the load required to serve a GET endpoint.
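
For illustration, a client-side version of that filtering is sketched below. It assumes the existing `GET /lighthouse/validators` listing (which, as far as I know, reports each validator's enabled flag) and the usual API token; builder/fee-recipient state would still need a new GET endpoint. Names and values are placeholders.

```rust
// Sketch: fetch current enabled state from GET /lighthouse/validators and only send
// a PATCH when the desired value differs. Assumptions: default VC API address, token
// in `api-token.txt`. Cargo deps assumed: reqwest (blocking + json), serde (derive),
// serde_json.
use std::collections::HashMap;
use std::fs;

#[derive(serde::Deserialize)]
struct ValidatorInfo {
    voting_pubkey: String,
    enabled: bool,
}

#[derive(serde::Deserialize)]
struct ValidatorsResponse {
    data: Vec<ValidatorInfo>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let token = fs::read_to_string("api-token.txt")?.trim().to_string();
    let base = "http://localhost:5062";
    let client = reqwest::blocking::Client::new();

    // Current state as known by the VC.
    let current: ValidatorsResponse = client
        .get(format!("{base}/lighthouse/validators"))
        .bearer_auth(&token)
        .send()?
        .json()?;
    let current: HashMap<_, _> = current
        .data
        .into_iter()
        .map(|v| (v.voting_pubkey, v.enabled))
        .collect();

    // Desired state, e.g. produced by the batch job (hypothetical example).
    let desired: HashMap<String, bool> =
        HashMap::from([("0xabcd...".to_string(), false)]);

    for (pubkey, want_enabled) in desired {
        if current.get(&pubkey) == Some(&want_enabled) {
            continue; // already in the desired state, skip the PATCH
        }
        client
            .patch(format!("{base}/lighthouse/validators/{pubkey}"))
            .bearer_auth(&token)
            .json(&serde_json::json!({ "enabled": want_enabled }))
            .send()?;
    }
    Ok(())
}
```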