screwyprof opened this issue 11 months ago (status: Open)
@screwyprof Which API or command are you running exactly? Is it a bunch of calls to `PATCH /lighthouse/validators/:voting_pubkey`, or something else?
Hey @michaelsproul, yes exactly that one.
Also, if it helps, we are using web3signer definitions, and I can see `slog` errors at bootstrap when loading a large number of definitions.
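For reference, a single update in this flow looks roughly like the following. This is a hedged sketch: the base URL and the exact body fields (e.g. `enabled`) are assumptions for illustration, not the confirmed API schema.

```python
import json

VC_URL = "https://localhost:5062"  # hypothetical validator client address


def build_patch_request(voting_pubkey: str, settings: dict) -> tuple[str, str]:
    """Build the URL and JSON body for a single validator PATCH.

    `settings` might contain fields like `enabled`; the exact accepted
    fields are an assumption here, not taken from the Lighthouse docs.
    """
    url = f"{VC_URL}/lighthouse/validators/{voting_pubkey}"
    body = json.dumps(settings, sort_keys=True)
    return url, body


url, body = build_patch_request("0xabc", {"enabled": False})
print(url)
print(body)
```

A batch job then amounts to issuing one such request per validator, which is why the per-request cost matters at the scale discussed below.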
Thanks @screwyprof. Just a couple more questions to help us narrow it down and reproduce:
Hey @michaelsproul.
In our prod environment we would expect to update 2,500 validators on average, with peaks of 5,000, once per epoch.
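For context, that update volume works out to a modest but steady request stream (assuming mainnet timing of 32 slots of 12 seconds per epoch, and that the updates are spread evenly):

```python
# Mainnet epoch length: 32 slots * 12 seconds per slot.
SECONDS_PER_EPOCH = 32 * 12  # 384 s

# Average and peak PATCH rates if requests are spread over the epoch.
avg_rps = 2500 / SECONDS_PER_EPOCH
peak_rps = 5000 / SECONDS_PER_EPOCH
print(f"average: {avg_rps:.1f} req/s, peak: {peak_rps:.1f} req/s")
```

In practice a batch job likely fires the requests in a much shorter burst, so the instantaneous rate the VC sees would be considerably higher than these averages.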
Hey @michaelsproul, any update on this matter?
Hey @jmcruz1983, I haven't had a chance to repro yet; we're pushing a (big) v4.6.0-rc.0 release at the moment. I'll try to get some time this week.
I had a bit of time to try this today, spun up a VC with 500 inactive keys and spammed 1000x PATCH messages at it. No luck. Everything was fine. The highest memory consumption was 300MB.
I'll try tomorrow on a VC with active keys, and failing that, a VC with active web3signer keys (I don't think it's likely to be specific to web3signer, but it could be).
I didn't manage to repro the OOM, but I can see CPU usage spiking on a validator with 1000 keys. I've started optimising the "no-op" PATCH requests, because they should be easy to catch. Am I correct in thinking that a substantial number of PATCH requests you make are redundant, i.e. should already be reflected in the VC's state? e.g. disabling a validator that is already disabled, setting the fee recipient to the same value that it's already set to.
If so, the no-op optimisations will help (I'll have a branch for you soon). If not, we'll need to do some more involved refactoring and optimising of the key cache, which is a bit of a beast.
Hi @michaelsproul,
> Am I correct in thinking that a substantial number of PATCH requests you make are redundant, i.e. should already be reflected in the VC's state? e.g. disabling a validator that is already disabled, setting the fee recipient to the same value that it's already set to.
Yes, this could well be the case. The problem is that Lighthouse does not provide a GET endpoint to fetch the state of builder relays in batches (as Teku does, for example), so we cannot send only the changed values.
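In the absence of a batch GET endpoint, one client-side workaround is to cache the last settings sent per pubkey and skip PATCHes that would be no-ops. A minimal sketch (the pubkeys and the `enabled` field are illustrative; the cache would need to be persisted or rebuilt on restart):

```python
# voting_pubkey -> last settings we successfully PATCHed.
last_sent: dict[str, dict] = {}


def filter_noop_updates(desired: dict[str, dict]) -> dict[str, dict]:
    """Return only the updates whose settings differ from what we last sent."""
    changed = {
        pubkey: settings
        for pubkey, settings in desired.items()
        if last_sent.get(pubkey) != settings
    }
    # Record what we are about to send so the next pass can skip it.
    last_sent.update(changed)
    return changed


# First pass: nothing cached yet, so both updates go through.
print(filter_noop_updates({"0xaa": {"enabled": True}, "0xbb": {"enabled": False}}))
# Second pass: only the validator whose settings changed is sent.
print(filter_noop_updates({"0xaa": {"enabled": True}, "0xbb": {"enabled": True}}))
```

This complements the server-side no-op optimisation in the linked PR: even if the VC handles redundant PATCHes cheaply, not sending them at all avoids the HTTP round trips entirely.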
@screwyprof Great, this patch should help then: https://github.com/sigp/lighthouse/pull/5064
If you could run that for a bit on your testnet infra and let us know how it goes, that would be really helpful.
Longer term we'll also look at optimising the API further. We could also add a GET endpoint, although after this patch the load on the VC should be pretty similar to the load required to serve a GET endpoint.
Description
Our Lighthouse instance suffers a severe performance degradation when running a batch job that updates multiple validators in our local development environment. This degradation hinders Lighthouse's ability to fulfil its validator duties and, in many cases, triggers the OOM killer.
Version
- 1.69.0
- v4.4.1
- 051c3e84

Present Behaviour
Lighthouse is hampered by significant performance degradation when executing a batch job to update multiple validators in the local development environment. The symptoms include:
- Attempts to capture a CPU profile pointed towards `slog`, indicating potential performance bottlenecks related to logging.

Expected Behaviour
Lighthouse should update multiple validators without notable performance degradation. Performance should remain stable, and the OOM killer should not be triggered.
Steps to resolve
Efforts to address the issue involved:
- Capturing a CPU profile, which pointed towards `async-slog`, with logging consuming 1 second out of a 10-second profile. This observation suggested potential issues with `async-slog` and correlated with the errors in the logs, indicating a logger buffer overflow.
- Running `heaptrack`, with no luck.
- Monitoring memory with the `top` command, noting modest but consistent growth.