sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

Remove lock timeouts from BeaconChain #1096

Closed · paulhauner closed 1 month ago

paulhauner commented 4 years ago

Description

Remove old ..._LOCK_TIMEOUT variables from BeaconChain.

Detail

Whilst trying to find a deadlock, we added timeouts to our core Mutex/RwLock instances so that we'd get errors when locks were held too long. Now that we've identified the cause as long-running tasks being run on the tokio executor, we can clear these timeouts out. I believe they're all listed here, but part of this issue is to find any others:

https://github.com/sigp/lighthouse/blob/1abb54dabd6af4fa9e79cdc914626e2da3e11a46/beacon_node/beacon_chain/src/beacon_chain.rs#L50-L63

Reasons for removing them can be found here: https://github.com/sigp/lighthouse/issues/1078#issuecomment-621024441
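
For illustration, here's a minimal sketch of the timeout pattern being removed, assuming `parking_lot`'s timed lock methods; the constant, struct, and error message are illustrative stand-ins, not the actual `beacon_chain.rs` code:

```rust
use std::time::Duration;

use parking_lot::RwLock;

// Illustrative stand-in for the `..._LOCK_TIMEOUT` constants linked above.
const EXAMPLE_LOCK_TIMEOUT: Duration = Duration::from_secs(1);

struct Cache {
    entries: Vec<u64>,
}

/// Acquire the write lock, but return an error instead of blocking forever
/// if another task holds the lock past the timeout. This is the pattern the
/// issue proposes removing in favour of plain blocking acquisition.
fn update_cache(cache: &RwLock<Cache>, value: u64) -> Result<(), &'static str> {
    let mut guard = cache
        .try_write_for(EXAMPLE_LOCK_TIMEOUT)
        .ok_or("cache lock timeout: possible contention or deadlock")?;
    guard.entries.push(value);
    Ok(())
}
```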

JustinDrake commented 4 years ago

Whilst trying to find a deadlock, we added timeouts to our core Mutex/RwLock instances so that we'd get errors when locks were held too long.

Does the consensus code contain any locking logic? (In #1089 you mentioned the consensus code is "a series of pure functions completely abstracted from databases, networking etc.". Having zero locking logic would help auditing.)

paulhauner commented 4 years ago

Does the consensus code contain any locking logic?

No* it doesn't. We've gone to significant lengths to avoid this. For example, all the signature sets use a generic "getter" closure to obtain the pubkey during signature verification, which allows us to abstract away global caches and other messy things:

https://github.com/sigp/lighthouse/blob/36f213c092eb4802e603cff5901269168637c3f6/eth2/state_processing/src/per_block_processing/signature_sets.rs#L71
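
For illustration, a minimal sketch of that getter-closure pattern with simplified stand-in types (the real code uses Lighthouse's BLS types rather than these):

```rust
use std::borrow::Cow;

// Simplified stand-ins for Lighthouse's BLS types.
#[derive(Clone)]
pub struct PublicKey(pub Vec<u8>);

pub struct SignatureSet<'a> {
    pub pubkey: Cow<'a, PublicKey>,
    pub message: Vec<u8>,
}

/// Build a signature set using a caller-supplied `get_pubkey` closure.
/// The closure may read from a global pubkey cache, a database, or a plain
/// slice -- the consensus code never knows which, so it remains a pure
/// function with no locking logic of its own.
pub fn proposer_signature_set<'a, F>(
    validator_index: usize,
    message: Vec<u8>,
    get_pubkey: F,
) -> Option<SignatureSet<'a>>
where
    F: Fn(usize) -> Option<Cow<'a, PublicKey>>,
{
    Some(SignatureSet {
        pubkey: get_pubkey(validator_index)?,
        message,
    })
}
```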

*: The rayon parallelization code is perhaps a violation here, depending on whether you see it as a black box or for what it really is. We're getting rid of that anyway, so I think "no" is fair.

michaelsproul commented 3 years ago

I'll take this

paulhauner commented 3 years ago

I've dropped this back to A1 since it seems a little risky to push this in v1.0.0 if we need to cut it this week. I think we'd need to do some testing on high validator counts and low-spec machines before pushing this to master.

Whilst the lock timeouts seem unpleasant on paper, they have proven useful in practice many times by identifying contention.

michaelsproul commented 3 years ago

I was thinking the exact same. I started the refactor last week and realised just how many locks need changing. It's not trivial to verify there aren't any lock ordering deadlocks lurking in there, and I think we should take the time to do it properly.
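
For context, the classic shape of a lock-ordering deadlock is two threads taking the same two locks in opposite orders; this toy example (not from the codebase) hangs forever if the timings line up:

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    let a = Arc::new(Mutex::new(0));
    let b = Arc::new(Mutex::new(0));

    let (a2, b2) = (Arc::clone(&a), Arc::clone(&b));
    let t = thread::spawn(move || {
        let _ga = a2.lock().unwrap(); // this thread locks `a`, then `b`
        thread::sleep(Duration::from_millis(50));
        let _gb = b2.lock().unwrap();
    });

    let _gb = b.lock().unwrap(); // main thread locks `b`, then `a` -- inverted!
    thread::sleep(Duration::from_millis(50));
    let _ga = a.lock().unwrap(); // deadlock: each thread waits on the other

    t.join().unwrap();
}
```

With timeouts removed, nothing surfaces this as an error; the node just stops making progress.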

michaelsproul commented 2 years ago

Users seem to be running into this increasingly often while syncing. A timeout of 1 second isn't much longer than block processing, which now sometimes takes 500ms+ depending on hardware.

As a quick fix I was thinking we could bump the default timeout to a longer value, and add a runtime flag to disable the timeouts entirely (which would require a global atomic bool).
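
A rough sketch of what that quick fix could look like, assuming `parking_lot` locks; the flag and helper names here are hypothetical, not actual Lighthouse code:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

use parking_lot::{RwLock, RwLockWriteGuard};

// Hypothetical global flag, set once at startup from a CLI option.
static DISABLE_LOCK_TIMEOUTS: AtomicBool = AtomicBool::new(false);

/// Take the write lock with a timeout by default, but block indefinitely
/// when the user has opted out of timeouts, so an underpowered machine
/// makes slow progress instead of timing out and reverting.
fn timed_write<T>(lock: &RwLock<T>, timeout: Duration) -> Option<RwLockWriteGuard<'_, T>> {
    if DISABLE_LOCK_TIMEOUTS.load(Ordering::Relaxed) {
        Some(lock.write())
    } else {
        lock.try_write_for(timeout)
    }
}
```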

paulhauner commented 2 years ago

If the user was getting pubkey cache timeouts, this might help: https://github.com/sigp/lighthouse/pull/2381

michaelsproul commented 2 years ago

I agree it would be good to get that PR in, and to keep the ability to use the lock timeouts for deadlock detection (or even just contention detection). Then, for asynchronous safety, we can have a mode that disables them entirely, so that even when trying to sync on a really underpowered device you can make some progress rather than constantly timing out and reverting.

michaelsproul commented 2 years ago

I'll make a PR

michaelsproul commented 1 month ago

Closed for good by:

🎉