sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

Batch attestation slashability checking #1914

Open michaelsproul opened 3 years ago

michaelsproul commented 3 years ago

Description

With large numbers of validators and heavy disk congestion, we've observed the slashing protection database timing out with errors like this:

Nov 16 00:50:31.056 INFO Successfully published attestation      type: unaggregated, slot: 39850, committee_index: 2, head_block: 0x3bd75ebc65de718b36911eaab7dad3a9ef7ca44c7ba03d32c7092195c43bcf43, service: attestation
Nov 16 00:50:32.838 CRIT Not signing slashable block             error: SQLPoolError("Error(None)")
Nov 16 00:50:32.838 CRIT Error whilst producing block            message: Unable to sign block, service: block

The default timeout is 5 seconds.

Presently, we sign attestations one at a time and broadcast them, which means that each attester requires a new SQLite database transaction.

https://github.com/sigp/lighthouse/blob/master/validator_client/src/attestation_service.rs#L332-L404

To alleviate the congestion slightly, we could switch to an algorithm like:

  1. Begin an SQLite transaction txn
  2. Check and sign all attestations as part of txn
  3. Commit txn
  4. Broadcast attestations

That way we preserve the property that an attestation is only broadcast if its signature has been persisted to disk (which is crash-safe). Broadcasting after each check but before the transaction commits would violate this property.

Version

Lighthouse v0.3.4

michaelsproul commented 3 years ago

Another potential advantage of batching like this: we might be able to harden SQLite against crashes by performing a disk sync after each batch, although we would still be vulnerable to disks that don't respect fsync (see section 3.1 of https://www.sqlite.org/howtocorrupt.html).

michaelsproul commented 2 years ago

I tried to implement this atop the Web3Signer PR (#2522) just now. I quickly ran into borrow-checker hell, as the sqlite library doesn't support sharing a Transaction across threads (it is not Send); see: https://github.com/rusqlite/rusqlite/issues/697.

This doesn't currently cause problems at compile time because each call to sign_attestation begins its own local transaction, which commits before the async call to the background thread/remote signer is made. However, the database only supports one active transaction at a time (for safety), so this is a bottleneck and will hamper the parallel requests to the remote signer: all the tasks will need to individually obtain the transaction lock and commit before they hit the remote :(

So far the best solution I can think of is quite spicy :hot_pepper:, and changes one of the fundamental invariants of the validator client. Rather than enforcing:

(1) Only sign an attestation if it is safe to do so

We could enforce:

(2) Only broadcast an attestation if it is safe to do so

This would let us perform the slashability checks in one batch after all of the signatures had been created in parallel. There would be no need to hold a lock on a Transaction across await points, and no lock contention, because we just lock and write once. For this to really work, though, we would need to refactor the attestation service: currently it spawns one task per committee and index, with a broadcast happening in each task. I need to think more about whether refactoring it to sign all attestations, store all attestations, then broadcast all attestations is feasible, and whether it would introduce any other issues.


Alternatively, a more involved solution would be to modify the slashing protection database's locking paradigm so that it can lock on a per-validator basis. We should never be signing more than 1 attestation for the same validator in a single slot, so this would be ideal, but it seems that we wouldn't be able to (easily) achieve this with SQLite: https://softwareengineering.stackexchange.com/questions/340550/why-are-concurrent-writes-not-allowed-on-an-sqlite-database

winksaville commented 2 years ago

> Alternatively, a more involved solution would be to modify the slashing protection database's locking paradigm so that it can lock on a per-validator basis. We should never be signing more than 1 attestation for the same validator in a single slot, so this would be ideal, but it seems that we wouldn't be able to (easily) achieve this with SQLite:

It seems to me that another DB should be considered if SQLite doesn't support desirable features that Lighthouse could/should support. For example, if it allowed an easier implementation of #2557, which is important to me, I'd go in that direction.

realbigsean commented 2 years ago

We've had a report in Discord about the web3signer endpoint taking more than 20 seconds to respond to a request. I'm thinking it has to do with lock contention on the slashing protection DB. The same user also reported that attesting validators have missed attestations while new validators are being created via the web3signer endpoint.

michaelsproul commented 9 months ago

Gonna start on this for Holesky's sake.

michaelsproul commented 4 months ago

I never got around to implementing this, but would be happy to pair on it with someone

eserilev commented 1 month ago

I'm interested in working on this

michaelsproul commented 1 month ago

@eserilev that would be awesome! Let me know if you have any Qs

eserilev commented 2 weeks ago

Made some decent progress here!

I have a POC branch here: https://github.com/eserilev/lighthouse/tree/attestation-slashability-poc

Here's a quick summary of the changes made so far:

The code itself is fairly messy; I was really just trying to get something working on a Kurtosis testnet. The local testnet I'm running has 5000 validators against a single BN. It has 100% attestation performance and has reached finality. Looking at the VC logs, I'm not seeing any timeout issues or other related errors.

Next steps for me are to clean this up, run some more local tests, and prepare it for a potential deploy to a Holesky validator.