michaelsproul opened 3 years ago
Another potential advantage to batching like this: we might be able to harden SQLite against crashes by performing a disk sync after each batch. Although we would still be vulnerable to disks that don't respect fsync (see section 3.1 here: https://www.sqlite.org/howtocorrupt.html).
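For reference, how aggressively SQLite fsyncs around commits is controlled by the `synchronous` pragma (this is standard SQLite behaviour, not something the PR configures today):

```sql
-- In FULL mode, SQLite syncs to disk at the critical moments,
-- including before a commit completes, at the cost of slower writes.
PRAGMA synchronous = FULL;
```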
I tried to implement this atop the Web3Signer PR (#2522) just now. I quickly ran into borrowck hell, as the sqlite library doesn't support sharing a `Transaction` across threads (it is not `Send`), see: https://github.com/rusqlite/rusqlite/issues/697.
This doesn't currently cause problems at compile time because each call to `sign_attestation` begins its own local transaction, which commits before the async call to the background thread/remote signer is made. However, the database only supports one active transaction at a time (for safety), so this is a bottleneck, and it will hamper parallel requests to the remote signer: all the tasks need to individually obtain the transaction lock and commit before they hit the remote :(
So far the best solution I can think of is quite spicy :hot_pepper:, and changes one of the fundamental invariants of the validator client. Rather than enforcing:
(1) Only sign an attestation if it is safe to do so
We could enforce:
(2) Only broadcast an attestation if it is safe to do so
This would let us make the slashability checks in one batch after all of the signatures had been created in parallel. No need to hold a lock on a `Transaction` across await points, and no lock contention, because we just lock and write once. For this to really work, though, we would need to refactor the attestation service. Currently it spawns one task per committee and index, with a broadcast happening in each task. I need to think more about whether refactoring it to sign all attestations, store all attestations, then broadcast all attestations is feasible, and whether it would introduce any other issues.
Alternatively, a more involved solution would be to modify the slashing protection database's locking paradigm so that it can lock on a per-validator basis. We should never be signing more than 1 attestation for the same validator in a single slot, so this would be ideal, but it seems that we wouldn't be able to (easily) achieve this with SQLite: https://softwareengineering.stackexchange.com/questions/340550/why-are-concurrent-writes-not-allowed-on-an-sqlite-database
It seems to me that another DB should be considered if SQLite doesn't support desirable features that Lighthouse could/should have. For example, if it allowed an easier implementation of #2557, which is important to me, I'd go in that direction.
We've had a report in Discord about the `web3signer` endpoint taking more than 20 seconds to return a request. I'm thinking it has to do with lock contention on the slashing protection DB. The same user also reported that attesting validators have missed attestations while new validators are being created with the `web3signer` endpoint.
Gonna start on this for Holesky's sake.
I never got around to implementing this, but would be happy to pair on it with someone
I'm interested in working on this
@eserilev that would be awesome! Let me know if you have any Qs
Made some decent progress here!
I have a POC branch here: https://github.com/eserilev/lighthouse/tree/attestation-slashability-poc
Here's a quick summary of the changes made so far:
The `post_beacon_pool_attestations` beacon API endpoint allows us to post a list of attestations. The code itself is fairly messy; I was really just trying to get something working on a kurtosis testnet. The local testnet I'm running has 5000 validators running against a single BN. It has 100% attestation performance and has reached finality. Looking at the VC logs, I'm not seeing any timeout issues or other related errors.
Next steps for me are to clean this up, run some more local tests, and prepare it for a potential deploy to a Holesky validator.
Description
With large numbers of validators and heavy disk congestion, we've observed the slashing protection database timing out with errors like this:
The default timeout is 5 seconds.
Presently, we sign attestations one at a time and broadcast them, which means that each attester requires a new SQLite database transaction.
https://github.com/sigp/lighthouse/blob/master/validator_client/src/attestation_service.rs#L332-L404
To alleviate the congestion slightly, we could switch to an algorithm like:

1. Begin a single transaction `txn`.
2. For each attestation: check slashability, and if it is safe, sign it and record the signing in `txn`.
3. Commit `txn`.
4. Broadcast all of the signed attestations.
That way we preserve the property that an attestation is only broadcast if its signature has been persisted to disk (which is crash-safe). Broadcasting after each check but before the transaction commits would violate this property.
Version
Lighthouse v0.3.4