sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0
2.9k stars 736 forks source link

Atomicity bug while adding validators using validator client API #4984

Open michaelsproul opened 10 months ago

michaelsproul commented 10 months ago

Description

Found by @chong-he:

If the certificate for a web3signer validator is invalid while adding a validator using /lighthouse/validators/web3signer then the validator will get added in-memory but will not be activated and will not be persisted in the validator definitions on disk. Attempting to re-add the same key with correct values then fails with a duplicate key error :skull:

E.g.

Start the VC:

lighthouse vc --http

Create a file named request.json:

[
    {
        "enable": true,
        "description": "validator_one",
        "graffiti": "Mr F was here",
        "suggested_fee_recipient": "0xa2e334e71511686bcfe38bb3ee1ad8f6babcc03d",
        "voting_public_key": "0xa062f95fee747144d5e511940624bc6546509eeaeae9383257a9c43e7ddc58c17c2bab4ae62053122184c381b90db380",
        "url": "http://path-to-web3signer.com",
        "root_certificate_path": "/path/on/vc/filesystem/to/certificate.pem",
        "request_timeout_ms": 12000
    }
]

Make the request once (ensuring certificate.pem does not exist or is invalid):

curl -X POST --data @request.json -H "Content-Type: application/json" -H "Authorization: Bearer "(cat ~/.lighthouse/mainnet/validators/api-token.txt) "http://localhost:5062/lighthouse/validators/web3signer"
{"code":500,"message":"INTERNAL_SERVER_ERROR: failed to initialize validator: \"Unable to add definition: InvalidWeb3SignerRootCertificate(reqwest::Error { kind: Builder, source: Error { code: -50, message: \\\"One or more parameters passed to a function were not valid.\\\" } })\"","stacktraces":[]}

Make the request again:

curl -X POST --data @request.json -H "Content-Type: application/json" -H "Authorization: Bearer "(cat ~/.lighthouse/mainnet/validators/api-token.txt) "http://localhost:5062/lighthouse/validators/web3signer"
{"code":500,"message":"INTERNAL_SERVER_ERROR: failed to initialize validator: \"Unable to add definition: DuplicatePublicKey\"","stacktraces":[]}

We shouldn't get duplicate validator here, because the validator didn't get added successfully. The VC still logs that no validators are active:

INFO No validators present msg: see lighthouse vm create --help or the HTTP API documentation, service: notifier

However we see that the validator is stored in memory, which is why we get the duplicate error:

curl -s -H "Content-Type: application/json" -H "Authorization: Bearer "(cat ~/.lighthouse/mainnet/validators/api-token.txt) "http://localhost:5062/eth/v1/remotekeys" | jq
{
  "data": [
    {
      "pubkey": "0xa062f95fee747144d5e511940624bc6546509eeaeae9383257a9c43e7ddc58c17c2bab4ae62053122184c381b90db380",
      "url": "http://path-to-web3signer.com",
      "readonly": false
    }
  ]
}

Version

Lighthouse v4.5.0

Steps to resolve

Harden the atomicity of adding validators. We could use copy-on-write followed by compare-and-set (my favourite), or something like adding the validator initially disabled, then switching it to enabled once everything has succeeded. Or we could just apply stricter pre-validation before we go mutating anything (although this could still cause a failure if a disk op fails, etc).

chong-he commented 9 months ago

Upon testing, client_identity_path and client_identity_password are having the same bug, i.e., if the first import via HTTP API fails, a second try will show DuplicatePublicKey.

The HTTP API works well if the first import is successful and it gets added to the validator_definitions.yml file as expected.

v4lproik commented 9 months ago

I would love to take this up if some help is needed :)

michaelsproul commented 9 months ago

@v4lproik This issue is quite open-ended and would require a bit of exploration of the design space to solve fully, I think. My current thinking is that we should adopt some approach like software transaction memory or copy-on-write to make our implementation more obviously correct. This would require overhauling some of the datastructures to make them cheaper to copy (persistent data structures).

Then the transaction process would be something like:

  1. Start an in-memory transaction by copying the initialized validators (through an upgradeable rwlock perhaps)
  2. Make the desired mutations; abort if any fail (all changes reverted)
  3. Write to disk
  4. Commit the in-memory transaction by writing the new version of the data (through the upgraded rwlock).

I'm a big fan of CoW though, so I'm a bit biased. It would also be possible to solve this issue by just handling the specific cases highlighted above better, e.g. by doing all operations that can fail (e.g. reading from disk) prior to updating anything permanently in-memory.

Another issue that's tangentially related (and might make testing these code paths more straight-forward) is https://github.com/sigp/lighthouse/issues/4854. That could be a good one to start with to dip your toes in the VC API, perhaps with an eye to solving this issue too.

v4lproik commented 9 months ago

Sounds good! Thank you very much for the clear explanation! I will get on with #4854 instead!