sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

Add DVT "selections" endpoints to VC #4248

Open paulhauner opened 1 year ago

paulhauner commented 1 year ago

Overview

In https://github.com/ethereum/beacon-APIs/pull/224, there were two new endpoints added to the Beacon APIs:

  1. /eth/v1/validator/beacon_committee_selections
  2. /eth/v1/validator/sync_committee_selections

These two endpoints are not expected to be implemented by client BNs. Rather, these endpoints are expected to be implemented by "DVT (distributed validator technology) middleware" that sits between the BN and VC. Whilst no one is asking us to implement this for the LH BN, Obol are asking that we implement it in the LH VC.

For clarity, here's a simple illustration:

VC  <--------->  DVT Middleware  <--------->  BN
^                ^                            ^
Client-side      Server-side                  Irrelevant
of endpoints     of endpoints

These endpoints solve the issue of aggregation selection for attestations and sync aggregates for distributed validators. As you know, a validator is selected to be an aggregator based on its signature over the current slot. The issue with DVT is that distributed validators only hold a cryptographic partial of the full keypair that is known to the Beacon Chain. Since DVT validators cannot independently sign with the full keypair (only the partial keypair), they cannot determine whether their validator is an aggregator or not.
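For context, the spec's aggregator check hashes the signature over the slot, so only the full (combined) signature yields the canonical answer; a partial signature hashes to an unrelated value. A rough sketch of that check, using the `sha2` crate and eliding the committee lookup:

```rust
use sha2::{Digest, Sha256};

const TARGET_AGGREGATORS_PER_COMMITTEE: u64 = 16;

/// Spec-style `is_aggregator` check: hash the slot signature and test the
/// first 8 bytes (as a little-endian u64) against a committee-derived modulo.
/// Only the full, combined signature produces the canonical result; a partial
/// signature hashes to an unrelated value and gives a meaningless answer.
fn is_aggregator(committee_len: usize, slot_signature: &[u8]) -> bool {
    let modulo = std::cmp::max(1, committee_len as u64 / TARGET_AGGREGATORS_PER_COMMITTEE);
    let digest = Sha256::digest(slot_signature);
    let value = u64::from_le_bytes(digest[0..8].try_into().unwrap());
    value % modulo == 0
}
```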

The new endpoints solve this by giving the VC a new endpoint to call which will resolve its partial signature over the slot into a full signature over the slot.

There's more info in this Google doc from Obol.

Implementation

To implement this in Lighthouse, we'd have to make some modifications to the VC. Firstly, we'd need a flag to enable/disable this behavior (perhaps something like --use-dvt-selections-endpoints or a more concise alternative).
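As a sketch of what that could look like with a clap-style CLI (the flag name is just the suggestion above, and the wiring here is hypothetical, not the VC's actual CLI code):

```rust
use clap::{App, Arg};

// Hypothetical CLI wiring for the VC; the flag name follows the suggestion
// above and is not an existing Lighthouse flag.
fn apply_dvt_flag<'a, 'b>(app: App<'a, 'b>) -> App<'a, 'b> {
    app.arg(
        Arg::with_name("use-dvt-selections-endpoints")
            .long("use-dvt-selections-endpoints")
            .help(
                "If present, resolve partial selection proofs into full proofs \
                 via the beacon_committee_selections and \
                 sync_committee_selections endpoints before aggregation duties.",
            )
            .takes_value(false),
    )
}
```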

Then, we'd need to add additional logic when we're determining aggregate duties to ensure that:

  1. We use the new "selections" endpoints to resolve the partial signature into the full slot signature (see the sketch after this list).
  2. When determining if we're an aggregator, we use the full signature.
  3. When producing an aggregate we use the full signature for the selection_proof field.
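For step 1, here's a minimal sketch of the client side of that call, assuming the request/response schema from the beacon-APIs PR (an array of selections in, the combined selections back under `data`, integers as decimal strings, proofs as 0x-hex) and using `serde` plus blocking `reqwest` with the `json` feature. The names are illustrative and the `sync_committee_selections` variant would be analogous:

```rust
use serde::{Deserialize, Serialize};

/// One item of the `beacon_committee_selections` request/response body,
/// following the schema in the beacon-APIs PR.
#[derive(Serialize, Deserialize)]
struct BeaconCommitteeSelection {
    validator_index: String,
    slot: String,
    selection_proof: String,
}

#[derive(Deserialize)]
struct SelectionsResponse {
    data: Vec<BeaconCommitteeSelection>,
}

/// POST the partial selection proofs to the middleware and return the
/// combined proofs it sends back.
fn resolve_selections(
    base_url: &str,
    partial: &[BeaconCommitteeSelection],
) -> Result<Vec<BeaconCommitteeSelection>, reqwest::Error> {
    let url = format!("{}/eth/v1/validator/beacon_committee_selections", base_url);
    let resp: SelectionsResponse = reqwest::blocking::Client::new()
        .post(url)
        .json(partial)
        .send()?
        .error_for_status()?
        .json()?;
    Ok(resp.data)
}
```

The combined proof returned here is what would then feed the aggregator check (step 2) and the `selection_proof` field (step 3).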

Anyone implementing this should give the Obol Google doc a read. Obol has clearly put effort into thinking about what the implementation looks like for client teams.

Additional Considerations

As far as I'm aware, this functionality is exclusive to Obol. I'm aware that we need to be careful about favoring any particular entity. However, I'm also aware that sometimes a single entity will "pave roads" and I don't think we want to get in the way of that.

In the Obol Google doc there are two suggestions that I'd personally avoid implementing:

  1. Using the --distributed flag to enable/disable the selections endpoints. Although it's deliciously simple, I fear it is too generic and doesn't give room for competing systems.
  2. Auto-enabling the selections endpoints via matching against Obol software in /eth/v1/node/version output. I'd be open to a more generic method of auto-identification; however, I'm not super keen on maintaining a special list of software that is allowed to auto-enable.

I've seen a comment that this solution goes against the EF's "latest DVT solution". I've also heard similar sentiment from SSV Network, a competitor of Obol who have been pioneering in this space for quite some time. That being said, it looks like the ethereum/distributed-validator-specs haven't had a change to the primary branch since October 2022 (7 months ago). Given that (a) the EF DVT solution smells a little stale and (b) adding the selections endpoints seems safe enough and doesn't exclude an EF/coordinated DVT solution, I'm tempted to give little weight to the argument.

I'm going to share this issue with a few people so that we can get some feedback. Given that DVT generally aligns with our ethos, I think there's a fairly strong case as to why we should implement this. However, I'm particularly interested in the arguments as to why we should not implement this.

OisinKyne commented 1 year ago

Hey guys, love this in-depth issue and happy to add a bit of colour in a few places.

The only major point to flag is:

 3. When producing an aggregate we use the partial signature for the selection_proof field.

The full, combined signature, not the partial, should be used as the selection_proof.

Other than that, I would like to give a bit of history of the original spec, and why I think an easily replaceable, read-only, middleware-based approach is both safer and better for Ethereum's neutrality.

As far as I'm aware, this functionality is exclusive to Obol. I'm aware that we need to be careful about favoring any particular entity. However, I'm also aware that sometimes a single entity will "pave roads" and I don't think we want to get in the way of that.

So this is true, and I acknowledge there are others in the space that would like to muddy the waters around whether something is canonical or not, but I would like to add a bit of color as to where this design came from, why I think it is important for client teams, and why it does not unduly favor any one DV implementation over others.

Collin and Mara have been working with the EF on DVs since before Devcon 5. They formed the original trustless validators working group, and later, when Obol was started, we co-funded a grant with the EF to develop the original spec with Consensys. The main issue with it, in my eyes, was that we assumed changing the base protocol was unviable/out of scope, which excluded a middleware approach, as it was not possible for a middleware without private-key signing power to determine if it was an aggregator.

Effectively, I was uncomfortable with this restriction, as it would have meant that Obol either 1) needed to build a custom validator client that we would have to convince operators was superior to existing VCs; or 2) would have to convince every client team to support a single DV protocol, and would have to get the protocol near perfect on the first try or face a very slow iteration cycle, since every breaking change would need full client team consensus.

To expand on 1): I was reticent to go down the path of making a custom VC. I think the separation of concerns between coming to consensus on what to sign (DVT) and the code that checks slashing conditions and controls the private keys is important. We saw other DV teams that did not think this was important end up slashing 20+ testnet validators due to a bug in their VC; worse, if their supply chain were compromised and malicious code were in charge, every validator could be slashed, not just a subset. Designing your architecture assuming you will be compromised is important, and that's why I was so insistent on a read-only middleware, which, if compromised, poses a liveness risk rather than a safety risk to its validators. (And we intend to further mitigate those liveness risks with multiple interoperable middleware implementations.) I believe this decision has made Charon and Obol more trustable for node operators to consider running versus a version requiring private key access.

And to expand on 2): this middleware-based approach to distributed validators gives client teams new optionality. If Obol becomes a bad actor, stops maintaining its client, or simply gets out-competed, everyone can swap in a new/faster/better middleware without any significant changes to their codebase. If we went the route of a canonical single spec that all client teams build into their VCs directly, a new, better protocol might never get enough support/adoption. Or, if it did, we might fragment the client teams into supporting different variations and end up in an xkcd standards problem.

I raised this with Danny and Ben E last June and asked if they saw this problem space the same way I did, and whether they had any ideas for how we could minimally change the base spec to enable middleware DVs. They were both very supportive of enabling DVs to run as replaceable middlewares, and with their help we first made this proposal to ethresear.ch, refined it, proved it works by building it ourselves with our own (unreleased, dev-only) VC, and ultimately got it approved and included in the spec. (59:40 for Danny calling this out explicitly)

So to conclude, this was longer than I intended, but I thought it would help provide color on how and why we ended up enabling middleware DVs, and that they are a good idea and this is not overly favouring of a single team/solution. :)

paulhauner commented 1 year ago

The full, combined signature, not the partial, should be used as the selection_proof.

Thanks for this! I fixed this error in the issue description 🙏

paulhauner commented 1 year ago

Personally I'm generally not a fan of middleware for two reasons:

  1. It becomes the source of truth for the VC (i.e. it can lie to the VC about what the head of the chain is)
  2. It's not really clear how multiple middlewares scale (e.g. if someone tries to add another piece of middleware, it's unclear which layer of the middleware sandwich it should sit in)

Originally mev-boost was going to be a BN/EL middleware, but we successfully avoided that by making it a "sidecar" instead. This protocol has many different properties though, so I don't think a straight analogue between the two is fair or useful.

However, just because I generally don't like middleware it doesn't mean that it's always bad. It could very well be a good solution in this case, especially considering how long it might take to specify and implement an alternative.

Although I don't love middleware, I'm still of the mind that this change is small and safe enough to implement.

@OisinKyne if there's going to be several implementations of these endpoints in VCs, it might be worth providing some sort of "mock" middleware that client teams can use for testing. If client teams don't need to implement the server side of these endpoints, then each team is going to have to build its own type of testing/mocking to test them. Deduplicating work is always nice, plus it might give you some sort of conformance testing framework about which VCs you can recommend.
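For illustration, a toy mock along these lines could be very small; here's a sketch using the `tiny_http` crate that just echoes the submitted selections back under `data`, as if the partial proofs were already combined, which is enough for happy-path testing. The port and behaviour are assumptions, not anything Obol has published:

```rust
use std::io::Read;
use tiny_http::{Response, Server};

// Hypothetical mock middleware: accepts selections POSTs and echoes the
// request body back under `data`, as if the proofs were already combined.
fn main() {
    let server = Server::http("127.0.0.1:5052").unwrap();
    for mut request in server.incoming_requests() {
        let mut body = String::new();
        request.as_reader().read_to_string(&mut body).unwrap();
        let reply = match request.url() {
            "/eth/v1/validator/beacon_committee_selections"
            | "/eth/v1/validator/sync_committee_selections" => {
                // Wrap the submitted array in the spec's `data` envelope.
                Response::from_string(format!("{{\"data\":{}}}", body))
            }
            _ => Response::from_string("not found").with_status_code(404),
        };
        let _ = request.respond(reply);
    }
}
```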

alonmuroch commented 1 year ago

Hi everyone, I wanted to jump into the discussion to give my 2 Satoshi.

My team and I developed the first proof-of-concept DVT middleware in 2020, together with the EF; the same one that @OisinKyne later joined in testing in its second iteration, after we invited outsiders to test it out :).

Full disclosure: I'm not against this PR; competition is good for DVT. Having said that, I do feel the backstory of why the remote signer approach was adopted is important. Also, the framing of this as a "need" for DVT is a bit misleading, as it's unclear that it is one.

For now, this is an Obol-specific PR for Charon, which currently has a commercial use license. Even if others wanted to use the middleware approach, they would have to develop it from scratch (as they can't use the Charon code), which might result in entirely different API requirements (and additional PRs asking clients for changes). It's not like MEV-Boost, which is a public good developed by the community for the community.

Middleware was abandoned in favor of a remote signer approach for several reasons:

  1. Before moving to a remote signer approach we conducted calls about this very issue with Lighthouse, Prysm and Teku. It was clear back then, as it's clear now, that further changes will need to be made in the VC code to make it perfectly suitable for DVT (and not just happy-flow scenarios). Such changes included timeout configs, API changes to handle late replies, and more (happy to expand on this if needed).
  2. VC code (excluding the signer and slashing protection) is just coordination software for duties; the sensitive part is the signer itself.
  3. There are strong and super scalable remote signer implementations out there which are widely used.
  4. Many staking applications use some sort of remote signer; many don't use the default signer VCs come with. This PR will force them to have the following setup: remote signer <> VC <> DVT <> BN instead of just remote signer <> DVT <> BN. More complex, with no obvious value.
  5. VC <> BN communication has many moving parts; putting a third wheel between them is an engineering mess. Every change/fix/feature will potentially need more changes from client teams. This PR puts a foot in the door which will later be forced open completely once new issues/changes are required.
  6. Future changes to the beacon chain might require other/additional "pre-consensus" signing (e.g. randao) that will again require more changes in VC code. Considering that API changes always lag protocol changes, it might be the case that a BN fork happens but a DVT implementation can't support it until all VCs push changes as well. Seems risky.

Thinking long term here is crucial as DVT gets into its mainnet phase. We need to have a "no-coupling" approach here so DVT can evolve independently of VC/BN code rather than being fully dependent on it, as that would be an engineering nightmare in the future.

OisinKyne commented 1 year ago

@OisinKyne if there's going to be several implementations of these endpoints in VCs, it might be worth providing some sort of "mock" middleware that client teams can use for testing. If client teams don't need to implement the server side of these endpoints, then each team is going to have to build its own type of testing/mocking to test them. Deduplicating work is always nice, plus it might give you some sort of conformance testing framework about which VCs you can recommend.

Feedback received, thanks Paul! We will see what we can do on this front, potentially we can PR something into Hive in the near term.

And to refute some points and address some inaccuracies in Alon's statement about 'abandoning' a middleware because I can't not:

It was clear back then, as it's clear now, that further changes will need to be made in the VC code to make it perfectly suitable for DVT

This model is running just fine on mainnet and at scale on testnet. Of course every team can tweak and optimize some performance parameters, but I would like to hear you expand on what other unilateral change is needed, because I do not believe there is one.

VC code (excluding the signer and slashing protection) is just coordination software for duties; the sensitive part is the signer itself

Yes that is the point of separating concerns into coordination software and signing software. Charon is the coordination software and does not need to be near the private keys.

There are strong and super scalable remote signer implementations out there which are widely used.

There are more validator client implementations out there than remote signers. Also, I believe all validator clients support one or more remote signers, so this is a false either/or. Also, of the three main remote signer implementations out there, one recently caused an outage and another caused a slashing.

Many staking applications use some sort of remote signer; many don't use the default signer VCs come with. This PR will force them to have the following setup: remote signer <> VC <> DVT <> BN instead of just remote signer <> DVT <> BN. More complex, with no obvious value.

Here are some reasons it's valuable:

VC <> BN communication has many moving parts; putting a third wheel between them is an engineering mess. Every change/fix/feature will potentially need more changes from client teams.

VCs and BNs communicate over a succinct and standardised API. There are three+ implementations of the key manager API interface.

This PR puts a foot in the door which will later be forced open completely once new issues/changes are required.

Ethereum validators have been intended to be MPC-friendly since the earliest Serenity designs, with work from V, Dankrad, Justin Drake, Carl Beek and more on how to design a validator interface that is simple and MPC-compatible. This is not a foot in the door; this is "hey, we've gone and built what you planned for and the only snag is this piece left, can you help us finish it?".

it might be the case that a BN fork happens but a DVT implementation can't support it until all VCs push changes as well. Seems risky

Yes, this can happen, but as above, the plan since adopting BLS signatures has been for Ethereum validators to be MPC-friendly and communicate over a simple API. This also happens at the remote signer layer: they have to pass pre-signatures amongst themselves to assemble an aggregate and conclude what to sign next. Pre-signatures present complexity at any layer, and determinism and avoiding multi-round complexity where possible leads to simpler designs for client teams of all types to implement.

We need to have a "no-coupling" approach here so DVT can evolve independently of VC/BN code rather than being fully dependent on it, as that would be an engineering nightmare in the future.

Adopting simple, standard APIs, separating "what to sign" from the "power to sign it", and favoring optionality over replacement software is how we stay out of an engineering nightmare in the future. An SSV client extracts Lighthouse from the most important piece of the entire puzzle and puts an internet-connected piece of software in there instead. If the "distributed" part of "distributed validators" is not something that every validator can easily opt into (by using the API they already support), then we will either get stuck on unanimous support of a single protocol that cannot mature/iterate, or we fragment into different VCs speaking different protocols.

alonmuroch commented 1 year ago

This model is running just fine on mainnet and at scale on testnet. Of course every team can tweak and optimize some performance parameters, but I would like to hear you expand on what other unilateral change is needed, because I do not believe there is one.

Well, Obol requires changes from all clients, so for one that's a dependency which should be avoided. Furthermore, timeouts are super sensitive, as QBFT protocols by their nature have no upper bound on random delays. Some clients set API timeouts to a few seconds and some to more, but that means your QBFT protocol doesn't support round changes beyond the first few rounds, which, at scale, can cause liveness issues simply because VCs are not meant to work this way. I'm sure other issues will naturally arise when you "tweak" software for use cases it's not intended for.
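To make that arithmetic concrete, here's a rough sketch assuming round-change timeouts that double each round, which is a common QBFT choice rather than anything either project has specified here: with a 5-second API timeout and a 1-second base round, only the first two rounds fit.

```rust
// Illustrative arithmetic only: assumes a QBFT-style round-change timeout
// of base * 2^round, and counts how many full rounds complete before a
// fixed HTTP API timeout would cut the request off.
fn rounds_within(http_timeout_secs: u64, base_secs: u64) -> u32 {
    let mut elapsed: u64 = 0;
    let mut round: u32 = 0;
    loop {
        let round_timeout = base_secs << round; // base * 2^round
        if elapsed + round_timeout > http_timeout_secs {
            return round;
        }
        elapsed += round_timeout;
        round += 1;
    }
}

fn main() {
    // Round 0 (1s) + round 1 (2s) = 3s fits in 5s; round 2 would end at 7s.
    assert_eq!(rounds_within(5, 1), 2);
    println!("rounds that fit: {}", rounds_within(5, 1));
}
```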

An operator does not have to trust the distributed validator client to the same extent if it is an HTTP server rather than an HTTP client capable of running privileged requests on a sensitive API. They can keep the internet-connected DV client separated at a network firewall level from their web3signer.

IT security is up to the operator to choose; it can be safe or not. Blaming slashings/outages on remote signers is nit-picking; all of them are pretty solid projects that have been running tens of thousands (if not more) of validators for years.

If the software is compromised, one design can cause much more harm than the other. A VC talking to a signer can change fee recipients or delete keys. Charon can ignore validators if it's misconfigured. It can't change the fee recipient without 2/3 of the cluster being malicious or the core team being malicious (in which case nothing is safe anyhow).

With a remote signer approach you have such protections.

Ethereum validators have been intended to be MPC-friendly since the earliest Serenity designs, with work from V, Dankrad, Justin Drake, Carl Beek and more on how to design a validator interface that is simple and MPC-compatible. This is not a foot in the door; this is "hey, we've gone and built what you planned for and the only snag is this piece left, can you help us finish it?".

MPC-friendly says nothing about software integrations ...

... and puts an internet-connected piece of software in there instead ...

A middleware Charon is still internet-connected, otherwise it can't talk to the other operators.

I still claim this approach is overly complicated, creates a dependency between 5 implementation teams and 1 DVT team, will require more and more maintenance and changes down the line (for a commercially licensed project), and really doesn't help with the above points, since it "tells" the VC what it wants and can cause issues as it grows.

AgeManning commented 12 months ago

I agree with all the points mentioned here. As this issue is relatively old and there's no clear path forward for the general case, I'm suggesting (at least temporarily) that we add a CLI flag that allows Lighthouse to be compatible with some DVT software. My personal stance is that we should make Lighthouse compatible with efforts in the space and be as agnostic in our support as we can. Grouping functionality that solo validators may not use, or that may affect performance, behind either a CLI flag or a feature seems like an appropriate path forward, even if we decide it's a short-term solution.

For some immediate changes, I've made #4867.