The nice things about DKG here are that we already have an a priori signer census from the `.pox-4` contract. Specifically, that means that going into DKG:

- We already know all of their signing keys with which they'll sign DKG messages.
- Each signer already has a one-to-many authenticated broadcast primitive (i.e. StackerDB slots).
These two aspects of the DKG protocol turn detecting Byzantine signers into a trivial, localized decision: if you see equivocation from signer S, you (1) locally declare S to be faulty and (2) post your proof-of-equivocation to your StackerDB slot so everyone else sees the evidence that S is faulty. You can't forge S's proof-of-equivocation without S's signing key, so everyone else will reach the same localized decision that S is faulty. So, all honest nodes eventually detect that S is faulty if S equivocates to any honest node.
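For concreteness, here is a minimal sketch of what such a proof-of-equivocation could look like, assuming signed DKG messages carry a round identifier; the type and field names are illustrative, not actual stacks-signer types:

```rust
/// Hypothetical signed DKG message from a signer. The signature is assumed
/// to cover (round, payload) under the signer's well-known signing key.
struct SignedDkgMessage {
    round: u64,
    payload: Vec<u8>,
    signature: Vec<u8>,
}

/// Evidence that a signer sent two conflicting messages for the same round.
/// Because both messages carry the signer's own signature, the proof cannot
/// be forged without that signer's key.
struct EquivocationProof {
    signer_id: u32,
    first: SignedDkgMessage,
    second: SignedDkgMessage,
}

impl EquivocationProof {
    /// Anyone can check the proof: both messages must be validly signed by
    /// `signer_id` for the same round, but with different payloads.
    fn is_valid(&self, verify: impl Fn(u32, &SignedDkgMessage) -> bool) -> bool {
        self.first.round == self.second.round
            && self.first.payload != self.second.payload
            && verify(self.signer_id, &self.first)
            && verify(self.signer_id, &self.second)
    }
}
```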
I think that means that it's sufficient to just perform multiple rounds of DKG if there are any faulty signers. If a faulty signer is detected, honest nodes restart DKG with the faulty signer excluded from the census. They continue to do so until DKG completes, or fewer than 70% of the signers are honest.
Hoping to hear more from @xoloki on this. Did I get the above description right? Also, I elided details on how the system deals with a faulty DKG coordinator, but I'm assuming it's basically the same as dealing with a faulty signer.
@jcnelson you basically have it right. As we've discussed, though, if we allow DKG to complete with only 70% participation, that means that signing will require 100% participation of those who completed DKG.
There's one other issue though: as currently implemented, each `DkgPrivateShare` is encrypted to the recipient. So a byzantine actor who properly sends their `DkgPublicShares`, but sends corrupted `DkgPrivateShares` to one or more participants, will break DKG. In the current design, the only way to prove that a `DkgPrivateShare` is corrupt is to reveal the ECDSA private key of the recipient.
There is an old design for how to enable proving that a `DkgPrivateShare` is corrupt without revealing the ECDSA key of the recipient, but since the ECDSA signing keys were not considered important, I never implemented it. We should decide now how we want to handle this, and I can update the state machines to do it properly.
> There is an old design for how to enable proving that a `DkgPrivateShare` is corrupt without revealing the ECDSA key of the recipient, but since the ECDSA signing keys were not considered important, I never implemented it. We should decide now how we want to handle this, and I can update the state machines to do it properly.
I think we need to do one of two things:

- accept that proving a `DkgPrivateShare` is corrupt requires revealing the ECDSA private key of the recipient, or
- implement the old design for proving that a `DkgPrivateShare` is corrupt without divulging the signing key.

I think we're avoiding some complexities of this situation. I think this applies both to DKG and tenure-change/block signing.
> We already know all of their signing keys with which they'll sign DKG messages.
This doesn't really alter the requirements of getting consensus in a byzantine environment. Just because a participant knows who they are trying to talk to doesn't mean that they'll be able to detect when a participant is non-responsive in the same way that other participants would detect the same.
> Each signer already has a one-to-many authenticated broadcast primitive (i.e. StackerDB slots).
This could change the situation, but the StackerDB doesn't provide this (or at least not with rigorous enough guarantees). Network timing issues are absolutely still possible with the StackerDB: Peer A can hear from Peer C long before Peer B hears from Peer C, and therefore Peer A and Peer B could come to different conclusions about whether or not Peer C is participating.
For example,
```
       Peer A <---> Peer B <---> Peer C

T1:    receive B                 receive B
T2:                 receive A
T3:                              receive A
T4:                 receive C
T5:    |--- network event ---|
```
If the participation deadline is T6, Peers B & C would think the group is A,B,C, but Peer A would think the group is A, B.
My understanding of the FROST paper is that it is kind of hand-wavy about misbehavior -- maybe I've missed something, but the section on misbehavior is:
> If one of the signing participants provides an incorrect signature share, SA will detect that and abort the protocol, if SA is itself behaving correctly. The protocol can then be rerun with the misbehaving party removed. If SA is itself misbehaving, and even if up to t − 1 participants are corrupted, SA still cannot produce a valid signature on a message not approved by at least one honest participant.
But that requires agreement on who is misbehaving, and it also requires choosing a new SA if SA is misbehaving in order to preserve liveness. These all seem to me to be very BFT-flavored problems.
@kantai you're not entirely wrong about the hand-wavy bits of the FROST paper in re BFT. While it is usually possible to detect malicious participants as part of the protocol, it is not always trivial to handle this in a way that allows the protocol to succeed without manual intervention.

But this is actually an explicit design goal of FROST, especially in contrast to other threshold signature schemes. The RO in FROST stands for Round-Optimized, where we choose to make the happy path fast. This is justified because, in a PoS-style cryptocurrency system, it is possible to financially sanction detected malicious participants. So anyone choosing to be a byzantine actor can at most slow down the system, and they do so at a financial cost that is directly proportional to their ability to slow things down.
But, as you correctly note, there is a difference between being actively malicious (e.g. sending bad `DkgPrivateShares` or `SignatureShares`) and passively malicious (i.e. just not responding). In many ways the passively malicious actors are worse, because the only way to handle them is to use timeouts. And they can always argue that their failure to respond was a network issue, a p2p issue, a StackerDB issue, etc.
So we need to separate what we can handle automatically as part of a BFT protocol from governance questions about how to sanction various malicious interactions.
> This doesn't really alter the requirements of getting consensus in a byzantine environment. Just because a participant knows who they are trying to talk to doesn't mean that they'll be able to detect when a participant is non-responsive in the same way that other participants would detect the same.
I think it does make a couple things simpler if we just did a variant of PBFT in which we pick DKG coordinators in cadence with the arrival of Bitcoin blocks. To keep it simple, let's treat the completion of DKG as a single BFT state-transition -- it either runs to completion and we get an aggregate public key that represents >=70% of the signing slots, or it stalls trying.
We use Bitcoin block arrivals to trigger view changes, which allows the system to dislodge a faulty DKG coordinator without any additional communication from signers. When a signer observes a new Bitcoin block, it uses it to deterministically pick one of the PoX-registered signers to act as the coordinator for this view. For example, each signer could hash the Bitcoin block hash with each signer ID to get a per-signer coordinator hash, and the signer with the lexicographically first coordinator hash becomes the coordinator until the next Bitcoin block arrives. If this signer happens to be slow, offline, or faulty, then DKG stalls until the next Bitcoin block arrives, at which point a new coordinator will be chosen.
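As a minimal sketch of that deterministic pick (assuming SHA-256 as the hash and `u32` signer IDs, both of which are illustrative choices, not what stacks-signer uses):

```rust
use sha2::{Digest, Sha256};

/// Pick the coordinator for the current view: hash the Bitcoin block hash
/// with each signer ID, and choose the signer whose digest sorts first
/// lexicographically. Returns None only if there are no signers. Every
/// signer that sees the same Bitcoin block computes the same answer.
fn pick_coordinator(bitcoin_block_hash: &[u8; 32], signer_ids: &[u32]) -> Option<u32> {
    signer_ids.iter().copied().min_by_key(|id| {
        let mut hasher = Sha256::new();
        hasher.update(bitcoin_block_hash);
        hasher.update(id.to_be_bytes());
        let digest: [u8; 32] = hasher.finalize().into();
        digest
    })
}
```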
Once the coordinator for this view has been determined, it sends out a "DkgPrePrepare" message to its slot with the view ID (i.e. the Bitcoin block hash). Signers wait until they see the "DkgPrePrepare" message materialize in the expected slot, or until another view change happens (whichever happens first). When the other signers see the "DkgPrePrepare" message for this view and signer ID, they post a "DkgPrePrepareAck" message to their slots with the same view and signer IDs. Then, after a timeout, the coordinator carries out a census of live signers (i.e. those that sent their ACKs in time) and posts a "DkgPrepare" message with the view ID and census (i.e. the list of "DkgPrePrepareAck"s for this round). That gets us a census of which signers are assumed to be online for the duration of this DKG round.
Now that we have a census, the coordinator carries out DKG as it does today. If the coordinator goes offline, then obviously DKG doesn't complete and the system waits until the next Bitcoin block. If the coordinator equivocates or misbehaves, then signers in the census simply stop participating until the next Bitcoin block, which forces a view change and dislodges the faulty coordinator. If the coordinator is still running DKG when a view change happens, all other signers simply stop participating and wait for the new coordinator to send its "DkgPrePrepare" message.
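For concreteness, the three messages in this flow might look something like the following; the names mirror the prose above and are hypothetical, not existing stacks-signer types:

```rust
/// Hypothetical messages for the Bitcoin-block-driven view-change flow
/// described above. A view ID is the hash of the triggering Bitcoin block.
enum CoordinatorMessage {
    /// Posted by the view's coordinator to its StackerDB slot.
    DkgPrePrepare { view_id: [u8; 32] },
    /// Posted by each signer that sees the DkgPrePrepare before a timeout
    /// or the next view change.
    DkgPrePrepareAck { view_id: [u8; 32], signer_id: u32 },
    /// Posted by the coordinator after the ACK timeout; the census fixes
    /// which signers are assumed online for this DKG round.
    DkgPrepare { view_id: [u8; 32], census: Vec<u32> },
}
```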
How does that sound?
EDIT: it seems the above coordinator role is already implemented via the FIRE meta-protocol, in which an honest coordinator can already identify offline, slow, and faulty signers. If we just add the view-change logic to pick coordinators and carry out censuses until DKG completes, I think we'd be in good shape.
Right -- I think the FIRE meta-protocol is useful here, but it still needs a coordinator, and a coordinator selection process that can handle byzantine faults.
I do not think that the selection mechanism you outlined above would work. It wouldn't work for actual block signing (because waiting for a new Bitcoin block to select a new leader will not work in that case), and even in DKG, it would be fairly unresponsive to disconnected nodes: leader selection does not take connectedness into account; it just randomly selects a leader whether that leader is connected or not. This is why something like Raft requires prospective leaders to announce their candidacy.
I think we should use a standard BFT leader election protocol (and preferably library) to select the coordinator. I don't think there's much to gain by trying to develop a bespoke protocol for this.
> I think we should use a standard BFT leader election protocol (and preferably library) to select the coordinator. I don't think there's much to gain by trying to develop a bespoke protocol for this.
💯 yes please -- Aaron, do you have a recommendation?
> I do not think that the selection mechanism you outlined above would work. It wouldn't work for actual block signing (because waiting for a new Bitcoin block to select a new leader will not work in that case), and even in DKG, it would be fairly unresponsive to disconnected nodes: leader selection does not take connectedness into account; it just randomly selects a leader whether that leader is connected or not. This is why something like Raft requires prospective leaders to announce their candidacy.
Correct me if I'm wrong, but the coordinator is only needed for DKG, not signing rounds? IIRC signing rounds can be run by any Stacker (i.e. each Stacker is the coordinator for the signing round they initiate), and signing rounds for different data can happen in parallel (in principle). If a Stacker discovers that there are two or more ongoing signing rounds for the same data, then they use a deterministic tie-breaker to choose which round instance to continue participating in and which to abandon.
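Such a tie-breaker can be as simple as an ordering on round identifiers; a minimal sketch, assuming each concurrent round instance has some 32-byte ID (an illustrative assumption):

```rust
/// Given the IDs of all ongoing signing rounds observed for the same data,
/// keep the round whose ID sorts first and abandon the rest. Because the
/// ordering is deterministic, every Stacker that sees the same set of
/// rounds converges on the same one.
fn choose_round(round_ids: &[[u8; 32]]) -> Option<[u8; 32]> {
    round_ids.iter().min().copied()
}
```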
The concern I was trying to address in my proposal was to minimize the implementation complexity of a BFT DKG protocol by observing that we can treat a coordinator that can't complete DKG in one Bitcoin block time as broken, and automatically select a new coordinator. As long as DKG and aggregate key voting completes within 100 Bitcoin blocks, then we're fine -- we can tolerate the latency overhead of picking an offline or slow Stacker to be the coordinator for one round.
That said, I'm open to using a rigorously vetted BFT leader-election state-machine library if you know of one. It seems that some folks have at least tried to create Raft implementations in Rust that are consistent with its TLA+ spec, but I haven't looked at them in depth.
> I think we should use a standard BFT leader election protocol (and preferably library) to select the coordinator. I don't think there's much to gain by trying to develop a bespoke protocol for this.
While in general I support using mature protocols (and libraries!), I'm curious why you think coordinator selection would require this but miner selection does not. In my mind they are basically the same task.
> Correct me if I'm wrong, but the coordinator is only needed for DKG, not signing rounds? IIRC signing rounds can be run by any Stacker (i.e. each Stacker is the coordinator for the signing round they initiate), and signing rounds for different data can happen in parallel (in principle).
In theory we don't need a coordinator at all; both DKG and signing rounds complete just fine without it. The question is rather who triggers them and how, and how we decide they're complete. The coordinator doesn't have access to any data that anyone else doesn't.
I have tried to keep from doing signing rounds in parallel, just to avoid the complexity in the state machines. But it's totally doable.
> Correct me if I'm wrong, but the coordinator is only needed for DKG, not signing rounds?
My understanding is that the coordinator is also necessary for signing rounds: they are responsible for assembling B (the set of (nonce, commitment) pairs for every participant in the round), which stackers could disagree on, and for re-assembling the group response (though this could be done by anyone). I am sure @xoloki could speak more to this than me, though.
Doing leader selection in a BFT way is essentially the view-change protocol of PBFT, and I don't think there's really much in terms of other options there (Raft is fault-tolerant, but not BFT). As far as libraries go, I don't have a very positive sense of the state of these -- most PBFT implementations are tied to storage systems or transaction processors, which aren't really useful as drop-ins. There are several implementations of PBFT which could be promising, but none of them seem to provide a library interface (Sawtooth PBFT seems to be closest to what we'd want to do: pick a leader, and use it until you have to view-change).
@xoloki mentioned the ROAST meta-protocol (https://eprint.iacr.org/2022/550.pdf) in https://github.com/stacks-network/stacks-core/issues/4169. My initial read of that paper is that it provides what the system would actually want: the signing operation is asynchronous (in their terms) -- when signing a new block, enough concurrent instances of the signing operation are instantiated that one is guaranteed to succeed. Each of these instances uses a different semi-trusted coordinator responsible for removing non-responsive or malicious signers from the set. But that protocol only applies to signing -- the DKG itself would seemingly need a different meta-protocol.
> While in general I support using mature protocols (and libraries!), I'm curious why you think coordinator selection would require this but miner selection does not. In my mind they are basically the same task.
Miner selection is open-membership and requires spending funds to compete for selection. If a miner spends enough to win a tenure, but then refuses to mine blocks, that's fine -- the network can continue in the next block when a new miner is selected. If the set of signers is incapable of continued operation, there's essentially no recovery.
> In theory we don't need a coordinator at all; both DKG and signing rounds complete just fine without it. The question is rather who triggers them and how, and how we decide they're complete. The coordinator doesn't have access to any data that anyone else doesn't.
>
> I have tried to keep from doing signing rounds in parallel, just to avoid the complexity in the state machines. But it's totally doable.
Right -- this is something I'd like to push more on. If DKG and signing rounds can operate in a byzantine setting (which stackers are) without a semi-trusted coordinator, then there is no need for another protocol to handle this. Is this what ROAST provides?
> Is this what ROAST provides?
It seems so, but at the cost of an additional constant communication-overhead factor of n - t + 1 (with n signers and threshold t). For example, with n = 100 signers and a threshold of t = 70, that's up to 31 concurrent instances. Basically, every Stacker would run a WSTS DKG or signing round instance.
If we feel like that's prohibitively expensive, then we could use a deterministically-chosen semi-trusted coordinator as a happy path (doesn't matter how it's picked; it can't forge messages, only delay them), and fall back to the more expensive coordinator-less protocol if Stackers sense that forward progress is not being made (e.g. a certain number of Bitcoin blocks pass without DKG completing).
Just had a huddle to hash this out. The consensus is as follows.
For both DKG and signing rounds, we determine a schedule of Stackers to choose to be the FIRE coordinator. The order is random but deterministic (e.g. derived from chainstate). Stackers use the schedule in a round-robin fashion to choose which of them is going to act as the coordinator for either DKG or a signing round -- whichever is currently needed. A Stacker's tenure as the coordinator lasts as long as fewer than 70% of Stackers by signing weight have voted to advance to the next Stacker in the schedule. The votes to advance are written into StackerDB slots, so eventually all Stackers learn who the current coordinator is (this vote could be piggybacked onto the existing `Packet`s' headers, for example). Stackers vote to advance to the next coordinator in the schedule if they detect that the coordinator is offline (times out) or faulty. The schedule is not weighted by signing weight, since Stacker-coordinator failures are independent of it.
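A minimal sketch of that schedule logic, with illustrative types and the 70%-by-weight advance rule taken from the description above (not actual stacks-signer code):

```rust
/// Round-robin coordinator schedule. `order` is the random-but-deterministic
/// (e.g. chainstate-derived) ordering of signer IDs; `current` indexes the
/// Stacker currently acting as the FIRE coordinator.
struct CoordinatorSchedule {
    order: Vec<u32>,
    current: usize,
}

impl CoordinatorSchedule {
    fn current_coordinator(&self) -> u32 {
        self.order[self.current]
    }

    /// Advance to the next Stacker in the schedule once advance votes
    /// representing at least 70% of the total signing weight have been
    /// observed in StackerDB slots. Integer math avoids floating point.
    fn maybe_advance(&mut self, advance_vote_weight: u64, total_weight: u64) {
        if advance_vote_weight * 10 >= total_weight * 7 {
            self.current = (self.current + 1) % self.order.len();
        }
    }
}
```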
`wsts` v6 is now published, and will be included in @jferrant's next PR to address other issues. This will give us separate `dkg` and `sign` thresholds, with robustness at all stages of `stacks-signer`.
There is one improvement that I have been considering: we could allow `dkg` participants to drop out after sending `DkgPublicShares`, as long as we still get a threshold subset of them that completes `dkg`. I'll think about the implications and decide if it's safe to do so.
Ok, `wsts` v7 has been published with full DKG robustness. My placeholder `stacks-core` PR got accidentally merged yesterday, so I'll work with @jferrant, who has an open `stacks-core` PR #4249 for the message nonce changes also present in v7.
Related to #4059 (and especially the discussion there), the Stacks network must be capable of operating even with non-participation of some stackers. While the threshold for participation must clear the 70% requirement outlined in the SIPs, non-participating stackers cannot be allowed to prevent DKG from completing. This requires updates to the stacker-signer binary so that it supports a protocol that accommodates this.
I think that solving this problem is a form of distributed census, meaning that it will require a BFT protocol. My view on this is that a solution should incorporate a standard BFT protocol, rather than trying to do something ad hoc. But I'm open to being convinced otherwise (if there's a clear argument about why this isn't a byzantine census).