tangle-network / dkg-substrate

Multi-party threshold ECDSA (GG20) Substrate node
https://tangle.webb.tools/
GNU General Public License v3.0

[SPEC] Introduce `SigningManager` and split the Offline stage from the Voting Stage. #393

Closed · shekohex closed this 1 year ago

shekohex commented 2 years ago

Overview

The current implementation of the signing protocol is not the best we can have: it runs the offline stage and then the voting stage every time we see a new proposal, which is not as fast as we need. This implementation also couples the unsigned message to the offline stage, even though the message is not actually needed at that point; it is only required in the voting stage, the last stage of signing.

Research

A better design is a `SigningManager` that holds a list of pre-computed, ephemeral `CompletedOfflineStage`s, which we can later use to sign any message (in our case, an `UnsignedProposal`).

At the start, the `SigningManager` will be in its initial state: it will generate different sets of signers, prepare any other state required for creating many offline stages, and return the `SigningManager` instance. Calling `generate_offline_stages(n)` will start many tasks in the background, where each task creates n / k offline stages (k being the number of signing sets created previously), so that we end up with n `CompletedOfflineStage`s in total.

Once we have that list, we can simply use those stages to sign any message/proposal, which will be much faster.

Notes

On each new session (TODO: link to the new session spec here) we will reset the `SigningManager` to its initial state and start the process of generating offline stages again.

Examples

```rust
/// A manager that holds pre-computed offline stages and uses them to sign messages.
trait SigningManager {
    /// The type of the message we will try to sign.
    type Message: Clone + Encode + Decode + Send + Sync + 'static;
    /// Sign the given message using one of the pre-computed offline stages.
    async fn sign(&self, message: Self::Message) -> Result<(), DKGError>;
    /// Reset the Signing Manager to its initial state.
    fn reset(&self);
}

struct DKGSigners {
    /// The party indices of this signing set.
    s: Vec<u16>,
    completed_offline_stages: Vec<CompletedOfflineStage>,
}

struct DKGSigningManager {
    signers: HashSet<DKGSigners>,
    messages_to_be_signed: VecDeque<UnsignedProposal>,
}

struct DKGSigningManagerController {
    /// A channel to send commands to the background signing manager task.
    to_manager: Sender<Command>,
}

enum Command {
    Sign(UnsignedProposal),
    Reset,
}
```

```rust
impl DKGSigningManager {
    pub fn new(n: u16) -> Self {
        // Generate multiple signing sets for signing messages.
        // The goal is to successfully sign proposals immediately in the event that
        // some authorities are not present.
        //
        // For example, if we have authorities: [1,2,3] and we only generate a single
        // signing set (1,2), then if either party is absent, we will not be able to sign
        // until we handle a misbehaviour. Instead, we brute force sign with multiple sets.
        // For `n` authorities, to cover all signing sets of size `t+1`, we need to generate
        // (n choose (t+1)) sets.
        //
        // Sets with the same values are not unique. We only care about all unique, unordered
        // combinations of size `t+1`. i.e. (1,2), (2,3), (1,3) === (2,1), (3,2), (3,1)
        todo!()
    }

    async fn generate_offline_stages(&self, n: u32) -> Result<(), DKGError> {
        // Generate many offline stages for each set, and store them for later usage.
        // This method should run the protocols concurrently, on a different task for
        // each set, and wait until all of them finish.
        todo!()
    }

    async fn pick_next_completed_offline_stage(&self) -> Option<CompletedOfflineStage> {
        // Randomly pick a set from the signers and remove a `CompletedOfflineStage`;
        // if None, try the next signing set, and so on.
        todo!()
    }

    async fn run(self) {
        // A long-running background task that runs forever and executes the
        // commands it receives from the controller.
        todo!()
    }
}
```
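To make the combinatorics in `new` concrete, here is a minimal sketch of enumerating all unique, unordered signing sets of size `t+1`; the function name `generate_signing_sets` is illustrative, not part of the spec:

```rust
/// Illustrative sketch: enumerate all "n choose (t+1)" unique, unordered
/// signing sets over the authority indices 0..n.
fn generate_signing_sets(n: u16, t: u16) -> Vec<Vec<u16>> {
    fn recurse(start: u16, n: u16, k: usize, current: &mut Vec<u16>, out: &mut Vec<Vec<u16>>) {
        if current.len() == k {
            out.push(current.clone());
            return;
        }
        for i in start..n {
            current.push(i);
            recurse(i + 1, n, k, current, out);
            current.pop();
        }
    }
    let mut out = Vec::new();
    recurse(0, n, (t + 1) as usize, &mut Vec::new(), &mut out);
    out
}
```

For authorities indexed [0, 1, 2] and t = 1 this yields (0,1), (0,2), (1,2), matching the comment above.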

```rust
impl SigningManager for DKGSigningManagerController {
    type Message = UnsignedProposal;

    async fn sign(&self, message: Self::Message) -> Result<(), DKGError> {
        // Enqueue the message into the `messages_to_be_signed` queue; later, the
        // signing manager will dequeue it and try to sign it. To do so, the
        // controller sends the message as a command to be executed by the
        // signing manager task.
        todo!()
    }

    fn reset(&self) {
        // Instruct the signing manager to reset its internal state and start over.
        todo!()
    }
}
```
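As a sketch of the controller-to-task wiring implied above, assuming a tokio mpsc channel (the channel type actually used by the worker may differ):

```rust
use tokio::sync::mpsc;

/// Sketch: spawn the background signing-manager task and hand back a controller.
fn spawn_signing_manager(mut manager: DKGSigningManager) -> DKGSigningManagerController {
    let (to_manager, mut from_controller) = mpsc::channel::<Command>(1024);
    tokio::spawn(async move {
        // The long-running task: execute commands as they arrive.
        while let Some(cmd) = from_controller.recv().await {
            match cmd {
                Command::Sign(proposal) => manager.messages_to_be_signed.push_back(proposal),
                Command::Reset => { /* drop all pre-computed state and start over */ }
            }
        }
    });
    DKGSigningManagerController { to_manager }
}
```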

Questions/Issues

akileshtangella commented 2 years ago

SPEC looks good. I think we can get rid of the SigningManager trait and put those functions directly on the DKGSigningManager struct. Here is another way of organizing the code:

```rust
struct OfflineStageManager {
    signers: HashSet<DKGSigners>,
}

impl OfflineStageManager {
    // new, generate_offline_stages, pick_next_completed_offline_stage
}

struct MessagesToBeSigned {
    messages_to_be_signed: VecDeque<UnsignedProposal>,
}

impl MessagesToBeSigned {
    // enqueue_message, dequeue_message
}

struct DKGSigningManager {
    offline_stage_manager: OfflineStageManager,
    messages_to_be_signed: MessagesToBeSigned,
}

impl DKGSigningManager {
    // sign, reset, run
}
```

@shekohex let me know what you think.

shekohex commented 2 years ago

Yes, these are good suggestions; I will edit the spec to include them. The use of a trait here is just the beginning: we will soon start refactoring the worker into a system of smaller traits that work together, so that we can have a different implementation of each part and swap implementations in or out to test different strategies.

drewstone commented 1 year ago

Linking closed PR for provenance: https://github.com/webb-tools/dkg-substrate/pull/403

tbraun96 commented 1 year ago
```rust
/// Goal: given n peers and a threshold of t, ensure at least t+1 peers are able to completely sign a message.
///
/// Challenge: Peers can be unreliable or even malicious. Networking can be unreliable. Despite these challenges,
/// we want to ensure that the protocol is able to complete in a timely manner.
///
/// Currently, for each "signing round", we generate a list of "signing sets", which includes many permutations of
/// peers drawn from the best authorities. Once one completes, it gets submitted to the blockchain. So long as one
/// signing set completes, the signing round is considered complete. This is the naive approach, and it suffers from
/// high data volume and redundancy across the network.
///
/// New proposal: For each "signing round", we begin with the first "signing trial". A "signing trial" is defined as
/// an attempt at achieving at least one complete signing set. Each "signing trial" has a value, n_sets, which starts
/// at 2. The value of n_sets determines the number of permutations of peers that will be generated for the signing
/// sets in this signing trial. For example, if n_sets = 2, then we will generate 2 signing sets, each with t+1 peers.
/// If this trial fails, the next trial doubles n_sets to 4, and so on, until we either complete a signing set or
/// reach the maximum number of signing set permutations (which is equal to n choose (t+1)).
///
/// This approach helps reduce the number of messages sent across the network, and it also reduces the degree of
/// redundancy. There is a high chance that the first or second signing trial will succeed, and if it does, we can
/// stop there. If it does not, we can continue to increase the number of signing sets until we reach the maximum
/// number of signing sets, which is determined by the number of peers in the network.
pub struct SigningManagerV2<P> {
    trial_id: usize,
    n: usize,
    t: usize,
    _pd: PhantomData<P>,
}

impl<P: Get<u32> + Clone> SigningManagerV2<P> {
    pub fn new(n: usize, t: usize) -> Self {
        Self { n, t, trial_id: 0, _pd: Default::default() }
    }
}

pub struct SigningTrial<T> {
    pub sets: Vec<SigningSet>,
    pub trial_id: usize,
    /// What's going to be signed.
    pub message: T,
}

pub trait TrialBasedSigningManager {
    /// The message we are going to sign.
    type Message: Clone + Debug + Encode + Decode + Send + Sync + 'static;
    /// Generates the next trial, or `None` once all permutations are exhausted.
    fn next_trial(&mut self) -> Option<SigningTrial<Self::Message>>;
}

impl<P: Get<u32> + Clone> TrialBasedSigningManager for SigningManagerV2<P> {
    type Message = UnsignedProposal<P>;

    fn next_trial(&mut self) -> Option<SigningTrial<Self::Message>> {
        // n_sets doubles each trial: 2, 4, 8, ... up to n choose (t+1).
        let n_sets = 2usize.pow(self.trial_id as u32 + 1);
        let mut sets = Vec::with_capacity(n_sets);

        // TODO: generate the signing sets for this trial.

        self.trial_id += 1;
        // ...
    }
}
```
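For illustration, a caller could drive the trial loop like this; `run_signing_sets` (run all sets of a trial concurrently and return whether any completed) and `signing_exhausted_error` are hypothetical helpers, not part of the spec:

```rust
/// Hypothetical driver: request trials until one signing set completes or the
/// permutations are exhausted; each failed trial doubles n_sets.
async fn drive_signing<P: Get<u32> + Clone>(
    manager: &mut SigningManagerV2<P>,
) -> Result<(), DKGError> {
    while let Some(trial) = manager.next_trial() {
        if run_signing_sets(trial).await {
            return Ok(());
        }
    }
    // All n choose (t+1) permutations failed; surface an error to the caller.
    Err(signing_exhausted_error()) // hypothetical error constructor
}
```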
drewstone commented 1 year ago
tbraun96 commented 1 year ago

(continuing the spec here)

Assumptions

In general, we should assume that there will always be a pool of available nodes sufficient to completely sign at least one signing set. This allows us to assume that the TrialBasedSigningManager (TBSM) will always be able to succeed, even if it takes multiple trials.

Ensuring TBSM drives signing stage to success

In order for the TBSM to always succeed, we need to synchronize its state across the blockchain so that DKG nodes can participate in starting signing rounds in a coordinated way. We can add the following definitions and functions to the DKG pallet (approximately):

```rust
#[derive(Encode, Decode, Clone, Debug)]
pub enum SigningManagerState {
    /// Only set when we haven't begun signing yet.
    NeverStarted,
    /// Set right before signing is about to begin. New nodes joining will be able to participate.
    AboutToBegin { session_id: u64, trial_round: u32 },
    /// Set while the signing is still in progress. New nodes joining won't be able to participate.
    /// Active nodes are expected to retrieve the state after each signing trial, only ending the
    /// TBSM if the state is Resolved.
    Active { session_id: u64, trial_round: u32 },
    /// Set after the signing has finished.
    Resolved { session_id: u64, signed_proposal: Proposal },
}

/// The DKG clients send this to the OCSM.
#[derive(Encode, Decode, Clone, Debug)]
pub enum DKGSigningUpdate {
    Failed { session_id: u64, trial_round: u32 },
}

/// DKG clients query this to find the state.
fn signing_manager_state() -> Option<SigningManagerState>;
/// When a signing set has failed, the DKG clients store the update offchain and have it submitted on-chain.
fn contribute_observation(update: DKGSigningUpdate);
```

The OnChainSigningManager

The chain will have to track the state of the DKG clients. To do this, once the keygen stage is detected as complete by the blockchain, the blockchain triggers an update via the OnChainSigningManager (OCSM), which is really just an abstraction in the DKG pallet. The OCSM's duty is to keep track of the state, allowing DKG clients to asynchronously attempt to participate in signing, regardless of whether they just joined or not. Still, the only time DKG clients can truly begin to participate in signing is when the state is AboutToBegin. Once clients join, they are expected to run the TBSM locally.
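As a rough sketch of those OCSM transitions inside the pallet (the hook names and the `set_state` storage helper are assumptions, not the pallet's actual API):

```rust
/// Hypothetical pallet-side transitions driven by the OCSM.
fn on_keygen_complete(session_id: u64) {
    // Keygen finished: new and existing clients may join the signing.
    set_state(SigningManagerState::AboutToBegin { session_id, trial_round: 0 });
}

fn on_signing_started(session_id: u64, trial_round: u32) {
    // Signing underway: late joiners must wait until the state resolves.
    set_state(SigningManagerState::Active { session_id, trial_round });
}

fn on_signed_proposal_submitted(session_id: u64, signed_proposal: Proposal) {
    // A signed proposal reached the chain: the round is complete.
    set_state(SigningManagerState::Resolved { session_id, signed_proposal });
}
```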

Generating the signing sets

We can generate the signing sets similarly to the way we currently do. With the new method, we will not (directly) need a blockchain to disseminate the data, since the permutations will be deterministic and depend only on the set of best authorities. Instead of spawning n choose (t+1) permutations of signing sets, we begin the first signing trial with 2 sets, the 2 considered most likely to succeed. If k is the signing trial index (starting at 0, where we begin with 2 signing sets), then the number of permutations at signing trial k is f(k) = 2^(k+1).
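Because the permutations are deterministic given the best-authority set, every client can locally compute which sets belong to trial k from the same canonically ordered list of combinations. A sketch, with the caveat that the ordering function and whether later trials should also retry earlier sets are open details; this version hands out disjoint batches:

```rust
/// Hypothetical: select the signing sets for trial k out of a canonically
/// ordered list of all n choose (t+1) combinations.
fn sets_for_trial(all_sets: &[Vec<u16>], k: u32) -> &[Vec<u16>] {
    let start = 2usize.pow(k + 1) - 2; // sets consumed by trials 0..k (2 + 4 + ... = 2^(k+1) - 2)
    let count = 2usize.pow(k + 1);     // f(k) = 2^(k+1) sets in this trial
    let end = (start + count).min(all_sets.len());
    &all_sets[start.min(end)..end]
}
```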

Driving forward the TBSM

After the keygen stage, the pallet will need code to broadcast an update event to the DKG clients so that they receive a SigningManagerState::AboutToBegin update. The DKG clients will then generate the signing sets and spawn the corresponding signing protocol for each. If one of the signing protocols finishes, we will store the signed proposal offchain, where it will later be submitted on-chain (similar to how we handle the public key). Once on-chain, the OCSM will update its state, mark itself as SigningManagerState::Resolved, and emit an event to all DKG clients.

In the case that a signing trial fails (meaning all of its signing sets failed to run to completion), the DKG clients will each submit a DKGSigningUpdate::Failed to the OCSM. The OCSM will then be responsible for collating the results and determining when to begin the next trial round (once all signing sets have failed). Once the next trial round is needed, the OCSM emits an event to the DKG clients, beginning the next trial round, and the process repeats.
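End to end, the client side could look roughly like the loop below; `signing_manager_state` and `contribute_observation` are the pallet functions sketched earlier, while `run_signing_sets` is the same hypothetical helper as above:

```rust
/// Illustrative client loop for one signing round. Per the OCSM rules, a fresh
/// client may only start participating while the state is AboutToBegin; a node
/// already participating keeps going while the state is Active.
async fn participate<P: Get<u32> + Clone>(manager: &mut SigningManagerV2<P>) {
    loop {
        match signing_manager_state() {
            Some(SigningManagerState::AboutToBegin { session_id, trial_round })
            | Some(SigningManagerState::Active { session_id, trial_round }) => {
                if let Some(trial) = manager.next_trial() {
                    if !run_signing_sets(trial).await {
                        // Every set in this trial failed: report it so the OCSM
                        // can collate failures and start the next trial round.
                        contribute_observation(DKGSigningUpdate::Failed { session_id, trial_round });
                    }
                }
            }
            // The round is done once the state is Resolved.
            Some(SigningManagerState::Resolved { .. }) => break,
            _ => { /* wait for the next on-chain state update */ }
        }
    }
}
```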

shekohex commented 1 year ago

Just some notes about having the state on-chain:

tbraun96 commented 1 year ago

If we choose to use the gossiping engine instead for sharing state, one thing we could do is ensure that each connected peer gets a "greeter" packet that shares the current state of the signing. Also, whenever the gossip engine sends out a packet, we can append data to it containing information about the state of the signing. The issue with this approach is that, if a malicious client wants to spoof this packet, it could throw the state off for everyone; this is why using a blockchain could be helpful here. However, if we decide to use the gossiping engine, we could have a weighting algorithm for updates, where the best authorities that share their "proposed next signing manager state" are the most influential in setting the signing state for all peers.
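A rough sketch of that weighting idea: tally each proposed state with the proposing authority's weight and adopt the heaviest. The `AuthorityId` type and the weight source are assumptions here:

```rust
use codec::Encode; // SigningManagerState derives Encode
use std::collections::HashMap;

/// Hypothetical: choose the next signing-manager state by weighted vote among
/// the best authorities' gossiped proposals.
fn resolve_gossiped_state(
    proposals: &[(AuthorityId, SigningManagerState)],
    weight_of: impl Fn(&AuthorityId) -> u64,
) -> Option<SigningManagerState> {
    let mut tally: HashMap<Vec<u8>, (u64, SigningManagerState)> = HashMap::new();
    for (who, state) in proposals {
        // Use the state's SCALE encoding as the map key.
        let entry = tally.entry(state.encode()).or_insert((0, state.clone()));
        entry.0 += weight_of(who);
    }
    // The heaviest proposal wins; spoofed packets from low-weight peers are drowned out.
    tally.into_values().max_by_key(|(w, _)| *w).map(|(_, s)| s)
}
```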

shekohex commented 1 year ago

Proposed Solution

On each finalized block fN, we do the following: