sigma0-xyz / zkbitcoin

zkBitcoin: zero-knowledge proofs on Bitcoin!

Keepalive with Fibonacci Backoff, fault tolerant + randomised + concurrent orchestrator signing logic and more #18

Closed · RedaOps closed this 6 months ago

RedaOps commented 6 months ago

Hi! This is a pretty interesting project. I read through the code and decided to help implement some stuff :)

Issues

Implemented Features

Let's get into the exciting details!

MemberStatus

A new MemberStatus enum classifies the state of each MPC member inside the orchestrator.
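The variants aren't quoted here, but from the states used throughout this PR they are Online, Disconnected, and Offline; a minimal sketch (the retry-counter field is an assumption):

pub enum MemberStatus {
    /// Responding to keepalive pings; eligible for signing.
    Online,
    /// Missed a ping; being retried with Fibonacci backoff.
    Disconnected { retries: u64 },
    /// Exceeded KEEPALIVE_MAX_RETRIES; no longer pinged or selected for signing.
    Offline,
}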

The status of each node is held in a RwLock<MemberStatusState> structure, which is passed around inside the jsonrpsee context.
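Roughly, the wiring could look like this (the map inside MemberStatusState and the OrchestratorState name are assumptions; only the RwLock-in-context part is stated above):

use std::collections::HashMap;
use tokio::sync::RwLock;

// Assumed shape: each member's hex-encoded public key mapped to its status.
pub struct MemberStatusState {
    pub members: HashMap<String, MemberStatus>,
}

pub struct OrchestratorState {
    // ... other orchestrator state ...
    pub member_status: RwLock<MemberStatusState>,
}

An Arc of this state is then registered as the jsonrpsee method context, the same pattern as the Arc<NodeState> context in the handler quoted under Test 2.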

status jsonrpc method

Pretty self-explanatory; this is what the status method returns:

curl -X POST http://127.0.0.1:8890 -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","id":"test","method":"status","params":[11]}'

{"jsonrpc":"2.0","result":{"online_members":["0000000000000000000000000000000000000000000000000000000000000001"],"offline_members":["0000000000000000000000000000000000000000000000000000000000000002","0000000000000000000000000000000000000000000000000000000000000003"]},"id":"test"}

Fibonacci backoff keepalive

When the orchestrator starts, it pings the newly implemented ping endpoint of each MPC member to determine whether it is alive. It keeps querying the ping endpoint every KEEPALIVE_WAIT_SECONDS seconds (defined in constants.rs), pinging only Online or Disconnected nodes and updating their status accordingly. The Fibonacci backoff means that if a node doesn't respond (i.e. it is in the Disconnected state), the orchestrator retries after 5, 8, 13, 21, etc. seconds, until it reaches KEEPALIVE_MAX_RETRIES retries, at which point it classifies the node as Offline. A sketch of the schedule follows.
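A minimal sketch of that schedule (the constant value matches the (4/10) retry counter in the Test 2 log below; everything else is illustrative):

// Seconds to wait before the next ping of a Disconnected node:
// 5, 8, 13, 21, 34, ... (consecutive Fibonacci numbers).
const KEEPALIVE_MAX_RETRIES: u64 = 10;

fn backoff_delay_secs(retries: u64) -> u64 {
    // retry 0 waits 5s, retry 1 waits 8s, and so on
    let (mut prev, mut curr) = (3u64, 5u64);
    for _ in 0..retries {
        let next = prev + curr;
        prev = curr;
        curr = next;
    }
    curr
}

This lines up with the Test 2 log further down: retry 4 of 10 waits 34 seconds.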

New concurrent fault-tolerant signing logic

Here is how the new logic works; it is pretty fault tolerant (a code sketch follows the list):

  1. The orchestrator generates a list of Online nodes and shuffles them.
  2. If the number of available nodes is below the required threshold, the orchestrator returns a "not enough available signers" error.
  3. The orchestrator picks the first n nodes required to meet the threshold.
  4. It then does the first signing round concurrently.
  5. When checking the results, if any node has not returned a commitment, it classifies that node as Disconnected and jumps back to step 1. This way, it will try again with a new set of nodes.
  6. Otherwise, it then does the second signing round concurrently.
  7. When checking the results, if any node has not returned a signature, it classifies that node as Offline and jumps back to step 1.
  8. Otherwise, it generates the transaction and returns Ok.
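Sketched in code (every name here is illustrative; round_1 and round_2 stand in for the real concurrent signing calls, and THRESHOLD for the committee threshold):

use rand::seq::SliceRandom;

async fn fault_tolerant_sign(state: &OrchestratorState) -> Result<Transaction, SignError> {
    loop {
        // 1. list the Online nodes and shuffle them
        let mut online = state.online_members().await;
        online.shuffle(&mut rand::thread_rng());

        // 2. fail fast if the threshold can't be met
        if online.len() < THRESHOLD {
            return Err(SignError::NotEnoughAvailableSigners);
        }

        // 3. pick the first THRESHOLD nodes
        let signers = &online[..THRESHOLD];

        // 4.-5. round 1, run concurrently; a node that returns no
        // commitment is marked Disconnected and we restart at step 1
        let commitments = match round_1(signers).await {
            Ok(commitments) => commitments,
            Err(failed_node) => {
                state.mark_disconnected(&failed_node).await;
                continue;
            }
        };

        // 6.-7. round 2, run concurrently; a node that returns no
        // signature is marked Offline and we restart at step 1
        let signature = match round_2(signers, &commitments).await {
            Ok(signature) => signature,
            Err(failed_node) => {
                state.mark_offline(&failed_node).await;
                continue;
            }
        };

        // 8. build the transaction and return Ok
        return Ok(build_transaction(signature));
    }
}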

Testing

I have tested every new feature, but I highly encourage you to test it out more!

To test the fault tolerance, I created some new committee keys and spun up my own orchestrator and MPC nodes with a new ZKBITCOIN address. These are the tests I performed and their results:

Test 1

I tried "using" a zkapp while only 1/3 MPC nodes were online. I received the following error, which is expected:

Error: error while sending request to orchestrator

Caused by:
    0: bob request failed
    1: RPC error response: RpcError { code: -32001, message: "error while unlocking funds", data: Some(RawValue("the request didn't validate: not enough available signers")) }

Test 2

I spun up 2/3 nodes, but with a catch: one of the running nodes panics when asked for round 2:

async fn round_2_signing(
    params: Params<'static>,
    context: Arc<NodeState>,
) -> RpcResult<Round2Response> {
    panic!("Panic for fun"); // injected for testing; everything below is unreachable
    // get commitments from params
    let round2request: [Round2Request; 1] = params.parse()?;
    let round2request = &round2request[0];
    info!("received request: {:?}", round2request);
    // ... rest of the handler elided ...
The orchestrator didn't crash and returned the same error as in test 1. It attempted the signing logic, failed in round 2, and classified the failing node as Offline (step 7). It then retried, but saw only 1/3 available nodes and returned the error (step 2).

[2024-01-25T01:33:07Z WARN  zkbitcoin::committee::orchestrator] Round 2 error with http://127.0.0.1:8891, marking as offline and retrying from round 1: error sending request for url (http://127.0.0.1:8891/): connection closed before message completed
[2024-01-25T01:33:09Z WARN  zkbitcoin::committee::orchestrator] http://127.0.0.1:8893 is offline. Re-trying in 34 seconds... (4/10)

Test 3 - the most important test

I also ran test 2 but with 3/3 nodes running, where only one of them crashes when asked for the round 1 commitment. The orchestrator did randomly select the faulty node, but when it didn't receive a commitment from it, it classified that node as Disconnected, retried with the other 2 nodes, successfully generated a signature, and submitted the spend transaction. This is great fault tolerance!

Test 4

I spun up 2/3 nodes with no catches. Signing should succeed in this scenario, and it did.

Here is my ZKBITCOIN address: https://mempool.space/testnet/address/tb1przdyzca6zxlykmas4tvdum8qtac3m0ppsfm9p8akfckqkwnw07xs2u9cwf
There are 4 transactions: 2 deploys, one spend from test 3, and one spend from test 4.

If you have something to add or any suggestions on what I implemented I would love to hear them!

mimoo commented 6 months ago

Looking through the PR now! This is amazing already, so a big thank-you again for this PR :D

Also, for testing crashes and things like that, do you know https://crates.io/crates/fail? Maybe it would be a good idea to create some tests with it (with https://github.com/sigma0-xyz/zkbitcoin/issues/20).
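For instance, the hardcoded panic in round_2_signing from Test 2 could become a fail point that a test toggles on demand; a rough sketch (the fail point name is made up):

use fail::fail_point;

async fn round_2_signing(
    params: Params<'static>,
    context: Arc<NodeState>,
) -> RpcResult<Round2Response> {
    // No-op in normal runs; a test can inject a crash here with
    // fail::cfg("round2_panic", "panic"), or via the FAILPOINTS env var
    // after calling fail::FailScenario::setup().
    fail_point!("round2_panic");
    // ... rest of the handler unchanged ...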

RedaOps commented 6 months ago

Yeah, that's definitely something we should look into. I can do some research and open another PR once I find a good solution. Then we can even include the e2e and normal tests in the CI pipeline.