Closed RedaOps closed 6 months ago
Looking through the PR now! This is amazing already so a big thank-you again for this PR :D
also for testing crashes and things like that, do you know https://crates.io/crates/fail? Maybe it would be a good idea to create some tests with it (with https://github.com/sigma0-xyz/zkbitcoin/issues/20)
Yeah, that's definitely something we should look into. I can do some research and open another PR once I find a good solution. Then we can even include the e2e and the normal tests in the CI pipeline.
Hi! This is a pretty interesting project. Read through the code and decided to help implement some stuff :)
Issues
Implemented Features
- `generate-committee` failed if the provided directory didn't exist; it now creates it by default
- Moved some logs from `info` to `debug`. Imho, info level logs shouldn't be spammed by low level request messages
- Added a `status` method to the orchestrator jsonrpc which returns the alive/dead MPC nodes (using the key identifier, not address, which could be private to the orchestrator).
- `circom` and `snarkjs` binaries are required to do stuff, so I changed the README to be more clear :)

Let's get into the exciting details!
MemberStatus
A new `MemberStatus` enum was implemented which is used to classify the state of an MPC member inside the orchestrator:
- `Online` - The node is online and responsive
- `Disconnected` - The node is unresponsive but there will be reconnection attempts with fibonacci delays
- `Offline` - The node is unresponsive and reconnection will not be attempted anymore. This can happen either because we tried reconnecting too many times in the `Disconnected` state, or because the node generated a round 1 signing commitment but didn't manage to complete round 2, meaning it is very unreliable and maybe is running a different version than the other nodes.

The status of each node will be held inside a `RwLock<MemberStatusState>` structure which is passed inside the `jsonrpsee` context.
`status` jsonrpc method
Pretty self explanatory, this is what the status method returns:
Fibonacci backoff keepalive
When the orchestrator starts, it will ping the newly implemented `ping` endpoint of each MPC member and determine if it's alive or not. It will keep on querying the ping endpoint every `KEEPALIVE_WAIT_SECONDS` seconds (defined in `constants.rs`). It will only ping `Online` or `Disconnected` nodes and update the status accordingly. The fibonacci backoff means that if a node doesn't respond (is in the `Disconnected` state), it will try again in 5, 8, 13, 21 etc seconds until it reaches `KEEPALIVE_MAX_RETRIES` retries, at which point it will classify the node as `Offline`.
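The status transitions and delays described here can be sketched roughly as follows. This is a minimal, synchronous model and not the PR's actual code: the fields of `MemberStatusState`, the `retries` counter on `Disconnected`, the helpers `fibonacci_delay` and `on_ping_result`, and the value of `KEEPALIVE_MAX_RETRIES` are all assumptions made for illustration, and no real ping is sent.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Assumed value for illustration; the real constant lives in constants.rs.
const KEEPALIVE_MAX_RETRIES: u32 = 4;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemberStatus {
    Online,
    // `retries` = number of failed reconnection attempts so far (hypothetical field).
    Disconnected { retries: u32 },
    Offline,
}

// Hypothetical layout: statuses keyed by the member's key identifier.
struct MemberStatusState {
    statuses: HashMap<String, MemberStatus>,
}

// Seconds to wait before reconnection attempt number `retry` (0-based): 5, 8, 13, 21, ...
fn fibonacci_delay(retry: u32) -> u64 {
    let (mut a, mut b) = (5u64, 8u64);
    for _ in 0..retry {
        let next = a + b;
        a = b;
        b = next;
    }
    a
}

// Next status of a member given the outcome of one ping.
fn on_ping_result(status: MemberStatus, alive: bool) -> MemberStatus {
    match (status, alive) {
        // Offline members are never pinged again, so they stay Offline.
        (MemberStatus::Offline, _) => MemberStatus::Offline,
        (_, true) => MemberStatus::Online,
        (MemberStatus::Online, false) => MemberStatus::Disconnected { retries: 0 },
        (MemberStatus::Disconnected { retries }, false) => {
            if retries + 1 >= KEEPALIVE_MAX_RETRIES {
                MemberStatus::Offline
            } else {
                MemberStatus::Disconnected { retries: retries + 1 }
            }
        }
    }
}

fn main() {
    let state = RwLock::new(MemberStatusState {
        statuses: HashMap::from([("node-0".to_string(), MemberStatus::Online)]),
    });

    // Simulate a member that never answers a ping.
    loop {
        let mut guard = state.write().unwrap();
        let status = guard.statuses.get_mut("node-0").unwrap();
        *status = on_ping_result(*status, false);
        match *status {
            MemberStatus::Disconnected { retries } => {
                println!("node-0 disconnected, retrying in {}s", fibonacci_delay(retries));
            }
            MemberStatus::Offline => {
                println!("node-0 is now offline");
                break;
            }
            MemberStatus::Online => unreachable!("the simulated node never answers"),
        }
    }
}
```

In this model a single successful ping moves a `Disconnected` member straight back to `Online`, which is one possible reading of "update the status accordingly".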
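New concurrent fault-tolerant signing logic
This is how the new logic works and is pretty fault tolerant:
1. It takes the `Online` nodes and shuffles them.
2. If there are not enough of them, it returns the `not enough available signers` error.
3. It picks the `n` nodes required for the threshold.
4. If a node fails to return its round 1 commitment, it marks that node as `Disconnected` and jumps back to step 1. This way, it will try again with a new set of nodes.
5. If a node fails to complete round 2, it marks that node as `Offline` and jumps back to step 1.

Testing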
I have tested every new feature, but I highly encourage you to test it out more!
In order to perform the tests for the fault tolerance, I created some new committee keys and spun up my own orchestrator and MPC nodes with a new `ZKBITCOIN` address. These are the tests I have performed and the results:
Test 1
I tried "using" a zkapp while only 1/3 MPC nodes were online. I have received the following error, which is expected:
Test 2
I spun up 2/3 nodes, but with a catch. One of the running nodes is going to panic when asked for round 2:
The orchestrator didn't crash and just returned the same error as in test 1. This is because it tried doing the signing logic, but failed on round 2 and classified the failing node as `Offline`. It then tried again, but only saw 1/3 available nodes and returned the error.
Test 3 - the most important test
I also did test 2 but with 3/3 nodes running, where only one of them crashes when asked for its round 1 commitment. What happened is that the orchestrator actually did randomly select the faulty node, but when it didn't receive a commitment from it, it classified that node as `Disconnected`, and then tried again with the other 2 nodes and successfully generated a signature and submitted the spend transaction. This is great fault tolerance!
Test 4
I spun up 2/3 nodes with no catches. This should work in this scenario, and it did.
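The four outcomes above can be reproduced with a toy model of the retry loop. This is only a sketch under stated assumptions, not the PR's implementation: `Behavior`, `Node`, `node` and `sign` are hypothetical names, the real logic is concurrent and talks to MPC nodes over jsonrpc, and the shuffle is replaced by a fixed order so the faulty node is always picked first.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemberStatus {
    Online,
    Disconnected,
    Offline,
}

// How a simulated node behaves when asked to sign (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Behavior {
    Healthy,
    FailsRound1,
    FailsRound2,
}

struct Node {
    status: MemberStatus,
    behavior: Behavior,
}

fn node(status: MemberStatus, behavior: Behavior) -> Node {
    Node { status, behavior }
}

// Returns the indices of the nodes that produced the signature, or an error
// once fewer than `threshold` nodes are still `Online`.
fn sign(nodes: &mut [Node], threshold: usize) -> Result<Vec<usize>, &'static str> {
    loop {
        // Collect the Online nodes. The PR shuffles them; this sketch keeps
        // a fixed order so the result is deterministic.
        let online: Vec<usize> = nodes
            .iter()
            .enumerate()
            .filter(|(_, n)| n.status == MemberStatus::Online)
            .map(|(i, _)| i)
            .collect();
        if online.len() < threshold {
            return Err("not enough available signers");
        }
        let chosen = &online[..threshold];

        // Round 1: a node that returns no commitment becomes Disconnected.
        let failed = chosen
            .iter()
            .copied()
            .find(|&i| nodes[i].behavior == Behavior::FailsRound1);
        if let Some(bad) = failed {
            nodes[bad].status = MemberStatus::Disconnected;
            continue; // retry with a new set of nodes
        }

        // Round 2: a node that committed but cannot finish becomes Offline.
        let failed = chosen
            .iter()
            .copied()
            .find(|&i| nodes[i].behavior == Behavior::FailsRound2);
        if let Some(bad) = failed {
            nodes[bad].status = MemberStatus::Offline;
            continue;
        }

        return Ok(chosen.to_vec());
    }
}

fn main() {
    use Behavior::*;
    use MemberStatus::*;

    // Test 1: only 1/3 nodes online.
    let mut t1 = vec![node(Online, Healthy), node(Offline, Healthy), node(Offline, Healthy)];
    assert_eq!(sign(&mut t1, 2), Err("not enough available signers"));

    // Test 2: 2/3 nodes online, one of them panics in round 2.
    let mut t2 = vec![node(Online, Healthy), node(Online, FailsRound2), node(Offline, Healthy)];
    assert_eq!(sign(&mut t2, 2), Err("not enough available signers"));
    assert_eq!(t2[1].status, Offline);

    // Test 3: 3/3 nodes online, one of them crashes in round 1.
    let mut t3 = vec![node(Online, FailsRound1), node(Online, Healthy), node(Online, Healthy)];
    assert_eq!(sign(&mut t3, 2), Ok(vec![1, 2]));
    assert_eq!(t3[0].status, Disconnected);

    // Test 4: 2/3 healthy nodes online.
    let mut t4 = vec![node(Online, Healthy), node(Online, Healthy), node(Offline, Healthy)];
    assert_eq!(sign(&mut t4, 2), Ok(vec![0, 1]));

    println!("all four scenarios behave as described");
}
```

Every pass through the loop either returns or takes one node out of the `Online` set, so the loop always terminates.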
Here is my `ZKBITCOIN` address: https://mempool.space/testnet/address/tb1przdyzca6zxlykmas4tvdum8qtac3m0ppsfm9p8akfckqkwnw07xs2u9cwf
There are 4 transactions: 2 deploys, one spent on test 3 and one spent on test 4.
If you have something to add or any suggestions on what I implemented I would love to hear them!