near / mpc

42 stars 16 forks source link

Vote to kick the non last node will stuck in Resharing #679

Closed ailisp closed 3 months ago

ailisp commented 4 months ago

Description

In a test setup of three nodes, threshold of 2, vote from two nodes to kick the third node. If the third node is not the last node, the protocol contract state stucks at Resharing. This may indicate under some corner case of X nodes and Y threshold, resharing does not work properly.

Repro snippet (in a with_multichain_nodes):

            let state = wait_for::running_mpc(&ctx, Some(0)).await?;
            assert!(state.threshold == 2);
            assert!(state.participants.len() == 3);

            // node_0 goes offline
            let account_0 = near_workspaces::types::AccountId::from_str(
                state.participants.keys().nth(0).unwrap().clone().as_ref(),
            ).unwrap();

            // two node vote to kick it
            assert!(ctx.remove_participant(Some(&account_0)).await.is_ok());

// doesn't work, above nth(1) also not work, only nth(2) works. 
// Log shows that the vote_leave contract calls from right nodes, success, but contract protocol stucks in Resharing
volovyks commented 4 months ago

Yeap, the test code does not look suspicious to me. To clarify, the test is 100% functional if we are removing nth(2) and always fails when we remove nth(0) or nth(1)?

ailisp commented 4 months ago

@volovyks yes

ailisp commented 4 months ago

Figure out it's a protocol bug in state.progress from Resharing to WaitingForConsensus.

2024-07-12T09:11:25.673087Z INFO ThreadId(14) mpc_recovery_node::protocol: protocol unable to progress: CaitSithProtocolError(AssertionFailed("new public key does not match old public key"))

In further, this error comes from (cait-sith) protocol.poke(), this step: https://github.com/LIT-Protocol/cait-sith/blob/3066e9ab165783dc28cb39750105a28bfbb3985e/src/keyshare.rs#L170, will dive into it

ailisp commented 3 months ago

Cait-sith is good, the problem is we pass the wrong participants set to cait-sith. In the case of participants 0, 1, 2, we have contract state old participants:

[
{id: 0, pk: pk0, ...}
{id: 1, pk: pk1, ...}
{id: 2, pk: pk2, ...}
]

we remove p0, new participant in contract state becomes:

[
{id: 0, pk: pk1, ...}
{id: 1, pk: pk2, ...}
]

Passing to cait-sith, this reduced to simply [u32(0), u32(1)], it should be [1,2]

volovyks commented 3 months ago

Great finding @ailisp! It is vital to fix.

ailisp commented 3 months ago

Still wip for a fix. Have managed to make new participant in protocol reserve the correct participant id from old participants, and make protocol advance a little into WaitingForConsensus state, and remove_participant(node0) returns ok. But got mismatched participant error that prevent protocol proceed further, and the test still fails. Will continue debug my fix tomorrow