Consensus node error while recovering from crash - f+1 commits and still changed view

vncoelho commented 1 year ago

Bug found while testing dBFT 3.0 PR.

However, it is suspected that this can also happen on current version:

neo> [12:58:47.717] OnStart
[12:58:49.169] Initialize: height=123019 view=0 index=3 role=PrimaryP1
[12:58:49.180] Sending RecoveryRequest: height=123019 view=0 nc=0 nf=0
[12:58:50.182] Sending PrepareRequest: height=123019 view=0 Id=0
[12:58:51.123] OnRecoveryMessageReceived: height=123019 view=0 index=0 pId=0
[12:58:51.135] OnCommitReceived: height=123019 view=0 index=0 nc=0 nf=0
[12:58:51.139] OnCommitReceived: height=123019 view=0 index=2 nc=0 nf=0
[12:58:51.141] Recovery finished: (valid/total) ChgView: 0/0 PrepReq: 0/0 PrepResp: 0/2 PreCommits: 3/3 Commits: 2/2
[12:58:51.143] OnCommitReceived: height=123019 view=0 index=0 nc=0 nf=0
[12:58:51.145] OnCommitReceived: height=123019 view=0 index=2 nc=0 nf=0
[12:58:56.538] Sending ChangeView: height=123019 view=0 nv=1 nc=0 nf=0 reason=Timeout
[12:58:57.005] OnRecoveryMessageReceived: height=123019 view=1 index=1 pId=0
[12:58:57.015] OnChangeViewReceived: height=123019 view=0 index=1 nv=1 reason=Timeout
[12:58:57.018] OnChangeViewReceived: height=123019 view=0 index=2 nv=1 reason=Timeout
[12:58:57.022] View changed: view=1 primary=03bbc4a365248c965200d7957e2575e14e932a04f7e43f834c689bbdcb726c44c3
[12:58:57.024] Initialize: height=123019 view=1 index=3 role=Backup

vncoelho commented 1 year ago

@ZhangTao1596, I am not sure this is just for dBFT 3.0. But also for dBFT 2.0.

vncoelho commented 1 year ago

In fact, it is even described in the current code. I think we wrote that some long time ago.

" // A possible attack can happen if the last node to commit is malicious and either sends change view after his // commit to stall nodes in a higher view, or if he refuses to send recovery messages. In addition, if a node // asking change views loses network or crashes and comes back when nodes are committed in more than one higher // numbered view, it is possible for the node accepting recovery to commit in any of the higher views, thus // potentially splitting nodes among views and stalling the network."

vncoelho commented 1 year ago

It also looks like this logic is not correct anymore:

        public bool NotAcceptingPayloadsDueToViewChanging => ViewChanging && !MoreThanFNodesCommittedOrLost;
        public bool MoreThanFNodesCommittedOrLost => (CountCommitted + CountFailed) > F;

shargon commented 1 year ago

@vncoelho Could you add a test in order to ensure that v2 is affected?

vncoelho commented 1 year ago

I could, but all test structure was removed when dbft was migrated to a Module, from neo core to here.

We need to add that previous infrastructure for akka testing here.

vncoelho commented 1 year ago

https://github.com/neo-project/neo/commit/3d640806ffe04ec634a2025abe5a7d966fa47d70

vncoelho commented 1 year ago

We need to mock everything here as we used to do before, @shargon.

Do you have a good direction for starting it to work? I could them add all tests. (ping @erikzhang)

ZhangTao1596 commented 1 year ago

I think this problem has always exists in master branch. Primary node will create different PrepareReqeust after restart if it hasn't committed, So that it will refuse previous PrepareResponse in RecoveryMessage.

vncoelho commented 1 year ago

@ZhangTao1596, this point you mentioned is another problem related to the PrepareRequest, which should also be fixed in the future. Perhaps it is good to open another issue explaining that one in order that we do not forget this possibility.

neo-project / neo-modules

Consensus node error while recovering from crash - f+1 commits and still changed view #780