Stricter "failed" node definition for dBFT

roman-khimov commented 3 years ago

Summary or problem description We're using the notion of "failed" nodes for some dBFT logic (even though technically it'd be better to call them "unknown", we can't really judge if the node failed or not in BFT algorithm) and we count them this way:

Validators.Count(p => !LastSeenMessage.TryGetValue(p, out var value) || value < (Block.Index - 1))

Which means that any node that we didn't receive any message from for current or previous block is considered to be failed. This value is then used to determine whether to change view or not or whether to accept some messages or not.

I think it's problematic in that availability of the node a block ago doesn't say anything about its state for current block. Moreover, even for current block presence at view N doesn't mean that the node is fine at N+1. This is especially relevant wrt liveness lock described in paper from #2029 (and observed many times both in tests and in the wild), the node not receiving any packets initially will try to change view and if some node is locked in the Commit state it'll help others gather the required number of CVs and jump to the next view.

Do you have any solution you want to propose? I'd like to share and propose to adopt the modification we're using in neo-go to count "failed" nodes (initially added to 0.72.0 release February this year with a modification applied in 0.76.0, June this year) and we're doing it this way:

        for _, hv := range c.LastSeenMessage {
                if hv == nil || hv.Height < c.BlockIndex || hv.View < c.ViewNumber {
                        count++
                }
        }

Any node that we didn't receive anything from in current view of current height is considered to be in unknown state and added to this counter. It makes liveness lock less likely to happen (making consensus more robust), but we're not sure it solves this problem completely (most likely it's not).

Neo Version

Neo 2
Neo 3

Where in the software does this update applies to?

Consensus

vncoelho commented 3 years ago

We tried this configuration before, @roman-khimov.

I do not remember exactly, but by also considering messages on the current view we would skip change view as soon as messages arrives from, at least, 2f + 1 nodes: https://github.com/neo-project/neo/blob/5d5bd0222a4d2f10cf90f3b9b72386ae77ca750f/src/neo/Consensus/ConsensusService.cs#L639

I believe that this modification can be, yes, be implemented.

As we are discussing in your another recent opened issue #2058, I remember from our past experiments and trials that CountFailed is not really needed if we keep consistency across views. That is, by implementing an improvement such as dBFT 2.0+ we are probably going to be able to remove this synchrony condition.

vncoelho commented 2 years ago

@roman-khimov, it is more conservative to keep in the way it is now, considering the last view and the current view.

The LastSeenMessage is updated every time OnConsensusPayload is received.

I suggest that you update neo-go to be more conservative and consider inactivity from previous round, in order to avoid premature change view in edge cases.

roman-khimov commented 2 years ago

premature change view in edge cases.

We have exactly this without this change. Nodes that don't receive some messages in this round for any reason start a CV, with this change they instead try to make a recovery first and only if/when they get messages from other nodes they start a CV. So I think we get more conservative behavior with this change and it's also well-tested by now in production.

vncoelho commented 2 years ago

I see, @roman-khimov. I had confused my logic, you are right. In fact your implementation is more strict in terms of avoiding CV, because it does not consider previous messages. I believe that the first version of this mechanism in dbft 2.0 was like that.

neo-project / neo

Stricter "failed" node definition for dBFT #2057