hesusruiz opened this issue 1 year ago
> So for example, if we have a total of 21 replicas, instead of supporting 6 BFT failures we would like to support "only" 3, but we would like to still operate 21 replicas. In our configuration supporting 3 BFT failures, a quorum would require 7 replicas (2f+1) to ensure that any two quorums overlap in at least one correct replica.
How do they overlap if they are of size 7? The 2f+1 formula no longer holds in such a case. How can a quorum be a third of the total number of nodes?
If you set f to be less than a third you will still get a quorum that is at least half of the nodes, by the definition of a quorum.

I am not in favor of your suggestion, because the fundamental design principle of this library is that its consumer can use it without understanding the underlying protocol. Letting the user control f violates this principle because the user needs to understand the minimum and maximum range for f.
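Under the classical definition the arithmetic is: two quorums of size $q$ among $n$ nodes overlap in at least $2q - n$ members, and that overlap must contain at least one correct node, so

$$2q - n \ge f + 1 \quad\Longrightarrow\quad q \ge \left\lceil \frac{n + f + 1}{2} \right\rceil$$

For $n = 21$ and $f = 3$ this requires $q \ge 13$; two sets of 7 out of 21 need not intersect at all.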
> If you set f to be less than a third you will still get a quorum that is at least half of the nodes, by the definition of a quorum.
If we use the "traditional" definition of a quorum in the BFT literature, then I agree. This definition implicitly assumes that all participants in the execution of the consensus algorithm are pre-defined and typically change only via a vote of the majority of the current participants.
But with our approach, the requirement for a quorum is more aligned with how electronic voting is performed in real-world associations, which can be expressed as:
"A quorum is the minimum number of members of a group necessary to constitute the group at a meeting, while protecting the group against totally unrepresentative action in the name of the body by an unduly small number of members."
Translating that language to technical terms in the consensus algorithm: the group is the set of n replicas, the protection against action "by an unduly small number of members" is the bound f on the number of byzantine nodes, and the minimum number of members is the quorum q. While in many associations this is set to a majority of the total number of members, in many other groups, especially big ones, the requirement is relaxed and defined in their bylaws. As an extreme case, in the elections of a country, a majority of the citizens able to vote is never required.

In our case, the definition of quorum would be q = 2f + 1, where f is the desired protection level against byzantine nodes, because this ensures that in every voting round the correct nodes in the quorum are always a majority versus the byzantine nodes.
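For instance, with $f = 3$ this gives $q = 2 \cdot 3 + 1 = 7$: even if all 3 byzantine nodes sit inside the quorum, the remaining $q - f = 4$ correct nodes still form a majority of that voting round.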
> I am not in favor of your suggestion, because the fundamental design principle of this library is that its consumer can use it without understanding the underlying protocol. Letting the user control f violates this principle because the user needs to understand the minimum and maximum range for f.
I understand your reasoning, and I am thinking of creating a different (forked) library exactly for this reason.

But anyway, the users of the library have to understand the subtleties of byzantine faults and the differences with respect to CFT algorithms like Raft, which are the most prevalent in distributed systems right now. For example, regarding the level of protection against byzantine failures: what I mean is that in the current blockchain deployments I know of today, most users have no idea of the real protection they have, or they have to calculate it themselves, even though that is one of the most important properties of the system they use.

Anyway, thanks for your answer; we will probably go ahead with a fork of this library, to avoid any possible confusion to the users, as you mention.
Well, this library is a library for classical voting-based consensus, where a quorum is defined to be the smallest set size such that any two such sets intersect in at least one honest node.
If your use case is huge networks (thousands of participants), then I suggest you look at other consensus protocols, namely ones used in the permissionless blockchain space.

If you have a softer assumption on the number of byzantine nodes, then maybe you can select a random committee and have that committee run SmartBFT consensus?
This formula gives you the committee size as a function of the total number of nodes, the assumed number of actually byzantine nodes among them, and the "unsafety probability" (the probability that the random selection picks more than a third of the committee to be byzantine).
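As a sketch of the kind of computation meant here, assuming the committee is drawn uniformly without replacement (a hypergeometric model) and that a committee of size m tolerates floor((m-1)/3) byzantine members; all function names below are illustrative, not part of this library:

```go
package main

import (
	"fmt"
	"math"
)

// lnChoose returns ln(n choose k), or -Inf when the coefficient is zero.
func lnChoose(n, k int) float64 {
	if k < 0 || k > n {
		return math.Inf(-1)
	}
	a, _ := math.Lgamma(float64(n + 1))
	b, _ := math.Lgamma(float64(k + 1))
	c, _ := math.Lgamma(float64(n - k + 1))
	return a - b - c
}

// unsafety returns the probability that a uniformly random committee of
// size m, drawn from total nodes of which byz are byzantine, contains
// more byzantine members than BFT tolerates (more than floor((m-1)/3)).
func unsafety(total, byz, m int) float64 {
	tolerated := (m - 1) / 3
	p := 0.0
	for k := tolerated + 1; k <= m && k <= byz; k++ {
		// hypergeometric pmf: C(byz,k) * C(total-byz,m-k) / C(total,m)
		p += math.Exp(lnChoose(byz, k) + lnChoose(total-byz, m-k) - lnChoose(total, m))
	}
	return p
}

// minCommitteeSize returns the smallest committee size whose unsafety
// probability is at most eps, or -1 if even the full node set fails.
func minCommitteeSize(total, byz int, eps float64) int {
	for m := 4; m <= total; m++ {
		if unsafety(total, byz, m) <= eps {
			return m
		}
	}
	return -1
}

func main() {
	// e.g. 1000 nodes, at most 100 actually byzantine, unsafety 1e-6
	fmt.Println(minCommitteeSize(1000, 100, 1e-6))
}
```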
First, congratulations on a fantastic library (which for some reason stayed under my radar until recently).

I wanted to make a PR for some functionality, but first I would like to know your opinion, to see whether it makes sense and does not affect safety or liveness. Thanks in advance for that.
All PBFT-based implementations that I know derive f (max. faulty replicas) from n (total number of replicas), and this library is no different (in `computeQuorum(n)`). This is fine if you want to maximise resiliency against failures for a given cost of operating replicas (a given n), or alternatively minimise the cost of operating replicas for a given resiliency against byzantine failures.

But in our use case, we have many volunteer organisations (both private and public administrations), each willing to operate a replica, and cost of operation is not a problem. We would like to set the desired resiliency f independently of n (except for the relationship n >= 3f + 1).

So for example, if we have a total of 21 replicas, instead of supporting 6 BFT failures we would like to support "only" 3, but we would like to still operate 21 replicas. In our configuration supporting 3 BFT failures, a quorum would require 7 replicas (2f+1) to ensure that any two quorums overlap in at least one correct replica. By the way, this configuration would support 10 CFT failures instead of just 6 with the current library.
The changes to enable the configuration of f and modify `computeQuorum(n)` seem easy, but the important thing is whether this would affect the implementation in some wrong way.

If you are curious about why on earth we would want to "reduce" the resiliency of the network when we are already assuming the cost of operating a big number of replicas, I could elaborate. But in summary, we have been operating a BFT decentralised network in production for 3 years, and it may soon become much bigger (national scale). We want to have a reasonable BFT resiliency, a big CFT resiliency (crashes being the most common failure we have experienced), and a big number of different entities operating the network collaboratively (there should be no single operator at all).
So, our limit for n would be based on the protocol overhead, not on the desired level of BFT resiliency.