cmwaters commented 2 years ago

Protocol Change Proposal

Currently, Tendermint consensus uses the BFT principle of 2f + 1 as the voting quorum needed to overcome f "byzantine" processes. Given that omission is also a type of fault whereby quorum must be n - f of f omission faults, quorum is thus measured as 2f + 1 from 3f + 1 total power or 2/3+. Thereby having 1/3 fault tolerance to errors of both omission and commission (byzantine)

This, however, treats both of these fault methods as equal in value to the application. This however might not be the case. Some applications may value greater tolerance to omission faults and others, commission faults. I would like to put forth the idea of having a variable quorum either to be set as a one-off by the application or genesis or dynamically as a consensus param. Here are a few examples of variable tradeoffs an application could make with their exposure to modes of failure:

Omission Fault Tolerance	Commission Fault Tolerance	Quorum
0.33	0.33	0.67
0.2	0.6	0.80
0.4	0.2	0.6
0.5	0.0	0.5
0	1.0	1.0

Thus, if q is the quorum (as a percent) selected by the application f_o is n-q and f_c is 2q - n. (NOTE: there might be an off by 1 or a greater than vs great and equal than error in my calculation but I think overall this is correct).

Just for further context on why f_c < 2q - n: Byzantine processes perform equivocation where they double sign for two different blocks and gossip their votes to two partitioned groups in the network. if there are f_c byzantine processes and n - f_c honest processes, then at most we can have two partitoned groups each of (n - f_c)/2. To trick both groups into committing for two different states the quorum must be less than (n-f_c)/2 + f_c (honest votes + byzantine votes) for each group. Doing some algebra you get f_c < 2q - n

For Admin Use

[ ] Not duplicate issue
[ ] Appropriate labels applied
[ ] Appropriate contributors tagged
[ ] Contributor assigned/self-assigned

josef-widder commented 2 years ago

I did some research around this idea some time ago. Here is a paper on clock synchronization that provides a model that combines different fault models: process failures (Definitions 6 and 7), link failures (Definition 8). It then gives an algorithm in Fig. 1 that uses the different fault bounds. One then obtains constraints on the system size that look like:

n > 3*f_a + 2*f_s + 2*f_i + f_c

The analysis for consensus under these fault model seems comparable, and others have looked at this, but perhaps not in the context of partial synchrony (I am sure that there are paper on synchronous systems, though). I am happy to discuss this if people find this path interesting.

cason commented 2 years ago

The most recent work that I know about with this mixed/hybrid model of failures is XFT:

https://www.usenix.org/conference/osdi16/technical-sessions/presentation/liu

Going through its related work, I could find UpRight, from which I think this concept of commision/omission comes:

https://dl.acm.org/doi/pdf/10.1145/1629575.1629602?casa_token=AsGFX_y-GLYAAAAA:NnRfu-4s_-CeAFpLhX-ktQkqx8u4RgNEThdNKJ_6kXUBmvxUZqNdCF68Lj0IWDYQnteRUgHqXcyh

And, apparently, the original concept of hybrid failures comes from this paper (didn't find available PDF):

https://www.computer.org/csdl/proceedings-article/reldis/1988/00025784/12OmNqJq4kR

tendermint / tendermint

Flexible BFT with variable quorum #7944

Protocol Change Proposal

For Admin Use