
PBTS spec feedback (Manuel) #8628

Closed: angbrav closed this issue 1 year ago

angbrav commented 2 years ago

I have gone through the PBTS spec and have a bunch of comments. Instead of polluting PR #8600 with comments not specific to the PR, I have decided to open this issue. Please let me know if there is a better way to do it.

Main comments:

  1. I do not think it is clear what MSGDELAY exactly means until proposalReceptionTime(p,r) is defined. My problem is that before this definition, it is unclear whether MSGDELAY only accounts for the message delivery delay (in the network sense) or also for the time it takes the receiving process to enter the corresponding height and round. I think making this clear from the beginning is fundamental for the spec. Also, the spec sometimes says receive and sometimes delivery, which makes things even more confusing.
  2. I have some problems with the assumption that PRECISION>>ACCURACY. I think an upper bound on PRECISION can be derived from (1) |Cp(t) - t| <= ACCURACY, given (2) |Cp(t) - Cq(t)| <= PRECISION. So I do not think we can really assume PRECISION>>ACCURACY. Let me informally elaborate a bit (and please correct me if I am wrong); I spell the key step out right after this list:
    • Consider two processes p and q whose clocks are at the extremes, so that by (1), Cp(t) - t = ACCURACY and t - Cq(t) = ACCURACY.
    • Then Cp(t) - Cq(t) = 2*ACCURACY in this case, and in general, by the triangle inequality, |Cp(t) - Cq(t)| <= 2*ACCURACY for any two processes.
    • Thus a PRECISION of 2*ACCURACY already satisfies (2), so the tightest admissible PRECISION is at most 2*ACCURACY.
  3. I don't quite follow the proof of Derived Proof-of-Locks.
    • "As r > v.round, we can affirm that v was not produced in round r". I am not sure how you get this. Are you assuming that v is unique and can be produced only once? I think we need a case split here. Assume p is the correct process that sent the PREVOTE. Then either (i) p had nothing locked, (ii) p was locked on v, or (iii) p was locked on something else at a round < v_r.
    • "since a POL(v,r) was produced (by hypothesis) we can affirm that at least one correct process (also) observed a POL(v,v_r)" This assumes that the proposer picks up an existing value with vr != -1. We should consider the other cases, even if only to discard them by showing that they are infeasible.
    • "the above reasoning can be recursively applied until we get v_r' = v.round and observe a timely proof-of-lock" I would spell out the inductive argument. Once you prove that there exists a POL(v, vr) such that vr < r, the required result follows trivially from the induction hypothesis.
  4. (Extra) It would be nice to have an intuition why safety and liveness hold.
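
Spelling out the step from point 2 above (this is just my own derivation from (1) and (2), with C_p and C_q denoting the clocks of p and q):

```latex
% Triangle inequality applied to (1):
\[
  |C_p(t) - C_q(t)| \;\le\; |C_p(t) - t| + |t - C_q(t)| \;\le\; 2\,\mathrm{ACCURACY}
\]
% Hence a PRECISION of 2*ACCURACY already satisfies (2),
% i.e., the tightest admissible PRECISION is at most 2*ACCURACY.
```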

Minor comments to improve presentation:

cason commented 2 years ago

> 1. I do not think it is clear what MSGDELAY exactly means until proposalReceptionTime(p,r) is defined. My problem is that before this definition, it is unclear whether MSGDELAY only accounts for the message delivery delay (in the network sense) or also for the time it takes the receiving process to enter the corresponding height and round.

So, MSGDELAY does not account for process synchronization, namely, for the distinct times at which processes join a round of consensus. Processes joining rounds at different times is a problem for the consensus algorithm in general, not a limitation introduced by PBTS. Tendermint consensus assumes partial synchrony to ensure that eventually processes stay in the same round long enough to enable progress.

Being more precise here: the PBTS algorithm assumes that we register the receive time of every PROPOSAL message, to then compare it with the timestamp of the proposal using the timely predicate, in order to decide whether to vote for the proposed value or reject it (vote for nil). But the implementation of the algorithm only considers PROPOSAL messages that belong to the current round of consensus. Actually, it only considers the first PROPOSAL message of round r delivered to the consensus implementation while it is in that same round r.
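
For concreteness, the check I am describing would look roughly like the following (an illustrative Go sketch, not the actual implementation; the function and parameter names are made up, and the exact bounds are my paraphrase of the timely predicate):

```go
package pbts

import "time"

// Timely sketches the timely predicate: a PROPOSAL with timestamp ts,
// received at local time recvTime while the process is at the proposal's
// round, is acceptable if it is neither too far in the future (beyond the
// clock PRECISION) nor too old (beyond MSGDELAY plus PRECISION).
func Timely(ts, recvTime time.Time, precision, msgDelay time.Duration) bool {
	// Lower bound: ts <= recvTime + PRECISION.
	if recvTime.Add(precision).Before(ts) {
		return false
	}
	// Upper bound: recvTime <= ts + MSGDELAY + PRECISION.
	if recvTime.After(ts.Add(msgDelay + precision)) {
		return false
	}
	return true
}
```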

cason commented 2 years ago

> 2. I have some problems with the assumption that PRECISION>>ACCURACY.

This is really an assumption. We expect operators to choose a very pessimistic value for PRECISION. So, if validators have their clocks synchronized (e.g., using NTP) with a given ACCURACY with respect to real time, we expect PRECISION to be chosen as a value reasonably larger than 2*ACCURACY.

Extending this discussion a little, which was simplified over the revisions of this document. When you synchronize your local clock with an external source of time, you are actually synchronizing two local clocks. We assume that one of these local clocks, the source of time, is periodically synchronized with a trusted source of real time (e.g., GPS, an atomic clock, etc.). When the synchronization method reports a given "accuracy", it actually estimates the precision between the client (node) clock and the server (e.g., NTP) clock, and outputs this value with a reasonably high confidence. This value is what we are calling ACCURACY here. The synchronization procedure is not continuous; it is performed with a given regularity. During the interval between two synchronizations (usually called the period), the local clock diverges from the reference clock due to clock drift. This potential divergence has to be added when computing the precision of the local clock with respect to the reference clock. When we then consider the clocks of two nodes (two different clients of the synchronization protocol), the precision between their clocks will be larger than the precision of each of their clocks with respect to the reference clock, which in turn is larger than the accuracy reported by the clock synchronization mechanism. So, from this perspective, the assumption we make is pretty reasonable.
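
To give a back-of-the-envelope idea with made-up numbers (none of them come from the spec): say the synchronization protocol reports an accuracy a = 10 ms, the clock drift rate is rho = 50 ppm, and nodes re-synchronize every T = 1 hour. Then:

```latex
% Hypothetical values, for illustration only:
%   a    = 10 ms   (accuracy reported by the synchronization protocol)
%   rho  = 50 ppm  (clock drift rate)
%   T    = 3600 s  (synchronization period)
\[
  \text{drift per period} = \rho\,T = 5\cdot 10^{-5} \times 3600\,\mathrm{s} = 180\,\mathrm{ms}
\]
\[
  \text{node vs.\ reference} \approx a + \rho T = 190\,\mathrm{ms},
  \qquad
  \text{node vs.\ node} \approx 2\,(a + \rho T) = 380\,\mathrm{ms} \;\gg\; 10\,\mathrm{ms}.
\]
```

With numbers like these, an operator would want a PRECISION well above the 10 ms the synchronization protocol reports, which is the sense in which PRECISION >> ACCURACY is expected to hold in practice.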

The question here is whether all this discussion should be included in the specification (in a formal version, of course).

cason commented 2 years ago

From point (3)

"As r > v.round, we can affirm that v was not produced in round r".

Each proposal is unique, and it carries the round at which it was first proposed. The same proposal can be proposed several times, provided that the first time it was proposed it received 2f+1 PREVOTEs, thus making it a valid proposal. If a proposal produced in round r does not receive enough PREVOTEs, it will never be re-proposed in future rounds.

We defined a proposal as the tuple (value, timestamp, round) to make the assumption that proposed values are unique explicit. In the implementation this holds: identical blocks end up being different proposals, as they are signed by different validators, and if the same validator is the proposer of multiple rounds, the produced proposals will differ by their timestamps.
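
To make the tuple reading concrete (illustrative types only; the actual implementation uses different structures, as noted above):

```go
package pbts

import (
	"bytes"
	"time"
)

// Proposal mirrors the spec's tuple (value, timestamp, round); Round is the
// round at which the value was first proposed (v.round) and is carried
// unchanged when the value is re-proposed in later rounds.
type Proposal struct {
	Value     []byte
	Timestamp time.Time
	Round     int64
}

// sameProposal treats two proposals as identical only if all three
// components match, which is the uniqueness assumption discussed above.
func sameProposal(a, b Proposal) bool {
	return bytes.Equal(a.Value, b.Value) &&
		a.Timestamp.Equal(b.Timestamp) &&
		a.Round == b.Round
}
```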

cason commented 2 years ago

> I think we need a case split here.

On the first bullet of point (3). I am not sure if I understand what you suggested. Who is p in your comments?

cason commented 2 years ago

> This assumes that the proposer picks up an existing value with vr != -1. We should consider the other cases, even if only to discard them by showing that they are infeasible.

By other cases here you mean vr = -1? In that case, the timely predicate will be evaluated for the value, and therefore we are not talking about a derived POL but about a timely POL.

Notice that v.round does not exist in the algorithm; it is just an artifact we introduce to reason about the algorithm. What matters for the algorithm is whether vr = -1, in which case we need to do a full check of the proposed value, including its timestamp, or vr > -1, in which case we skip the timestamp check provided we know a POL(v, vr).
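
In code terms, the distinction is roughly the following (an illustrative sketch, reusing the Timely function from the earlier sketch and ignoring the locking rules, which are orthogonal here; valid and knownPOL are placeholders, not real functions):

```go
package pbts

import "time"

// PROPOSAL holds only the fields the branching below needs.
type PROPOSAL struct {
	Value      []byte
	Timestamp  time.Time
	Round      int64
	ValidRound int64 // vr: -1 for a fresh value, otherwise the POL round
}

// prevoteValue sketches the rule discussed above: a full check, including the
// timely predicate, when vr = -1; no timestamp check, but a POL(v, vr) check,
// when vr > -1. Returning false means prevoting nil.
func prevoteValue(p PROPOSAL, recvTime time.Time, precision, msgDelay time.Duration,
	valid func([]byte) bool, knownPOL func(value []byte, round int64) bool) bool {

	if p.ValidRound == -1 {
		return valid(p.Value) && Timely(p.Timestamp, recvTime, precision, msgDelay)
	}
	return valid(p.Value) && p.ValidRound < p.Round && knownPOL(p.Value, p.ValidRound)
}
```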

cason commented 2 years ago

> Once you prove that there exists a POL(v, vr) such that vr < r, the required result follows trivially from the induction hypothesis.

The existence of POL(v,vr) is part of the assumption in this step: if a correct process casts a PREVOTE(r,v) upon receiving a PROPOSAL(r,v,vr), then the process knows a POL(v,vr). Otherwise, it is not following the algorithm.
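
Spelled out, the inductive argument would go something like this (my formulation, so the exact statement may differ from the spec's wording):

```latex
\textbf{Claim.} If a $POL(v, r)$ with $r \ge v.round$ exists, then some correct
process observed a timely $POL(v, v.round)$.

\emph{Sketch (strong induction on $r$).}
Base case $r = v.round$: $POL(v, r)$ is itself the timely $POL(v, v.round)$,
since at round $v.round$ the value is fresh ($vr = -1$) and correct prevoters
checked the timely predicate.
Inductive step $r > v.round$: $POL(v, r)$ contains a $PREVOTE(r, v)$ from at
least one correct process, which only casts it upon a $PROPOSAL(r, v, vr)$
while knowing a $POL(v, vr)$ with $v.round \le vr < r$; applying the induction
hypothesis to $POL(v, vr)$ yields the timely $POL(v, v.round)$. \qed
```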

cason commented 2 years ago

> 4. (Extra) It would be nice to have an intuition why safety and liveness hold.

The actual correctness argument is the timely and derived POL part. The safety property here is a formal representation of the [Time-Validity] property introduced by PBTS. The liveness property is just a scenario that enables [Time-Validity] to be observed by all correct processes; it is a sufficient but not a necessary condition.

I confess I don't like these two properties that much, but they were derived from the TLA+ specification.

angbrav commented 2 years ago

> 2. I have some problems with the assumption that PRECISION>>ACCURACY.

> The question here is whether all this discussion should be included in the specification (in a formal version, of course).

I understand that. The problem is that the sentence talks about implementation-specific aspects in the middle of the specification, and it is confusing. I would extend the sentence, making clear that in practice these parameters would be approximated and that we expect PRECISION>>ACCURACY.

This is the sentence I am talking about:

"The reason for not adopting ACCURACY as a system parameter is the assumption that PRECISION >> ACCURACY. This allows us to consider, for practical purposes, that the PRECISION system parameter embodies the ACCURACY model parameter."

Something I do not understand is why the specification needs both, given that one follows from the other. But I guess that's a different discussion.

angbrav commented 2 years ago

> 1. I do not think it is clear what MSGDELAY exactly means until proposalReceptionTime(p,r) is defined.

> So, MSGDELAY does not account for process synchronization, namely, for the distinct times at which processes join a round of consensus.

In my understanding, it does account for process synchronization: MSGDELAY has to be sufficiently large to account for it, otherwise you won't consider the PROPOSAL from a correct process timely. This is actually what the definition of proposalReceptionTime(p,r) says:

> proposalReceptionTime(p,r) is the time p reads from its local clock when p is at round r and receives the proposal of round r.

My point is that this is not clear from the beginning, and the use of phrases like "receiving a proposal" is not precise. When the text says "receiving a proposal", it actually means receiving the message and being at the corresponding height and round.
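
In other words, the bookkeeping I read the definition as implying is roughly this (an illustrative sketch, not the real code; the type and field names are invented):

```go
package pbts

import "time"

// roundState records proposalReceptionTime(p, r) only for the round the
// process is currently in.
type roundState struct {
	height, round         int64
	proposalReceptionTime *time.Time // nil until the round's proposal is received at that round
}

// onProposal takes the local time only if the PROPOSAL matches the current
// height and round and no reception time was recorded yet; a PROPOSAL for a
// different height or round is ignored here (how early arrivals are buffered
// and redelivered when the round is entered is an implementation detail).
func (s *roundState) onProposal(propHeight, propRound int64, now time.Time) {
	if propHeight != s.height || propRound != s.round || s.proposalReceptionTime != nil {
		return
	}
	t := now
	s.proposalReceptionTime = &t
}
```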

angbrav commented 2 years ago

> From point (3)

What's (3)?

> Each proposal is unique, and it carries the round at which it was first proposed.

What prevents a Byzantine proposer from picking the value of an existing proposal and re-proposing it as new? Is it that the value is signed by the validator, thus making two proposals always unique? If that is the case, I think you need to bring that implicit assumption up into the specification, otherwise the proof is flawed. (Maybe it is there already and I missed it.)