
PBTS spec feedback (Manuel) #8628

Closed: angbrav closed this issue 1 year ago

angbrav commented 2 years ago

I have gone through the PBTS spec and have a bunch of comments. Instead of polluting PR #8600 with comments not specific to the PR, I have decided to open this issue. Please let me know if there is a better way to do it.

Main comments:

  1. I do not think it is clear what MSGDELAY exactly means until proposalReceptionTime(p,r) is defined. My problem is that before this definition, it is unclear whether MSGDELAY only accounts for the message delivery delay (in the network sense) or also for the time it takes the receiving process to enter the corresponding height and round. I think making this clear from the beginning is fundamental for the spec. Also, the spec sometimes says receive and sometimes delivery, which makes things even more confusing.
  2. I have some problems with the assumption that PRECISION>>ACCURACY. I think an upper bound on PRECISION can be derived from (1) |Cp(t) - t| <= ACCURACY, given (2) |Cp(t) - Cq(t)| <= PRECISION. So I do not think we can really assume PRECISION>>ACCURACY. Let me informally elaborate a bit (and please correct me if I am wrong); I spell the key step out right after this list:
    • Consider two processes p and q whose clocks are at the extremes, so that by (1), Cp(t) - t = ACCURACY and t - Cq(t) = ACCURACY.
    • Then Cp(t) - Cq(t) = 2*ACCURACY in this case, and in general, by the triangle inequality, |Cp(t) - Cq(t)| <= 2*ACCURACY for any two processes.
    • Thus a PRECISION of 2*ACCURACY already satisfies (2), so the tightest admissible PRECISION is at most 2*ACCURACY.
  3. I don't quite follow the proof of Derived Proof-of-Locks.
    • "As r > v.round, we can affirm that v was not produced in round r". I am not sure how you get this. Are you assuming that v is unique and can be produced only once? I think we need a case split here. Assume p is the correct process that sent the PREVOTE. Then either (i) p had nothing locked, (ii) p was locked on v, or (iii) p was locked on something else at a round < v_r.
    • "since a POL(v,r) was produced (by hypothesis) we can affirm that at least one correct process (also) observed a POL(v,v_r)" This assumes that the proposer picks up an existing value with vr != -1. We should consider the other cases, even if only to discard them by showing that they are infeasible.
    • "the above reasoning can be recursively applied until we get v_r' = v.round and observe a timely proof-of-lock" I would spell out the inductive argument. Once you prove that there exists a POL(v, vr) such that vr < r, the required result follows trivially from the induction hypothesis.
  4. (Extra) It would be nice to have an intuition why safety and liveness hold.
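
Spelling out the step from point 2 above (this is just my own derivation from (1) and (2), with C_p and C_q denoting the clocks of p and q):

```latex
% Triangle inequality applied to (1):
\[
  |C_p(t) - C_q(t)| \;\le\; |C_p(t) - t| + |t - C_q(t)| \;\le\; 2\,\mathrm{ACCURACY}
\]
% Hence a PRECISION of 2*ACCURACY already satisfies (2),
% i.e., the tightest admissible PRECISION is at most 2*ACCURACY.
```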

Minor comments to improve presentation:

cason commented 2 years ago

> 1. I do not think it is clear what MSGDELAY exactly means until proposalReceptionTime(p,r) is defined. My problem is that before this definition, it is unclear whether MSGDELAY only accounts for the message delivery delay (in the network sense) or also for the time it takes the receiving process to enter the corresponding height and round.

So, MSGDELAY does not account for process synchronization, namely, for the distinct times at which processes join a round of consensus. Processes joining rounds at different times is a problem for the consensus algorithm in general, not a limitation introduced by PBTS. Tendermint consensus assumes partial synchrony to ensure that eventually processes stay in the same round long enough to enable progress.

Being more precise here: the PBTS algorithm assumes that we register the receive time of every PROPOSAL message, to then compare it with the timestamp of the proposal using the timely predicate, in order to decide whether to vote for the proposed value or reject it (vote for nil). But the implementation of the algorithm only considers PROPOSAL messages that belong to the current round of consensus. Actually, it only considers the first PROPOSAL message of round r delivered to the consensus implementation while it is in that same round r.
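
For concreteness, the check I am describing would look roughly like the following (an illustrative Go sketch, not the actual implementation; the function and parameter names are made up, and the exact bounds are my paraphrase of the timely predicate):

```go
package pbts

import "time"

// Timely sketches the timely predicate: a PROPOSAL with timestamp ts,
// received at local time recvTime while the process is at the proposal's
// round, is acceptable if it is neither too far in the future (beyond the
// clock PRECISION) nor too old (beyond MSGDELAY plus PRECISION).
func Timely(ts, recvTime time.Time, precision, msgDelay time.Duration) bool {
	// Lower bound: ts <= recvTime + PRECISION.
	if recvTime.Add(precision).Before(ts) {
		return false
	}
	// Upper bound: recvTime <= ts + MSGDELAY + PRECISION.
	if recvTime.After(ts.Add(msgDelay + precision)) {
		return false
	}
	return true
}
```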

cason commented 2 years ago

> 2. I have some problems with the assumption that PRECISION>>ACCURACY.

This is really an assumption. We expect operators to choose a very pessimistic value for PRECISION. So, if validators have their clocks synchronized (e.g., using NTP) with a given ACCURACY with respect to real time, we expect PRECISION to be chosen as a value reasonably larger than 2*ACCURACY.

Extending this discussion a little, which was simplified over the revisions of this document. When you synchronize your local clock with an external source of time, you are actually synchronizing two local clocks. We assume that one of these local clocks, the source of time, is periodically synchronized with a trusted source of real time (e.g., GPS, an atomic clock, etc.). When the synchronization method reports a given "accuracy", it actually estimates the precision between the client (node) clock and the server (e.g., NTP) clock, and outputs this value with a reasonably high confidence. This value is what we are calling ACCURACY here. The synchronization procedure is not continuous; it is performed with a given regularity. During the interval between two synchronizations (usually called the period), the local clock diverges from the reference clock due to clock drift. This potential divergence has to be added when computing the precision of the local clock with respect to the reference clock. When we then consider the clocks of two nodes (two different clients of the synchronization protocol), the precision between their clocks will be larger than the precision of each of their clocks with respect to the reference clock, which in turn is larger than the accuracy reported by the clock synchronization mechanism. So, from this perspective, the assumption we make is pretty reasonable.
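
To give a back-of-the-envelope idea with made-up numbers (none of them come from the spec): say the synchronization protocol reports an accuracy a = 10 ms, the clock drift rate is rho = 50 ppm, and nodes re-synchronize every T = 1 hour. Then:

```latex
% Hypothetical values, for illustration only:
%   a    = 10 ms   (accuracy reported by the synchronization protocol)
%   rho  = 50 ppm  (clock drift rate)
%   T    = 3600 s  (synchronization period)
\[
  \text{drift per period} = \rho\,T = 5\cdot 10^{-5} \times 3600\,\mathrm{s} = 180\,\mathrm{ms}
\]
\[
  \text{node vs.\ reference} \approx a + \rho T = 190\,\mathrm{ms},
  \qquad
  \text{node vs.\ node} \approx 2\,(a + \rho T) = 380\,\mathrm{ms} \;\gg\; 10\,\mathrm{ms}.
\]
```

With numbers like these, an operator would want a PRECISION well above the 10 ms the synchronization protocol reports, which is the sense in which PRECISION >> ACCURACY is expected to hold in practice.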

The question here is whether all this discussion should be included in the specification (in a formal version, of course).

cason commented 2 years ago

From point (3)

"As r > v.round, we can affirm that v was not produced in round r".

Each proposal is unique, and it carries the round at which it was first proposed. The same proposal can be proposed several times, provided that the first time it was proposed it received 2f+1 PREVOTEs, thus making it a valid proposal. If a proposal produced in round r does not receive enough PREVOTEs, it will never be re-proposed in future rounds.

We defined a proposal as the tuple (value, timestamp, round) to make the assumption that proposed values are unique explicit. In the implementation this holds: identical blocks end up being different proposals, as they are signed by different validators, and if the same validator is the proposer of multiple rounds, the produced proposals will differ by their timestamps.
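
To make the tuple reading concrete (illustrative types only; the actual implementation uses different structures, as noted above):

```go
package pbts

import (
	"bytes"
	"time"
)

// Proposal mirrors the spec's tuple (value, timestamp, round); Round is the
// round at which the value was first proposed (v.round) and is carried
// unchanged when the value is re-proposed in later rounds.
type Proposal struct {
	Value     []byte
	Timestamp time.Time
	Round     int64
}

// sameProposal treats two proposals as identical only if all three
// components match, which is the uniqueness assumption discussed above.
func sameProposal(a, b Proposal) bool {
	return bytes.Equal(a.Value, b.Value) &&
		a.Timestamp.Equal(b.Timestamp) &&
		a.Round == b.Round
}
```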

cason commented 2 years ago

> I think we need a case split here.

On the first bullet of point (3). I am not sure if I understand what you suggested. Who is p in your comments?

cason commented 2 years ago

> This assumes that the proposer picks up an existing value with vr != -1. We should consider the other cases, even if only to discard them by showing that they are infeasible.

By other cases here you mean vr = -1? In that case, the timely predicate will be evaluated for the value, and therefore we are not talking about a derived POL but about a timely POL.

Notice that v.round does not exist in the algorithm; it is just an artifact we introduce to reason about the algorithm. What matters for the algorithm is whether vr = -1, in which case we need to do a full check of the proposed value, including its timestamp, or vr > -1, in which case we skip the timestamp check provided we know a POL(v, vr).
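
In code terms, the distinction is roughly the following (an illustrative sketch, reusing the Timely function from the earlier sketch and ignoring the locking rules, which are orthogonal here; valid and knownPOL are placeholders, not real functions):

```go
package pbts

import "time"

// PROPOSAL holds only the fields the branching below needs.
type PROPOSAL struct {
	Value      []byte
	Timestamp  time.Time
	Round      int64
	ValidRound int64 // vr: -1 for a fresh value, otherwise the POL round
}

// prevoteValue sketches the rule discussed above: a full check, including the
// timely predicate, when vr = -1; no timestamp check, but a POL(v, vr) check,
// when vr > -1. Returning false means prevoting nil.
func prevoteValue(p PROPOSAL, recvTime time.Time, precision, msgDelay time.Duration,
	valid func([]byte) bool, knownPOL func(value []byte, round int64) bool) bool {

	if p.ValidRound == -1 {
		return valid(p.Value) && Timely(p.Timestamp, recvTime, precision, msgDelay)
	}
	return valid(p.Value) && p.ValidRound < p.Round && knownPOL(p.Value, p.ValidRound)
}
```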

cason commented 2 years ago

> Once you prove that there exists a POL(v, vr) such that vr < r, the required result follows trivially from the induction hypothesis.

The existence of POL(v,vr) is part of the assumption in this step: if a correct process casts a PREVOTE(r,v) upon receiving a PROPOSAL(r,v,vr), then the process knows a POL(v,vr). Otherwise, it is not following the algorithm.
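
Spelled out, the inductive argument would go something like this (my formulation, so the exact statement may differ from the spec's wording):

```latex
\textbf{Claim.} If a $POL(v, r)$ with $r \ge v.round$ exists, then some correct
process observed a timely $POL(v, v.round)$.

\emph{Sketch (strong induction on $r$).}
Base case $r = v.round$: $POL(v, r)$ is itself the timely $POL(v, v.round)$,
since at round $v.round$ the value is fresh ($vr = -1$) and correct prevoters
checked the timely predicate.
Inductive step $r > v.round$: $POL(v, r)$ contains a $PREVOTE(r, v)$ from at
least one correct process, which only casts it upon a $PROPOSAL(r, v, vr)$
while knowing a $POL(v, vr)$ with $v.round \le vr < r$; applying the induction
hypothesis to $POL(v, vr)$ yields the timely $POL(v, v.round)$. \qed
```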

cason commented 2 years ago

> 4. (Extra) It would be nice to have an intuition why safety and liveness hold.

The actual correctness argument is the timely and derived POL part. The safety property here is a formal representation of the [Time-Validity] property introduced by PBTS. The liveness property is just a scenario that enables [Time-Validity] to be observed by all correct processes; it is a sufficient but not a necessary condition.

I confess I don't like these two properties that much, but they were derived from the TLA+ specification.

angbrav commented 2 years ago

> 2. I have some problems with the assumption that PRECISION>>ACCURACY.

> The question here is whether all this discussion should be included in the specification (in a formal version, of course).

I understand that. The problem is that the sentence talks about implementation-specific aspects in the middle of the specification, and it is confusing. I would extend the sentence, making clear that in practice these parameters would be approximated and that we expect PRECISION>>ACCURACY.

This is the sentence I am talking about:

"The reason for not adopting ACCURACY as a system parameter is the assumption that PRECISION >> ACCURACY. This allows us to consider, for practical purposes, that the PRECISION system parameter embodies the ACCURACY model parameter."

Something I do not understand is why the specification needs both, given that one follows from the other. But I guess that's a different discussion.

angbrav commented 2 years ago

> 1. I do not think it is clear what MSGDELAY exactly means until proposalReceptionTime(p,r) is defined.

> So, MSGDELAY does not account for process synchronization, namely, for the distinct times at which processes join a round of consensus.

In my understanding, it does account for process synchronization: MSGDELAY has to be sufficiently large to account for it, otherwise you won't consider the PROPOSAL from a correct process timely. This is actually what the definition of proposalReceptionTime(p,r) says:

> proposalReceptionTime(p,r) is the time p reads from its local clock when p is at round r and receives the proposal of round r.

My point is that this is not clear from the beginning, and the use of phrases like "receiving a proposal" is not precise. When the text says "receiving a proposal", it actually means receiving the message and being at the corresponding height and round.
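
In other words, the bookkeeping I read the definition as implying is roughly this (an illustrative sketch, not the real code; the type and field names are invented):

```go
package pbts

import "time"

// roundState records proposalReceptionTime(p, r) only for the round the
// process is currently in.
type roundState struct {
	height, round         int64
	proposalReceptionTime *time.Time // nil until the round's proposal is received at that round
}

// onProposal takes the local time only if the PROPOSAL matches the current
// height and round and no reception time was recorded yet; a PROPOSAL for a
// different height or round is ignored here (how early arrivals are buffered
// and redelivered when the round is entered is an implementation detail).
func (s *roundState) onProposal(propHeight, propRound int64, now time.Time) {
	if propHeight != s.height || propRound != s.round || s.proposalReceptionTime != nil {
		return
	}
	t := now
	s.proposalReceptionTime = &t
}
```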

angbrav commented 2 years ago

> From point (3)

What's (3)?

> Each proposal is unique, and it carries the round at which it was first proposed.

What prevents a Byzantine proposer from picking the value of an existing proposal and re-proposing it as new? Is it that the value is signed by the validator, thus making two proposals always unique? If that is the case, I think you need to bring that implicit assumption up into the specification, otherwise the proof is flawed. (Maybe it is there already and I missed it.)