Closed angbrav closed 1 year ago
- I do not think it is clear what MSGDELAY exactly means until proposalReceptionTime(p,r) is defined. My problem is that before this definition, it is unclear whether MSGDELAY only accounts for the message delivery delay (in the network sense) or also for the process receiving the proposal having entered the corresponding height and round.
So, MSGDELAY
does not account for process synchronization, namely, for the distinct times process join a round of consensus. Processes joining rounds at different times is a problem for the consensus algorithm in general, not a limitation introduced by PBTS. Tendermint consensus assumes partial synchrony to ensure that eventually processes are in the same round long enough to propitiate progress.
Being more precise here. The PBTS algorithm assumes that we register the receive time of every PROPOSAL
message, to then compare it with the timestamp of the proposal using the timely
predicate, in order to decide whether to vote for the proposed value or reject it (vote for nil
). But the implementation of the algorithm only considers a PROPOSAL
messages that belong to the current round of consensus. Actually, it only considers the first PROPOSAL
message of round r
delivered to the consensus implementation while it is on the same round r
.
- I have some problems with the assumption that PRECISION>>ACCURACY.
This is really an assumption. We expect operators to chose a very pessimistic value for PRECISION
. So, if validators have their clock synchronized (e.g., using NTP) with a given ACCURACY
with real time, we expect PRECISION
to be chosen as value reasonable larger than 2*ACCURACY
.
Extending a little this discussion, that was simplified over the revisions of this document. When you synchronize your local clock with a external source of time, you are trying to synchronized two local clocks. We assume that one of this local clocks, the source of time, is periodically synchronized with a trusted source of real time (e.g., GPS, atomic clock, etc.). When the synchronization method provides your a given "accuracy", it actually tries to compute the precision between the client (node) clock and the server (e.g., NTP) clock and outputs this value with a reasonably high confidence interval. This value is what we are calling ACCURACY
here. The synchronization procedure is not continuous, it is performed with a given regularity. During the interval between two synchronizations (usually called the period), the local clock diverges from the reference clock due to clock drift. This potential divergence should be considered to compute the precision of the local clock with respect to the reference clock. When we then consider the clocks of two nodes (two different clients of the synchronization protocol), the precision of their clocks with be larger than the precision of each of theirs clocks with the reference clock, which is larger than the accuracy provided by the clock synchronization mechanism. So,from this perspective, the assumption we make is pretty reasonable.
The question here is whether all this discussion should be included, of course, in a formal version, in the specification.
From point (3)
"As r > v.round, we can affirm that v was not produced in round r".
Each proposal is unique, and it carries the round at which it was first proposed. The same proposal can be proposed several times, provided that in the first time it was proposed it received 2f+1 PREVOTES
, thus making it a valid proposal. If a proposal produced in round r
does not receive enough Prevotes, it will never be re-proposed in future rounds.
We defined proposal as tuple (value, timestamp, round) to make the assumption that proposed values are unique clear. In the implementation this is true, as identical blocks end up being different proposals, as they are signed by different validators. If the same validator is the proposer of multiple rounds, the produced proposals will differ by their timestamps.
I think we need a case split here.
On the first bullet of point (3). I am not sure if I understand what you suggested. Who is p
in your comments?
This assumes that the proposer picks up an existing value with vr != -1. We should consider other cases, even if to discard them by showing that are unfeasible.
By other cases here you mean vr = -1
? In this case, the timely
predicate will be evaluated for the value, and therefore we are not talking about a derived POL but of a timely POL.
Notice that v.round
does not exist in the algorithm, it is just an artifice the we create to prove the algorithm. What matters for the algorithm is whether vr = -1
, so we need to do a full check of the proposed value, including its timestamp, or vr > -1
, in this case we skip the timestamp check provided we know a POL(v, vr)
.
Once you prove that there exists a POL(v, vr) such that vr<r, then the required follows trivially from the induction hypothesis.
The existence of POL(v,vr)
is part of the assumption in this step: if a correct process cast a PREVOTE(r,v)
upon receiving a PROPOSAL(r,v,vr)
, then the processes knows a POL(v,vr)
. Otherwise, it is not following the algorithm.
- (Extra) It would be nice to have an intuition why safety and liveness hold.
The correct arguing is the timely and derived POL part. The safety here is a formal representation of the [Time-Validity] property introduced by PBTS. The liveness is just a scenario that enables [Time-Validity] to be observed by all correct processes. It is an "enough but not necessary" condition.
I confess I don't like these two properties that much, but they were conceived from the TLA+ specification.
- I have some problems with the assumption that PRECISION>>ACCURACY.
This is really an assumption. We expect operators to chose a very pessimistic value for
PRECISION
. So, if validators have their clock synchronized (e.g., using NTP) with a givenACCURACY
with real time, we expectPRECISION
to be chosen as value reasonable larger than2*ACCURACY
.Extending a little this discussion, that was simplified over the revisions of this document. When you synchronize your local clock with a external source of time, you are trying to synchronized two local clocks. We assume that one of this local clocks, the source of time, is periodically synchronized with a trusted source of real time (e.g., GPS, atomic clock, etc.). When the synchronization method provides your a given "accuracy", it actually tries to compute the precision between the client (node) clock and the server (e.g., NTP) clock and outputs this value with a reasonably high confidence interval. This value is what we are calling
ACCURACY
here. The synchronization procedure is not continuous, it is performed with a given regularity. During the interval between two synchronizations (usually called the period), the local clock diverges from the reference clock due to clock drift. This potential divergence should be considered to compute the precision of the local clock with respect to the reference clock. When we then consider the clocks of two nodes (two different clients of the synchronization protocol), the precision of their clocks with be larger than the precision of each of theirs clocks with the reference clock, which is larger than the accuracy provided by the clock synchronization mechanism. So,from this perspective, the assumption we make is pretty reasonable.The question here is whether all this discussion should be included, of course, in a formal version, in the specification.
I understand that. The problem is that the sentence talks about implementation-specific aspects in the middle of the specification, and it is consufing. I would extend the sentence making clear that in practice, these parameters would be approximated and we expect PRECISION>>ACCURACY.
This is the sentence I am talking about:
"The reason for not adopting ACCURACY as a system parameter is the assumption that PRECISION >> ACCURACY. This allows us to consider, for practical purposes, that the PRECISION system parameter embodies the ACCURACY model parameter."
Something I do not understand is why the specification needs both, given that ones follows from the other. But I guess that's a different discussion.
- I do not think it is clear what MSGDELAY exactly means until proposalReceptionTime(p,r) is defined. My problem is that before this definition, it is unclear whether MSGDELAY only accounts for the message delivery delay (in the network sense) or also for the process receiving the proposal having entered the corresponding height and round.
So,
MSGDELAY
does not account for process synchronization, namely, for the distinct times process join a round of consensus. Processes joining rounds at different times is a problem for the consensus algorithm in general, not a limitation introduced by PBTS. Tendermint consensus assumes partial synchrony to ensure that eventually processes are in the same round long enough to propitiate progress.Being more precise here. The PBTS algorithm assumes that we register the receive time of every
PROPOSAL
message, to then compare it with the timestamp of the proposal using thetimely
predicate, in order to decide whether to vote for the proposed value or reject it (vote fornil
). But the implementation of the algorithm only considers aPROPOSAL
messages that belong to the current round of consensus. Actually, it only considers the firstPROPOSAL
message of roundr
delivered to the consensus implementation while it is on the same roundr
.
In my understanding, it does account for process synchronization: MSGDELAY has to be sufficiently large to account for it, otherwise you won't consider the PROPOSAL from a correct process timely. This is actually what the definition of proposalReceptionTime(p,r)
says:
proposalReceptionTime(p,r)
is the time p reads from its local clock when p is at round r and receives the proposal of round r.
My point is that this is not clear from the beginning and the use of phrases like receiving a proposal are not precise. When the text says receiving a proposal it actually means receiving the message and being in the height and round.
From point (3)
What's (3)?
"As r > v.round, we can affirm that v was not produced in round r".
Each proposal is unique, and it carries the round at which it was first proposed. The same proposal can be proposed several times, provided that in the first time it was proposed it received
2f+1 PREVOTES
, thus making it a valid proposal. If a proposal produced in roundr
does not receive enough Prevotes, it will never be re-proposed in future rounds.We defined proposal as tuple (value, timestamp, round) to make the assumption that proposed values are unique clear. In the implementation this is true, as identical blocks end up being different proposals, as they are signed by different validators. If the same validator is the proposer of multiple rounds, the produced proposals will differ by their timestamps.
What does it prevent a Byzantine propose to pick the value of a proposal and repropose as new? It is that the value is signed by the validator thus making two proposals always unique? If that's the case, I think you need to bring that implicit assumption up to the specification, otherwise the proof is flawed. (Maybe it is there already and I missed it)
I have gone through the PBTS spec and have a bunch of comments. Instead of polluting PR #8600 with comments not specific to the PR, I have decided to open this issue. Please let me know if there is a better way to do it.
Main comments:
MSGDELAY
exactly means untilproposalReceptionTime(p,r)
is defined. My problem is that before this definition, it is unclear whetherMSGDELAY
only accounts for the message delivery delay (in the network sense) or also for the process receiving the proposal having entered the corresponding height and round. I think making sure this is clear from the beginning is fundamental for the spec. Also, sometimes the spec uses receive, others delivery, which makes things even more confusing.PRECISION>>ACCURACY
. I think an upper-bound onPRECISION
can be derived from (1)|Cp(t) - t| <= ACCURACY
, given (2)|Cp(t) - Cq(t)| <= PRECISION
. So I think we cannot really assumePRECISION>>ACCURACY
. Let me informally elaborate a bit (and please correct me if I am wrong):Cp(t) - t = ACCURACY
andt - Cq(t) = ACCURACY
.Cp(t) - Cq(t) <= 2*ACCURACY
.PRECISION<=2*ACCURACY
.PREVOTE
. Then either (i) p had nothing locked, (ii) was locked on v, (iii) locked on something else at a round < v_r.POL(v,r)
was produced (by hypothesis) we can affirm that at least one correct process (also) observed aPOL(v,v_r)
" This assumes that the proposer picks up an existing value withvr != -1
. We should consider other cases, even if to discard them by showing that are unfeasible.v_r' = v.round
and observe a timely proof-of-lock" I would spell out the inductive argument. Once you prove that there exists aPOL(v, vr)
such thatvr<r
, then the required follows trivially from the induction hypothesis.Minor comments to improve presentation: