My first response is to strongly agree that there should be a way to adapt the synchrony parameter over time. This is effectively a hard requirement. Without it, misconfigured consensus parameters cannot easily be changed. I.e., if you set the `MSGDELAY` too low, how do you create a `timely` proposal for changing the `MSGDELAY`?
If I'm reading correctly, there are two possible proposals here:

1. Drop the `timely` check after a series of rounds, perhaps configurable by the network.
2. Increase the `MSGDELAY` bound by some Δ (either linear or exponential).

While proposal (1) was partially my idea, I'm now more in favor of (2). (2) allows for the liveness needed to update potentially misconfigured values while not completely losing synchrony in bad rounds. It also allows you to calculate how large the synchrony bound was for each height by using the `MSGDELAY` and Δ values along with the round number at which the block was produced. This has a few nice properties. First, networks can tune their `MSGDELAY` using this information to make the network produce blocks faster by enlarging the `MSGDELAY`. Second, external clients can also calculate how 'skewed' the network's clock may be, which may be useful for calculations using time.
In principle there would also be the option:

3. Increase the `MSGDELAY` bound by some Δ (either linear or exponential), but only starting after some number of rounds.

Proposal (1) we should rule out, as in such instances no relation between block time and real time (or reference time) can be ensured. (2) ensures that there is a relation; the potential difference just becomes larger.

(3) might be the most complex one to implement. But it would allow keeping the tight synchronization, e.g., in the case where the proposer of round 0 is crashed. At the same time it will ensure liveness after some rounds.
While I think that using adaptive synchrony parameters means giving up on synchrony, I see the need for a way to prevent the system from halting due to bad choices of parameters.
After sleeping on that matter, I tend to be more aligned with the option (3) proposed by Josef. Like option (1), it has no impact on the regular operation of the protocol, as the original (static) timely predicate will be adopted in practice. But, unlike option (1), we don't fully drop all timing bounds (and guarantees) in case of bad choices of parameters.
For me, two questions remain open:

1. From which round, and how, should we relax the `MSGDELAY` parameter?
2. Is it worth allowing this to also be a parameter?

For the first question, we might consider the number of failed rounds that indicates that `MSGDELAY` is actually the reason for the non-success of the previous rounds. Should it be something related to the number `n` of validators, or should it be fixed?
For the first question, the possibilities are really open. We can just double `MSGDELAY` after reaching this round limit, then double it again after some other higher-numbered round (maybe after the same number of rounds as in the first case)? I mentioned this mechanism because I don't think that relaxing the requirements at each new round is a good approach, again, as rounds tend to fail for other reasons.
In terms of implementation, I don't think it would be complicated, while it can be precisely defined. I assume it would be something like: if `r > round_limit` then add to the upper bound for `v.time` some value that is a function of `MSGDELAY` and `r`.
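To make this concrete, here is a minimal sketch in Go of such a relaxed check, assuming for illustration that `MSGDELAY` is doubled every `round_limit` rounds; all names are illustrative, not the actual Tendermint API:

```go
package pbts

import "time"

// isTimely sketches a relaxed timely check: the original predicate is used
// for the first roundLimit rounds, and the MSGDELAY bound is doubled once
// every roundLimit rounds after that.
func isTimely(proposalTime, receiveTime time.Time, round, roundLimit int32,
	precision, msgDelay time.Duration) bool {
	// Number of times the MSGDELAY bound has been relaxed (0 in the regular case).
	relaxations := round / roundLimit
	effectiveDelay := msgDelay * time.Duration(int64(1)<<uint(relaxations))

	// The proposal time must be neither too far in the future (bounded by the
	// clock precision) nor too far in the past (bounded by precision plus the
	// effective message delay) relative to the local receive time.
	earliestReceive := proposalTime.Add(-precision)
	latestReceive := proposalTime.Add(precision + effectiveDelay)
	return !receiveTime.Before(earliestReceive) && !receiveTime.After(latestReceive)
}
```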
OK, then let's introduce 'round_limit' and let's double the factor of 'MSGDELAY' in the timely function every 'round_limit' rounds.
I guess 'round_limit' should be the same at all validators, so we might need to make it a consensus parameter? @williambanfield, what do you think?
Maybe we can take the opportunity to adopt clearer, more understandable names for these parameters. A first proposal:

- `CLOCK_PRECISION`
- `PROPOSAL_DELAY` (but, to be more precise, it would be a latency)

I think that the idea of adding a 'round limit' value at which the `MSGDELAY` is doubled is generally a good idea. I'm not sure that we should actually give this to validators/networks as a tunable parameter. More tunable parameters means more confusion for developers and chains, and it's reasonably unclear how, as an application developer or validator on a chain, this parameter will actually affect the chain. I think it would be fine for us to pick a reasonable value for this, or do it in terms of the size of the validator set.
A possible option would be to do it after 1/3+ of the validator set has had a chance to propose. This way, there must have been some correct process that attempted to propose but that was not able to. Thoughts?
Not really a suggestion, but this value or parameter must be known by light clients or anyone planning to use block times for something. Associating it with the number of validators seems reasonable, but this also means that its value will be dynamic...
A point for having this round limit as a parameter is to consider other uses it can have (possibly in the future), other than relaxing synchronous requirements.
> I think it would be fine for us to pick a reasonable value for this, or do it in terms of the size of the validator set.

I do think that this makes sense and I like the idea of doubling it after 1/3+ of the validating power have proposed untimely headers.
> Not really a suggestion, but this value or parameter must be known by light clients or anyone planning to use block times for something

At the moment the light client never reads the consensus params (which it definitely could do) to understand the evidence age (and derive the trusting period from) nor to understand the timing params (which it uses as part of verification). They are currently all manually set by the operator and need to be updated each time these parameters change. We could definitely consider something more dynamic in the future.
> A point for having this round limit as a parameter is to consider other uses it can have (possibly in the future), other than relaxing synchronous requirements.

I think there's probably some validity to what you're saying, but I'm more of the opinion that we shouldn't be using future uses that we currently don't have an idea about as justification for adding parameters. Adding fields is generally a lot easier than removing them, so it's often better to wait until there are actual solid use cases before introducing a field.
> doubling it after 1/3+ of the validating power have proposed untimely headers

I don't think that this proposal is possible, because it requires knowledge of why a round didn't succeed. And this, in my understanding, will be an internal derivation that will be hard to "export". Of course, we can consider an approximation of that 1/3+ by considering `n/3` rounds, where `n` is the current number of active validators.
If adding it as a parameter adds too much complexity, can we just fix `round_limit` to 10?
> If adding it as a parameter adds too much complexity, can we just fix `round_limit` to 10?

While I agree with this solution, I still have to ask: why 10?
> I don't think that this proposal is possible, because it requires knowledge of why a round didn't succeed. And this, in my understanding, will be an internal derivation that will be hard to "export". Of course, we can consider an approximation of that 1/3+ by considering n/3 rounds, where n is the current number of active validators.

To clarify here, I don't think we need to actually know why the consensus round failed for us to use it, since it's just an estimate. Picking a value like 10 is fine with me as well. This value just exists to ensure liveness in the case of misconfiguration, not to really be theoretically perfect.
Not sure about 10 ;-) From a theoretical viewpoint liveness should be ensured for all values of `round_limit`, no?
We could also try to actually find a good approximation for +1/3, e.g., assuming the current voting powers, how long it takes to complete a "round-robin" that includes all validators as proposers, and perhaps try to understand what that means if the number of validators goes from 150 to 200?
iiuc, we want to start increasing `MSGDELAY` after `round_limit` rounds.

If the chain has a good `MSGDELAY` value then we should rarely see this mechanism kicking in (probably in periods of high network latency).

If the chain was misconfigured with a very low `MSGDELAY`, multiple rounds and the adaptive `MSGDELAY` will be seen frequently. In this case, we agree that the long-term solution is a chain upgrade with better parameters? And the adaptive mechanism is a diagnostic and temporary solution?

Is there any user feedback we offer that could help understand the different cases? (maybe we should start taking advantage of the consensus events and offer some tooling around them?)

I know I asked this before but, from my experience, chains tend to configure pretty conservative values. We frequently see, for example, IBC clients configured with 10sec-2hr clock drift. imho it is more likely that `MSGDELAY` will be many times larger than what it should be. What are the implications in this case?
> If the chain has a good MSGDELAY value then we should rarely see this mechanism kicking in (probably in periods of high network latency).

It should never kick in. `MSGDELAY` was conceived as a synchrony parameter, whose violation means that the system does not adhere to the model.
> In this case, we agree that the long-term solution is a chain upgrade with better parameters? And the adaptive mechanism is a diagnostic and temporary solution?

Definitely. The proposed approach is a workaround for tolerating bad consensus parameters. The point that @williambanfield made from the beginning is that it will not be possible to update the `MSGDELAY` parameter if we are not able to accept blocks due to a previous bad choice of this parameter.
> Is there any user feedback we offer that could help understand the different cases?

This is a very good point that we should address. My opinion in this regard is that a bad `MSGDELAY` should put the system in clearly impractical operation, with successive heights taking tens of rounds to commit. The operator should, at this point, understand what is going on and why the system became so slow.

This was one of the reasons why I prefer the approach of not making this parameter actually adaptive, i.e., not updating it often, after each failed round, so that when it is relaxed (after `round_limit` rounds) it should be noticeable by the user.
I came up with a simple prototype for using the amount of voting power seen on the network to determine the synchrony for the block timestamp. This prototype can be seen here:
https://github.com/tendermint/tendermint/commit/017d45f8e3782150ca84e86fff8d0ecbc007ec04
This would add the requirement that light clients wishing to verify the synchrony of a particular block be able to determine the validator 'priorities' in a trusted way at a given height. This can be queried via the `/validators` RPC endpoint, but I'm not sure if we consider this trusted for light clients?
The light client would then run the proposer selection algorithm starting at round 0 up to the round at which the block was decided, keeping track of the total amount of voting power seen. Every time the light client observes that 1/3+ of the voting power has proposed, it would add an additional `MESSAGEDELAY` to the synchrony bound that it assumes for the block.
This mechanism adds some complexity to light clients over the hardcoded alternative, but has the upside of being more capable of adapting to the size of the network. I think that if we don't mind adding this complexity to light clients, this is a somewhat cleaner alternative.
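For illustration, a rough sketch of that light-client calculation (hypothetical names and helper signatures, not the code from the linked prototype; only the standard `time` package is used):

```go
// relaxedMsgDelay replays proposer selection from round 0 up to (but not
// including) the decision round, and adds one extra MESSAGEDELAY every time
// more than 1/3 of the total voting power has had a chance to propose.
// proposerPower is assumed to return the voting power of the proposer that
// the selection algorithm yields for the given round.
func relaxedMsgDelay(msgDelay time.Duration, decisionRound int32,
	proposerPower func(round int32) int64, totalPower int64) time.Duration {
	var powerSeen int64
	thirdsSeen := 0
	for r := int32(0); r < decisionRound; r++ {
		// Approximation: a validator proposing in two rounds is counted twice.
		powerSeen += proposerPower(r)
		if powerSeen > totalPower/3 {
			thirdsSeen++
			powerSeen = 0
		}
	}
	// One extra MESSAGEDELAY per observed "third" of voting power.
	return msgDelay + time.Duration(thirdsSeen)*msgDelay
}
```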
Nice work, I have to say that his solution looks surprisingly simple, in the sense that I assumed it would require more lines, more complex tricks.
From a specification viewpoint, however, I would prefer to alter the `TimingParams` field provided to the method, since it seems to be passed by value, instead of making this computation (`MESSAGEDELAY *= k`) inside the method. This would allow generating a log message when we pass the round limit, and so are relaxing the timing parameters. This would also make it clear to whoever is reading the code why we are, in some circumstances, relaxing the timing parameters, as in the proposed implementation we would have `thirdsSeen == 0` in most regular cases.

In other words, I would make this change more visible to the code reader, which is a less clever solution, but a clearer one.
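A minimal sketch of what this could look like (illustrative names, using only the standard `time` package; not the prototype's actual code):

```go
// TimingParams is illustrative, not the actual struct in the prototype.
type TimingParams struct {
	Precision time.Duration
	MsgDelay  time.Duration
}

// relaxParams returns a copy of the (by-value) parameters with MSGDELAY
// extended, and logs when the non-regular case is taken, so the relaxation
// is visible at the call site rather than hidden inside the timely check.
func relaxParams(p TimingParams, thirdsSeen int,
	logf func(format string, args ...interface{})) TimingParams {
	if thirdsSeen > 0 {
		logf("relaxing MSGDELAY: 1/3+ of voting power proposed %d time(s) before this block was decided", thirdsSeen)
		p.MsgDelay += time.Duration(thirdsSeen) * p.MsgDelay
	}
	return p
}
```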
From the light client point of view, my first question is: how much does it care about the `MESSAGEDELAY` parameter used?

In regular operation (one or a few rounds per height) we don't need all this complexity. But if the round at which a block was decided is not zero (or is above some limit), the client should ask itself why, and then use the method to retrieve how many times `MESSAGEDELAY` was relaxed.
Again, I think this solution is great, but I would make it clearly a non-regular case, something inside an if clause in order to make this clear, to possibly generate some log message, and allow the client to react to this, maybe skipping this block as it is not as reliable as a block decided in round zero (one, or two).
I think what we should do now is to precisely specify what we can expect from a block, e.g., if the commit is for round 1 the time is within `x` of real time, and if the commit is for round 100, the time is within `y`. Given that `trusting_period << unbonding_period`, we don't need to change anything now, as the difference between `y` and `x` may be in the order of seconds, while the periods differ in the order of days.
However, as a next step, once we have the precise properties of the time in the block with respect to the round number, we can plan to add complexity to the light client and/or the analysis of light client security.
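As a back-of-the-envelope illustration only (assuming the `round_limit = 10` doubling sketch above, which is not a decided design, and example values `PRECISION = 500ms`, `MSGDELAY = 2s`): a commit at round 1 would place block time within roughly `PRECISION + MSGDELAY ≈ 2.5s` of real time, while a commit at round 100 would allow up to `PRECISION + 2^10 · MSGDELAY ≈ 34min`, still far below trusting and unbonding periods measured in days.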
> Nice work, I have to say that his solution looks surprisingly simple, in the sense that I assumed it would require more lines, more complex tricks.

I was worried about this as well, which is part of why I did the prototype: I wanted to see if it was really that bad, although the complexity is somewhat masked by the fact that the proposer-selection algorithm is already implemented here.
> This would allow generating a log message when we pass the round limit, and so are relaxing the timing parameters.

Yeah, we should definitely log this to the user and make it clear when it changes. The updated `IsTimely` method was just to quickly determine if this was possible, but I'm very open to making it more clear.
> I would make it clearly a non-regular case, something inside an if clause in order to make this clear, to possibly generate some log message, and allow the client to react to this, maybe skipping this block as it is not as reliable as a block decided in round zero (one, or two).

Agreed, this is not the regular case and we can make that obvious to users and readers of the code. I am waiting on user feedback from 1-2 application developers, but still believe that this 'thirdsSeen' mechanism is a good way forward.
After further consideration, I think that my solution will not work for the following reason:
My solution relies on the `ProposerPriority` field of the validator set. This field is returned in the response from the `/validators` RPC endpoint. This priority value is needed to determine how much voting power of the network had proposed by some round. This means that a light client could fetch this value from an RPC endpoint. However, this field is not hashed into the validator set hash that is included in the block and is therefore not actually verifiable. Therefore, light clients could fetch this value but could not actually trust it.

Since this solution is proving to be even more difficult than I had anticipated, I would advocate against using it and instead using a simple hardcoded value. To build confidence in this decision, I spoke with @ValarDragon from Osmosis, who similarly doesn't see much issue with the hardcoded value. I also spoke with @liamsi, who is still investigating but seems to be leaning in a similar direction. Let's hardcode 10 rounds before relaxation. We can re-consider during the next release if needed.
Implemented in https://github.com/tendermint/tendermint/pull/7739, closing this issue.
In order to ensure that blocks have timestamps that reflect the actual real time at which a block was proposed, PBTS adopts two synchrony parameters for validating proposal times:

- `PRECISION` bounds the difference of values read from clocks of processes at the same instant of real time. In terms of validating proposals, it is meant to prevent Byzantine validators from assigning to proposals times in the future.
- `MSGDELAY` bounds the time it takes for a `Proposal` message to be delivered to all (correct) validators. In terms of validating proposals, it is meant to prevent Byzantine validators from assigning to proposals times in the past.

While meant to prevent Byzantine proposers from assigning arbitrary times to proposed values, a mistaken choice of values for these parameters can represent a liveness issue. This is particularly relevant for the `MSGDELAY` parameter, expected to be a conservative estimation of the maximum delay for broadcasting a `Proposal`. If a Tendermint network adopts a value for the `MSGDELAY` parameter that is likely to be violated in real executions, correct validators may reject proposals from correct proposers because they are deemed `untimely`. More specifically, a correct validator will assume that the proposer assigned a timestamp in the past to the proposal, a malicious behavior, while the proposer was actually correct but the `Proposal` took longer than expected to be received.

An alternative to circumvent this scenario, i.e., bad choices for the synchrony parameters, is to relax the assumptions as rounds of consensus fail, so that eventually the timely requirements will not prevent a correct validator from accepting a value proposed (and timestamped) by a correct proposer.
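For reference, the two parameters combine roughly as follows in the (static) timely check, using illustrative symbol names rather than the exact notation of the PBTS specification: a proposal with timestamp `proposalTime`, received at local time `proposalReceiveTime`, is accepted as timely when `proposalTime - PRECISION <= proposalReceiveTime <= proposalTime + MSGDELAY + PRECISION`.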
A reference for how to relax timing parameters is already present in Tendermint, as the duration of the timeouts is an increasing function of the round number. As rounds fail, and new rounds are required, the timeouts become longer, so that eventually they stop expiring prematurely. The same approach could be adopted for the synchrony requirements of PBTS, but we would need to specify how the bounds will be relaxed as the round number increases.
Another alternative, mentioned in discussions, is to restrict the adoption of timing requirements for validating proposal times to a number of rounds, and then drop them. This would represent a similar approach, conceptually speaking, but instead of slowly relaxing the requirements, they would be "fully" relaxed (dropped) at some point.