Background
@AlvinHon observes that a ParallelChain protocol validator often reaches a steady state where it lags 1-2 blocks behind the other validators and never gets to participate in consensus decisions (it never votes).
Problem
After some sequence diagram analysis, we identified that this is caused by a scenario where:
A validator’s (“lagging validator”) blockchain and cur_view are lagging behind the quorum.
It receives a proposal with block.justify.view > cur_view, causing ProgressMessageStub::recv to return with ReceivedQCFromFuture and the lagging validator to go into sync.
The lagging validator sends a SyncRequest to an up-to-date validator. However, at the point of receiving the request, the up-to-date validator is still executing validate_block on the same proposal and therefore block has not been inserted into its block tree. Thus, the SyncResponse it sends only includes the chain up to block’s parent (parent_block).
The lagging validator exits sync and re-enters progress mode at view parent_block.justify.view + 1, which in the steady state is the view parent_block was proposed in.
An up-to-date validator finishes validating block, moves on to the next view, becomes the leader, and sends out a proposal containing child_block.
The lagging validator receives this new proposal, but since child_block.justify.view > parent_block.justify.view + 1, recv again returns ReceivedQCFromFuture and the lagging validator re-enters sync mode. The cycle then repeats.
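The cycle above can be reproduced with a small toy model. This is only a sketch under stated assumptions: `Block`, `steady_state_block`, and `votes_under_original_rules` are hypothetical names, not part of the actual codebase, and the model assumes a steady state in which every view produces exactly one block justified by the previous view's QC, and in which the sync peer is always still validating the newest proposal.

```rust
/// Toy block: only the two views that matter for this scenario.
/// (Hypothetical type, not the real block structure.)
#[derive(Clone, Copy, Debug)]
struct Block {
    view: u64,         // view the block was proposed in
    justify_view: u64, // view of the QC carried in `justify`
}

/// Steady-state assumption: the block proposed in `view` is justified
/// by a QC from `view - 1`.
fn steady_state_block(view: u64) -> Block {
    Block { view, justify_view: view - 1 }
}

/// Feeds `rounds` consecutive proposals to a lagging validator under the
/// original rules and returns how many of them it gets to vote on.
fn votes_under_original_rules(first_proposal_view: u64, rounds: u64) -> u64 {
    assert!(first_proposal_view >= 3, "toy model needs a parent and grandparent");
    let mut cur_view = first_proposal_view - 3; // start out lagging
    let mut votes = 0;

    for i in 0..rounds {
        let proposal = steady_state_block(first_proposal_view + i);
        if proposal.justify_view > cur_view {
            // ReceivedQCFromFuture: the proposal is thrown away and we sync.
            // The peer is still validating `proposal`, so the response only
            // reaches the proposal's parent.
            let parent = steady_state_block(proposal.view - 1);
            // Re-enter progress mode at parent_block.justify.view + 1.
            cur_view = parent.justify_view + 1;
        } else {
            votes += 1;
            cur_view = proposal.view + 1;
        }
    }
    votes
}
```

In this model, calling `votes_under_original_rules(10, 50)` never reaches the voting branch: after each sync the validator re-enters one view behind the `justify` of the next proposal it will see.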
Proposed solution
The proposed solution makes three changes:
Enter progress mode at highest_qc().view + 2 instead of highest_qc().view + 1. This makes more sense because in the steady state, highest_qc().view is the view the highest known block’s parent was proposed in, highest_qc().view + 1 is the view the highest known block was proposed in, and highest_qc().view + 2 is the view the next block should be proposed in.
Put the message which triggers a ReceivedQCFromFuture into the ProgressMessageStub’s msg_buffer instead of throwing it away by returning an error immediately.
Also return ReceivedQCFromFuture if block.justify.view == cur_view (in addition to the existing block.justify.view > cur_view case).
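The three changes can be sketched roughly as follows. This is a minimal illustration, not the real `ProgressMessageStub`: `ToyStub`, `Proposal`, `RecvError`, the `incoming` queue (standing in for the network), and `entry_view` are all hypothetical, and only the fields needed for the view comparison are modelled.

```rust
use std::collections::VecDeque;

/// Hypothetical, stripped-down proposal: just the two views involved.
#[derive(Clone, Debug, PartialEq)]
struct Proposal {
    view: u64,
    justify_view: u64,
}

#[derive(Debug, PartialEq)]
enum RecvError {
    ReceivedQCFromFuture,
    NoMessage,
}

/// Hypothetical stand-in for the message stub.
struct ToyStub {
    incoming: VecDeque<Proposal>,   // stands in for the network
    msg_buffer: VecDeque<Proposal>, // messages kept for a later view
}

impl ToyStub {
    fn recv(&mut self, cur_view: u64) -> Result<Proposal, RecvError> {
        // Serve a buffered message first if it is no longer "from the future".
        if let Some(pos) = self.msg_buffer.iter().position(|p| p.justify_view < cur_view) {
            return Ok(self.msg_buffer.remove(pos).unwrap());
        }
        let msg = self.incoming.pop_front().ok_or(RecvError::NoMessage)?;
        // Change 3: `>=` rather than `>`.
        if msg.justify_view >= cur_view {
            // Change 2: keep the message instead of dropping it.
            self.msg_buffer.push_back(msg);
            return Err(RecvError::ReceivedQCFromFuture);
        }
        Ok(msg)
    }
}

/// Change 1: view at which progress mode is entered after sync.
fn entry_view(highest_qc_view: u64) -> u64 {
    highest_qc_view + 2
}
```

For example, a proposal with `view: 10, justify_view: 9` received at `cur_view` 7 is buffered and triggers a sync; if the sync leaves the highest QC at view 8, the validator re-enters at `entry_view(8)`, which is 10, and the next `recv` hands back the buffered proposal.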
The first change causes the lagging validator to re-enter progress mode at the view block was proposed in. The second change makes it such that the lagging validator will recv the proposal containing block in this view and insert block into the block tree. The cumulative effect is that when the lagging validator receives child_block, child_block will pass safe_block and the lagging validator will be able to vote for it.
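A toy simulation can illustrate this cumulative effect. It is a sketch under simplifying assumptions, not the real protocol logic: `voted_views` and `steady_state_block` are hypothetical, the block tree is reduced to a set of views, `safe_block` is approximated by "the parent is in the block tree", and a proposal recovered from the buffer after sync is conservatively only inserted, not voted on. The `entry_offset` parameter makes it possible to compare entering at `highest_qc().view + 1` against `highest_qc().view + 2`.

```rust
use std::collections::BTreeSet;

/// Toy block: only the two views that matter here (hypothetical type).
#[derive(Clone, Copy, Debug)]
struct Block {
    view: u64,
    justify_view: u64,
}

/// Steady-state assumption: each block is justified by the previous view's QC.
fn steady_state_block(view: u64) -> Block {
    Block { view, justify_view: view - 1 }
}

/// Returns the views of the proposals the lagging validator votes on when it
/// buffers future proposals and re-enters at `highest_qc().view + entry_offset`.
fn voted_views(first_proposal_view: u64, rounds: u64, entry_offset: u64) -> Vec<u64> {
    assert!(first_proposal_view >= 3, "toy model needs a parent and grandparent");
    let mut cur_view = first_proposal_view - 3; // start out lagging
    let mut block_tree: BTreeSet<u64> = BTreeSet::new(); // views of known blocks
    let mut voted = Vec::new();

    for i in 0..rounds {
        let proposal = steady_state_block(first_proposal_view + i);
        if proposal.justify_view >= cur_view {
            // From the future: buffer the proposal and sync. The peer is
            // still validating it, so we only learn the chain up to its parent.
            let parent = steady_state_block(proposal.view - 1);
            block_tree.insert(parent.view);
            // After sync, highest_qc().view is parent.justify_view.
            cur_view = parent.justify_view + entry_offset;
            // The buffered proposal is only usable if it is no longer from
            // the future at the re-entry view.
            if proposal.justify_view < cur_view {
                block_tree.insert(proposal.view);
                cur_view = proposal.view + 1;
            }
        } else if block_tree.contains(&proposal.justify_view) {
            // Toy safe_block: the parent is known, so we can vote.
            block_tree.insert(proposal.view);
            voted.push(proposal.view);
            cur_view = proposal.view + 1;
        }
    }
    voted
}
```

In this model, an offset of 2 lets the validator insert the first buffered block and then vote on every later proposal, while an offset of 1 leaves the buffered proposal unusable at the re-entry view, so the validator keeps syncing and never votes.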
The third change is not strictly necessary for this fix, but is reasonable because block.justify.view == cur_view if and only if a consensus decision has already been reached in cur_view.