Handle replica in 'fall behind' scenario who is in a view change

jyellick commented 8 years ago

Replicas listen to the network chatter, to discover if they are falling behind and perform state transfer. This was outlined and reviewed here.

This moves the watermarks reliably and has been working well.

However, as @corecode pointed out, this logic sets the watermarks but does not set the view. A replica which is attempting to 'catch up' may have already initiated a view change because it felt its requests were being ignored, or the network may have moved on to some much later view.

As of PR #1111 a replica which moves its watermarks and initiates state transfer will set its current view to be active. This makes intuitive sense: if the replica was out of date, it might have mistakenly issued a view change even though the leader was behaving in a non-byzantine way. If a view change never occurs, then this replica would never begin participating in the network.

I'd like one of our PBFT protocol experts, like @vukolic to weigh in on how a replica which is behind should catch up and determine the correct view to become active in. This is related to a similar question, should there be some sort of PBFT handshaking when a replica reconnects to a network to simplify this process?

corecode commented 8 years ago

How about this: when we initiate state transfer to a specific checkpoint, we advance our low watermark in anticipation of the state transfer completing successfully. When receiving messages, we do not discard messages that are inWV, but queue them. Maybe we even process them to some extent while we are still syncing. That should allow us to observe view changes and other quorums, which should allow us to advance our view accordingly.

vukolic commented 8 years ago

This is conceptually sound.


jyellick commented 8 years ago

@corecode This is currently how the code behaves, skipTo moves the watermark, and sets the internal PBFT lastExec, and informs the executor of the state transfer. Then PBFT sends execution requests to the executor which queue in the hope that state transfer will complete before the queue size is exhausted.

@vukolic

I'm not sure how this handles the view change scenario. Take 4 replicas, one of which is temporarily partitioned. It believes it is being ignored, so it sends a view change, advances its view counter, and sets the current view to inactive. Then, when it reconnects, it will ignore most messages (because they are likely not in the current view, and outside the watermarks), but it will see checkpoints which are above its watermarks, and realize it is behind. It will find a weak checkpoint cert and attempt to initiate state transfer, but its view is currently marked as 'inactive', so it will not participate in consensus.

We can (and now do) set the current view to be active, but this will only work if the view advanced and is in our current view. If the view did not advance, or advanced too far, it seems like we are still in trouble and will not recover until a new view change takes place.
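To make the scenario concrete, here is a rough sketch of the 'realize it is behind' step. It is not the fabric code; `CheckpointWatcher` and its fields are names invented for illustration, and it assumes a weak checkpoint cert means f+1 distinct replicas reporting the same checkpoint.

```python
from collections import defaultdict


class CheckpointWatcher:
    """Hypothetical helper: spots a weak checkpoint cert above the high watermark."""

    def __init__(self, f: int, high_watermark: int):
        self.f = f                            # max number of faulty replicas
        self.high_watermark = high_watermark  # current high watermark
        self.votes = defaultdict(set)         # (seq_no, digest) -> sender ids

    def observe(self, sender: int, seq_no: int, digest: str):
        """Return (seq_no, digest) once f+1 distinct replicas agree, else None."""
        if seq_no <= self.high_watermark:
            return None  # not evidence of being behind
        key = (seq_no, digest)
        self.votes[key].add(sender)
        if len(self.votes[key]) >= self.f + 1:
            return key   # candidate target for state transfer
        return None
```

Note that nothing here looks at the view number, which is exactly the gap described above: the replica learns where to move its watermarks, but not which view to become active in.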

vukolic commented 8 years ago

@jyellick @cca88 @corecode Again, we are discussing behavior that is perfectly acceptable in the original PBFT.

We may try to mitigate this by having a replica multicast a SUSPECT message (periodically, because messages may be lost) whenever it would send VIEW-CHANGE in PBFT, as in multicast <SUSPECT, v, replicaid>.

Unlike VIEW-CHANGE, SUSPECT would not make a replica enter view v+1, but the reception of SUSPECT messages for view v (or VIEW-CHANGE for v+1) from f+1 different replicas would trigger a replica to send a VIEW-CHANGE for v+1 (here the VIEW-CHANGE msg is as in the original paper and as implemented now).

EDIT: In fact, the above should read:

"the reception of SUSPECT messages for view v (or any other message in v+1 or higher (incl. VIEW-CHANGE for v+1)) from f+1 different replicas would trigger a replica to send a VIEW-CHANGE for v+1"
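A minimal sketch of the trigger rule as restated in the EDIT, under the assumption that each message carries a sender id, a type, and a view number. `SuspectTracker` and its field names are hypothetical and do not exist in the code base.

```python
from dataclasses import dataclass, field


@dataclass
class SuspectTracker:
    """Hypothetical helper: decides when to send VIEW-CHANGE for view + 1."""
    view: int   # current view v of this replica
    f: int      # max number of faulty replicas
    evidence: set = field(default_factory=set)  # senders counted toward f+1

    def observe(self, sender: int, msg_type: str, msg_view: int) -> bool:
        """Record one message; return True once f+1 distinct replicas qualify."""
        is_suspect_for_v = msg_type == "SUSPECT" and msg_view == self.view
        is_ahead = msg_view >= self.view + 1  # any message in v+1 or higher
        if is_suspect_for_v or is_ahead:
            self.evidence.add(sender)
        return len(self.evidence) >= self.f + 1
```

Each sender is counted once, so a single replica repeating its SUSPECT (which it does periodically) cannot reach the f+1 threshold on its own.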

jyellick commented 8 years ago

@vukolic So just to concretely state the 'acceptable behavior' in a PBFT context:

A non-byzantine replica may be unable to participate in a network indefinitely, so long as the network is otherwise making progress.

Is this an accurate statement?

vukolic commented 8 years ago

@jyellick exactly. PBFT does not care about progress of an individual replica but of a system as a whole

If we care about individual replica progress we need to accommodate this in a separate way, e.g., as described above


jyellick commented 8 years ago

@vukolic I don't think that the SUSPECT proposal you outline above handles all the cases we need it to. It works so long as the view does not actually change. What if, instead, a replica loses its network connection, and therefore SUSPECTs from view 1 to view 2, but while it is away the network advances to view 3? The replica will still not participate in consensus until it next observes a view change.

In some of the discussions around adding additional nodes to the whitelist, there has been mention of forcing a view change to allow this to happen. A natural extension of this would be to do the same for a previously whitelisted replica that is returning. In this case, that replica could get a consistent perspective of the current state of the network if the network voluntarily performed a view change to accommodate its new member. For liveness purposes, the frequency with which a replica can request a view change would need to be limited, and further it seems that this view change should not actually advance the primary (as otherwise byzantine replicas could potentially coordinate by leaving and joining the network to monopolize the leader position). What do you think of this possibility?

vukolic commented 8 years ago

I expected the question, but wanted to keep the answer streamlined and to the point :) - as what you mention here is, in a sense, a separate issue.

Indeed, what I proposed above is related to the original problem described in this issue - i.e., a replica falling behind in a given view v stops participating in a view because it is the only one to send VIEW-CHANGE for v+1. The SUSPECT mechanism addresses this.

What needs to complement this SUSPECT mechanism is, indeed, a mechanism that allows a replica to "catch up" on views. To this end, a reconnecting replica that realizes the system is, say, in view v+6 needs to initiate state transfer, not a view change. Of course, state transfer in this and any other case needs to update the PBFT view number as well.

To actually detect that the system is in view number v+6, a reconnecting replica may listen to: a) commit certificates, or b) stable checkpoint certificates. It could also poll the other replicas for the most recent state, but this would be superfluous.

@jyellick Let me know if this is not clear enough or if you see any additional challenges.

cca88 commented 8 years ago

There are many implementations of Paxos out there, and PBFT is very similar to Paxos. Can one learn from the way they perform such state transfers?

jyellick commented 8 years ago

@vukolic With the SUSPECT mechanism, why would a replica know to send a SUSPECT instead of a VIEW-CHANGE?

Does PBFT rely on the non-participation of replicas which seek a view change in order to encourage the network to change views? I guess what I am getting at, is why issuing a view change request should ever suspend the replica's participation in the current view, and how this is overcome with SUSPECT?

vukolic commented 8 years ago

1) On the one hand, whenever a replica would send VIEW-CHANGE in the original PBFT - it sends a SUSPECT in the modified PBFT.

2) On the other hand, whenever a replica detects f+1 replicas progressing at least up to SUSPECT or beyond (i.e., sending SUSPECT for view v, or VIEW-CHANGE or any other message for v+1 or higher) - it sends a VIEW-CHANGE for v+1.

This is non-blocking, as the view would not effectively be changed in the original PBFT unless at least f+1 replicas send VIEW-CHANGE. This would now be replaced by 2).

SUSPECT does not advance views, VIEW-CHANGE does. VIEW-CHANGE has to advance views, as committing a request in view v after sending VIEW-CHANGE for v+1 would break invariants on which the PBFT view change relies, and open a possibility of losing committed requests.

jyellick commented 8 years ago

@vukolic So, if I am understanding this correctly, the proposal is to replace all VIEW-CHANGE messages with SUSPECT messages. This would have the effect of simply not advancing the current view, and leaving the current view marked as active (until f+1 SUSPECTs have been received).

What was the motivation in the original PBFT paper for not operating in this fashion? I understand that variants of PBFT might depend on this view change behavior, but I'm trying to understand why the classical specification would advance its view to a new inactive view, rather than wait for f+1 VIEW-CHANGE messages to do so?

vukolic commented 8 years ago

No, this is not the proposal.

The proposal is to add SUSPECT messages in addition to VIEW-CHANGE, not to replace VIEW-CHANGE. Notice that the contents of SUSPECTs and VIEW-CHANGEs are very different.

PBFT did not implement this as individual, specific replica "active" participation in a view was not the original concern, nor part of the original intentions/specification. For PBFT - it is the system progress that matters - not that of an individual replica.

jyellick commented 8 years ago

@vukolic Sorry, I missed the difference the first time around. I believe you were saying that instead of sending VIEW-CHANGE messages, a replica would send SUSPECT messages and then, once f+1 SUSPECTs accumulate, send a VIEW-CHANGE. In the non-byzantine case, this would mean f+1 VIEW-CHANGE messages, which would trigger a new view. Is that correct?

With respect to eavesdropping for the view number: so that this does not exhibit unbounded memory growth, what would you think of tracking, while catching up, the highest view attested to by each replica (via any message which contains view information), then picking the f+1st highest reported view (including multiplicity)?
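Roughly, the selection could look like the sketch below. The function names are made up for illustration and the code is not taken from the code base; it keeps one integer per replica, so memory stays bounded by the number of replicas.

```python
def record_view(highest_view_by_replica: dict, sender: int, view: int) -> None:
    """Keep only the highest view each replica has attested to."""
    current = highest_view_by_replica.get(sender, -1)
    if view > current:
        highest_view_by_replica[sender] = view


def catch_up_view(highest_view_by_replica: dict, f: int):
    """Return the f+1st highest reported view (with multiplicity), or None."""
    views = sorted(highest_view_by_replica.values(), reverse=True)
    if len(views) < f + 1:
        return None  # not enough replicas heard from yet
    return views[f]  # index f is the (f+1)st highest
```

For example, with f = 1 and reported views 9, 5, 5, 2, the result is 5. The single report of 9 could come from a byzantine replica, whereas at least f+1 replicas, so at least one honest one, have attested to view 5 or higher.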

I appreciate the simplicity of the eavesdropping techniques for catching up to the network, as it does not require any protocol modifications, but none of these will work when the network is not processing new transactions. One proposal would be for the network to periodically process a null transaction if no other transactions are present. This would allow for a replica to eventually catch up, even if the network were idle. Still, ultimately to address the white-list modification requirements, it seems like we'll need to come up with a more active approach, so I'm not entirely sold on an eavesdropping approach. What are your thoughts? I'd also invite input from @chetmurthy with his Paxos expertise.

vukolic commented 8 years ago

Not sure what non-Byzantine case means - a separate crash-tolerant protocol that we do not implement now, or a non-Byzantine execution of such a modified PBFT?

Re eavesdropping - I suggest opening a new issue while porting some of this discussion there - and continuing discussion there.

jyellick commented 8 years ago

@vukolic Sorry for the silence on this issue, some other bug fixing ended up taking priority over this.

By 'the non-byzantine case', I mean that all replicas sending SUSPECT are behaving honestly (that SUSPECT messages are truly being broadcast, and not selectively sent). So, once f+1 SUSPECT messages are sent, all replicas should have those messages, and all non-byzantine replicas would then send a VIEW-CHANGE message for a total of more than f+1 VIEW-CHANGES causing the view change.

In the byzantine case, it seems that the byzantine replicas could selectively send SUSPECT messages to attempt to get some replicas to send VIEW-CHANGE while others not, causing the network to stall until a view change. However, this would still require 1 honest replica to have sent a SUSPECT message, which would historically have been a VIEW-CHANGE message, so we should not be introducing any new attack vectors.

I've opened #1454 as you suggested to address some of the eavesdropping and catchup techniques.

cca88 commented 8 years ago

Using those additional SUSPECT messages seems more complex than the simple solutions discussed under #1454, therefore I would prefer those.

vukolic commented 8 years ago

@cca88 The discussion here is not exactly the same as #1454 - namely, this issue talks about the more severe case where a replica sends a view-change message - e.g., due to a temporary partition. At that moment that replica stops participating in a view and there is no obvious way to "put it back" into that PBFT view.

What is discussed in #1454 is a less severe case, where a replica does not send a view-change and stays in a view. In principle I agree with the mechanisms suggested there - but it should be noted that these are very similar issues of rather different severity.

My suggestion, at least in the short term, is not to implement SUSPECT and so to consciously choose not to address this particular issue: if a replica sends a view-change, is alone in doing so, and the view does not change, let it drag. The mechanisms suggested under #1454 are less invasive and could be implemented to address the case where a replica drags but does not send a view change, or where it sends a view change and the view advances.

jyellick commented 8 years ago

@vukolic @cca88 I'd agree that #1454 should be higher priority, and less invasive than the introduction of new messages via SUSPECT. I still think fixing this issue may be important in the long term, but am agreeable to setting it aside for now.

jyellick commented 8 years ago

@vukolic @cca88 @corecode @kchristidis @tuand27613

Per a Slack discussion with @vukolic, we can use an alternate mechanism to mitigate this problem. We can add a configurable parameter for how many checkpoints a leader is allowed to remain the primary, after which all replicas are obligated to send a view change.

This solves the problem stated in this issue by ensuring that the view advances, so that the network's view will catch up to the view the replica mistakenly advanced to. This is somewhat the inverse solution to SUSPECT, as SUSPECT attempts to keep that replica from advancing its view beyond the network.

According to @vukolic this is similar to "Aardvark from UT Austin". The clear drawback to this approach is that view changes necessarily kill network performance, so performing them too often is bad. Making this configurable will mitigate this, and it is important that it can also be disabled.

Finally, this enhancement has also been discussed by some users who are uncomfortable with the idea of a constant leader, regardless of the PBFT promises made for ditching him if byzantine, so this gives us extra functionality as well.
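As a rough sketch of the parameter described above (all names hypothetical, including `max_checkpoints`; the convention that 0 disables rotation is an assumption, not an existing config key):

```python
from dataclasses import dataclass


@dataclass
class PrimaryRotation:
    """Hypothetical helper: forces a view change after N checkpoints."""
    max_checkpoints: int          # assumed config value; 0 disables rotation
    checkpoints_in_view: int = 0  # stable checkpoints seen in the current view

    def on_stable_checkpoint(self) -> bool:
        """Return True when this replica is obligated to send a VIEW-CHANGE."""
        if self.max_checkpoints <= 0:
            return False  # feature disabled
        self.checkpoints_in_view += 1
        return self.checkpoints_in_view >= self.max_checkpoints

    def on_new_view(self) -> None:
        """Reset the counter once a new view becomes active."""
        self.checkpoints_in_view = 0
```

Counting checkpoints rather than wall-clock time means all honest replicas reach the threshold at the same point in the sequence, which is why the parameter is phrased in checkpoints.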

kchristidis commented 8 years ago

From here:

the reception of SUSPECT messages for view v (or any other message in v+1 or higher (incl. VIEW-CHANGE for v+1)) from f+1 different replicas would trigger a replica to send a VIEW-CHANGE for v+1

I think that statement should read: "...to send a VIEW-CHANGE for v+1 or higher."

corecode commented 8 years ago

Complicated phrasing. So far we don't jump views by eavesdropping, except for VIEW-CHANGE messages.
