relab / hotstuff


Add checkpoint feature in HotStuff protocol #66

Open hanish520 opened 2 years ago

hanish520 commented 2 years ago

I am planning to add a checkpoint feature to the HotStuff protocol. I came up with two plans to implement it.

PiggyBack on the proposal

Pros

Separate Checkpointing service/module

Pros

Cons

meling commented 2 years ago

Some questions.

  1. In the Separate Checkpoint Service approach, who invokes the service? In the PiggyBack approach it is the leader.

  2. Could an alternative design be decentralized? Meaning that each replica takes a local snapshot (saves the state to local disk) every k commits, based on some configuration that can be adjusted via some state machine command. In some sense that would be similar to the PiggyBack approach, but it would not need to be adjusted every k requests. Perhaps the frequency should only be adjusted if at least 2f+1 replicas confirm. Each replica could be expected to include in some message a hash of the checkpointed state to prove it is still up-to-date.

  3. I guess the point of the checkpoint service would be to allow other replicas to query the latest state of individual replicas. Do I have this right? Or do you also want to allow external entities to retrieve the state?

Comment: I'm officially allergic to tight coupling ;-) But I also like to be pragmatic. That said, I wonder if invoking the same replicas over two different gRPC services (CheckPointing and HotStuff) would actually lead to much extra network overhead. I think the services would reuse the TCP connections, but I guess there is the packet header overhead -- which you could save by piggybacking.

hanish520 commented 2 years ago

Some questions.

  1. In the Separate Checkpoint Service approach, who invokes the service? In the PiggyBack approach it is the leader.

Every replica, after 'k' commits, can start collecting 2f+1 checkpoint requests for the same view. A small optimization: it can store previously received requests, count them as responses, and wait only for responses from the remaining replicas.
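To make that concrete, here is a minimal sketch of the collection step. The message and type names are hypothetical and not taken from the codebase; the only point is counting 2f+1 matching checkpoint requests, including ones that arrived before this replica reached its own k-th commit.

```go
// Hypothetical sketch of the quorum-collection step described above.
package checkpoint

// CheckpointMsg is a hypothetical message a replica broadcasts after k commits.
type CheckpointMsg struct {
	View      uint64 // view (or sequence number) the checkpoint covers
	StateHash string // hash of the application state at that view
	ReplicaID uint32
}

// Collector counts checkpoint messages per (view, state hash) pair.
type Collector struct {
	quorumSize int // 2f+1
	// votes[view][stateHash] = set of replicas that sent a matching message
	votes map[uint64]map[string]map[uint32]struct{}
}

func NewCollector(quorumSize int) *Collector {
	return &Collector{
		quorumSize: quorumSize,
		votes:      make(map[uint64]map[string]map[uint32]struct{}),
	}
}

// Add records a checkpoint message and reports whether 2f+1 matching messages
// now exist for that view. Messages received early are stored here as well,
// which is the small optimization mentioned above.
func (c *Collector) Add(m CheckpointMsg) (stable bool) {
	byHash, ok := c.votes[m.View]
	if !ok {
		byHash = make(map[string]map[uint32]struct{})
		c.votes[m.View] = byHash
	}
	replicas, ok := byHash[m.StateHash]
	if !ok {
		replicas = make(map[uint32]struct{})
		byHash[m.StateHash] = replicas
	}
	replicas[m.ReplicaID] = struct{}{}
	return len(replicas) >= c.quorumSize
}
```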

  2. Could an alternative design be decentralized? Meaning that each replica takes a local snapshot (saves the state to local disk) every k commits, based on some configuration that can be adjusted via some state machine command. In some sense that would be similar to the PiggyBack approach, but it would not need to be adjusted every k requests. Perhaps the frequency should only be adjusted if at least 2f+1 replicas confirm. Each replica could be expected to include in some message a hash of the checkpointed state to prove it is still up-to-date.

Yes, I guess this is a nice optimization: checkpointing would not have to happen at a fixed interval of 'k' commits, since the interval could be configured through SMR.
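A minimal sketch of that idea, assuming the interval is changed by an ordinary SMR command (the command type and field names are made up, not part of relab/hotstuff): because the command is committed by the protocol before it is executed, all correct replicas switch to the new interval at the same point in the log.

```go
// Sketch: checkpoint interval adjusted through a hypothetical SMR command.
package checkpoint

import "sync/atomic"

// SetIntervalCmd is a hypothetical command that, once committed (and therefore
// agreed on by a quorum), changes how often replicas take a local snapshot.
type SetIntervalCmd struct {
	CommitsPerCheckpoint uint64
}

type Config struct {
	interval atomic.Uint64 // current k: snapshot every k commits
}

// Apply is called when the command is executed in commit order on every replica.
func (c *Config) Apply(cmd SetIntervalCmd) {
	c.interval.Store(cmd.CommitsPerCheckpoint)
}

// ShouldCheckpoint reports whether a snapshot is due after the given commit index.
func (c *Config) ShouldCheckpoint(commitIndex uint64) bool {
	k := c.interval.Load()
	return k > 0 && commitIndex%k == 0
}
```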

  3. I guess the point of the checkpoint service would be to allow other replicas to query the latest state of individual replicas. Do I have this right? Or do you also want to allow external entities to retrieve the state?

Yes, the idea is to check whether a replica's app state is up to date; it could also act as a peer-verified starting point for replicas to rebuild the app state during reconfiguration.
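A rough sketch of that recovery path, assuming a joining or lagging replica can fetch the state behind the latest stable checkpoint from any single peer and verify it against the hash that 2f+1 replicas agreed on (all names here are hypothetical):

```go
// Sketch: verify a fetched state blob against the quorum-agreed checkpoint hash.
package checkpoint

import (
	"bytes"
	"crypto/sha256"
	"errors"
)

// StableCheckpoint is a checkpoint a quorum agreed on: the view it covers and
// the expected hash of the application state at that view.
type StableCheckpoint struct {
	View      uint64
	StateHash [32]byte
}

// Restore checks a state blob fetched from a single peer against the
// quorum-agreed hash before handing it to the application, so the replica does
// not need to trust the peer it downloaded the state from.
func Restore(cp StableCheckpoint, state []byte, apply func([]byte) error) error {
	sum := sha256.Sum256(state)
	if !bytes.Equal(sum[:], cp.StateHash[:]) {
		return errors.New("checkpoint state does not match quorum-agreed hash")
	}
	return apply(state)
}
```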

Comment: I'm officially allergic to tight coupling ;-) But I also like to be pragmatic. That said, I wonder if invoking the same replicas over two different gRPC services (CheckPointing and HotStuff) would actually lead to much extra network overhead. I think the services would reuse the TCP connections, but I guess there is the packet header overhead -- which you could save by piggybacking.

I am not sure whether to implement it as a different service or as a separate module reusing the existing service.
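To make the "separate service vs. separate module" question a bit more concrete, here is a hypothetical interface the checkpointing logic could sit behind. It says nothing about transport: the same interface could be backed by a new gRPC service or by a module that reuses the existing HotStuff networking. Nothing here is existing relab/hotstuff API.

```go
// Hypothetical checkpoint-facing API, independent of how it is exposed.
package checkpoint

import "context"

// Service is what other replicas (or an internal module) would call.
type Service interface {
	// LatestStable returns the view and state hash of the newest checkpoint
	// this replica considers stable (backed by 2f+1 matching messages).
	LatestStable(ctx context.Context) (view uint64, stateHash []byte, err error)
	// FetchState returns the application state at a stable checkpoint, which
	// the caller verifies against the agreed hash before applying it.
	FetchState(ctx context.Context, view uint64) ([]byte, error)
}
```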

johningve commented 2 years ago

I think you might be able to implement this by overriding the acceptor and command queue modules. The command queue sends a "checkpoint" command every k views or if the previous checkpoint was unsuccessful. The Proposed() method of the acceptor may be used to confirm that a QC was created for the checkpoint command.
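A loose sketch of that suggestion, using simplified stand-ins for the command queue and acceptor module interfaces (the real relab/hotstuff signatures may differ) and a made-up "checkpoint" command encoding:

```go
// Sketch: inject a checkpoint command via a wrapped command queue, and use the
// acceptor's Proposed() callback to learn that the command made it into a proposal.
package checkpoint

import (
	"context"
	"strings"
	"sync"
)

type Command string

// Simplified stand-ins for the modules mentioned above.
type CommandQueue interface {
	Get(ctx context.Context) (cmd Command, ok bool)
}
type Acceptor interface {
	Accept(cmd Command) bool
	Proposed(cmd Command)
}

// CheckpointingQueue wraps an existing command queue and emits a "checkpoint"
// command roughly every k proposals; while a checkpoint command is still
// outstanding (not yet seen in a proposal), it is re-emitted instead of a
// client command, which covers the "previous checkpoint was unsuccessful" case.
type CheckpointingQueue struct {
	inner   CommandQueue
	k       uint64
	mut     sync.Mutex
	count   uint64 // proposals since the last checkpoint command
	pending bool   // a checkpoint command is outstanding
}

func NewCheckpointingQueue(inner CommandQueue, k uint64) *CheckpointingQueue {
	return &CheckpointingQueue{inner: inner, k: k}
}

func (q *CheckpointingQueue) Get(ctx context.Context) (Command, bool) {
	q.mut.Lock()
	defer q.mut.Unlock()
	if q.pending || q.count >= q.k {
		q.count = 0
		q.pending = true
		return Command("checkpoint"), true
	}
	q.count++
	return q.inner.Get(ctx)
}

// Proposed is forwarded from the acceptor: once a proposal containing the
// checkpoint command is observed, the checkpoint is no longer pending.
func (q *CheckpointingQueue) Proposed(cmd Command) {
	if strings.HasPrefix(string(cmd), "checkpoint") {
		q.mut.Lock()
		q.pending = false
		q.mut.Unlock()
	}
}

// CheckpointAcceptor wraps an existing acceptor and notifies the queue when a
// checkpoint command has made it into a proposal.
type CheckpointAcceptor struct {
	inner Acceptor
	queue *CheckpointingQueue
}

func (a *CheckpointAcceptor) Accept(cmd Command) bool { return a.inner.Accept(cmd) }

func (a *CheckpointAcceptor) Proposed(cmd Command) {
	a.queue.Proposed(cmd)
	a.inner.Proposed(cmd)
}
```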