relab / hotstuff


Add checkpoint feature in HotStuff protocol #66

Open hanish520 opened 2 years ago

hanish520 commented 2 years ago

I am planning to add a checkpoint feature to the HotStuff protocol. I came up with two plans to implement it.

PiggyBack on the proposal

Pros

Separate Checkpointing service/module

Pros

Cons

meling commented 2 years ago

Some questions.

  1. In the Separate Checkpoint Service approach, who invokes the service? In the PiggyBack approach it is the leader.

  2. Could an alternative design be decentralized? Meaning that each replica takes a local snapshot (saves the state to local disk) every k commits, based on some configuration that can be adjusted via some state machine command. In some sense that would be similar to the PiggyBack approach, but it would not need to be adjusted every k requests. Perhaps the frequency should only be adjusted if at least 2f+1 replicas confirm. Each replica could be expected to include in some message a hash of the checkpointed state to prove it is still up-to-date.

  3. I guess the point of the checkpoint service would be to allow other replicas to query the latest state of individual replicas. Do I have this right? Or do you also want to allow external entities to retrieve the state?

Comment: I'm officially allergic to tight coupling ;-) But I also like to be pragmatic. That said, I wonder if invoking the same replicas over two different gRPC services (CheckPointing and HotStuff) would actually lead to much extra network overhead. I think the services would reuse the TCP connections, but I guess there is the packet header overhead -- which you could save by piggybacking.

hanish520 commented 2 years ago

Some questions.

  1. In the Separate Checkpoint Service approach, who invokes the service? In the PiggyBack approach it is the leader.

Every replica, after 'k' commits, can start collecting 2f+1 checkpoint requests for the same view. A small optimization: it can store previously received requests, count them as responses, and wait only for responses from the remaining replicas.
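To make that concrete, here is a minimal sketch of the collection step. The message and type names are hypothetical and not taken from the codebase; the only point is counting 2f+1 matching checkpoint requests, including ones that arrived before this replica reached its own k-th commit.

```go
// Hypothetical sketch of the quorum-collection step described above.
package checkpoint

// CheckpointMsg is a hypothetical message a replica broadcasts after k commits.
type CheckpointMsg struct {
	View      uint64 // view (or sequence number) the checkpoint covers
	StateHash string // hash of the application state at that view
	ReplicaID uint32
}

// Collector counts checkpoint messages per (view, state hash) pair.
type Collector struct {
	quorumSize int // 2f+1
	// votes[view][stateHash] = set of replicas that sent a matching message
	votes map[uint64]map[string]map[uint32]struct{}
}

func NewCollector(quorumSize int) *Collector {
	return &Collector{
		quorumSize: quorumSize,
		votes:      make(map[uint64]map[string]map[uint32]struct{}),
	}
}

// Add records a checkpoint message and reports whether 2f+1 matching messages
// now exist for that view. Messages received early are stored here as well,
// which is the small optimization mentioned above.
func (c *Collector) Add(m CheckpointMsg) (stable bool) {
	byHash, ok := c.votes[m.View]
	if !ok {
		byHash = make(map[string]map[uint32]struct{})
		c.votes[m.View] = byHash
	}
	replicas, ok := byHash[m.StateHash]
	if !ok {
		replicas = make(map[uint32]struct{})
		byHash[m.StateHash] = replicas
	}
	replicas[m.ReplicaID] = struct{}{}
	return len(replicas) >= c.quorumSize
}
```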

  2. Could an alternative design be decentralized? Meaning that each replica takes a local snapshot (saves the state to local disk) every k commits, based on some configuration that can be adjusted via some state machine command. In some sense that would be similar to the PiggyBack approach, but it would not need to be adjusted every k requests. Perhaps the frequency should only be adjusted if at least 2f+1 replicas confirm. Each replica could be expected to include in some message a hash of the checkpointed state to prove it is still up-to-date.

Yes, I guess this is a nice optimization: checkpointing would not have to happen at a fixed interval of 'k' commits, since the interval could be configured through SMR.
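A minimal sketch of that idea, assuming the interval is changed by an ordinary SMR command (the command type and field names are made up, not part of relab/hotstuff): because the command is committed by the protocol before it is executed, all correct replicas switch to the new interval at the same point in the log.

```go
// Sketch: checkpoint interval adjusted through a hypothetical SMR command.
package checkpoint

import "sync/atomic"

// SetIntervalCmd is a hypothetical command that, once committed (and therefore
// agreed on by a quorum), changes how often replicas take a local snapshot.
type SetIntervalCmd struct {
	CommitsPerCheckpoint uint64
}

type Config struct {
	interval atomic.Uint64 // current k: snapshot every k commits
}

// Apply is called when the command is executed in commit order on every replica.
func (c *Config) Apply(cmd SetIntervalCmd) {
	c.interval.Store(cmd.CommitsPerCheckpoint)
}

// ShouldCheckpoint reports whether a snapshot is due after the given commit index.
func (c *Config) ShouldCheckpoint(commitIndex uint64) bool {
	k := c.interval.Load()
	return k > 0 && commitIndex%k == 0
}
```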

  3. I guess the point of the checkpoint service would be to allow other replicas to query the latest state of individual replicas. Do I have this right? Or do you also want to allow external entities to retrieve the state?

Yes, the idea is to check whether a replica's app state is up to date; it could also act as a peer-verified starting point for replicas to rebuild the app state during reconfiguration.
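A rough sketch of that recovery path, assuming a joining or lagging replica can fetch the state behind the latest stable checkpoint from any single peer and verify it against the hash that 2f+1 replicas agreed on (all names here are hypothetical):

```go
// Sketch: verify a fetched state blob against the quorum-agreed checkpoint hash.
package checkpoint

import (
	"bytes"
	"crypto/sha256"
	"errors"
)

// StableCheckpoint is a checkpoint a quorum agreed on: the view it covers and
// the expected hash of the application state at that view.
type StableCheckpoint struct {
	View      uint64
	StateHash [32]byte
}

// Restore checks a state blob fetched from a single peer against the
// quorum-agreed hash before handing it to the application, so the replica does
// not need to trust the peer it downloaded the state from.
func Restore(cp StableCheckpoint, state []byte, apply func([]byte) error) error {
	sum := sha256.Sum256(state)
	if !bytes.Equal(sum[:], cp.StateHash[:]) {
		return errors.New("checkpoint state does not match quorum-agreed hash")
	}
	return apply(state)
}
```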

Comment: I'm officially allergic to tight coupling ;-) But I also like to be pragmatic. That said, I wonder if invoking the same replicas over two different gRPC services (CheckPointing and HotStuff) would actually lead to much extra network overhead. I think the services would reuse the TCP connections, but I guess there is the packet header overhead -- which you could save by piggybacking.

I am not sure whether to implement it as a different service or as a separate module reusing the existing service.
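To make the "separate service vs. separate module" question a bit more concrete, here is a hypothetical interface the checkpointing logic could sit behind. It says nothing about transport: the same interface could be backed by a new gRPC service or by a module that reuses the existing HotStuff networking. Nothing here is existing relab/hotstuff API.

```go
// Hypothetical checkpoint-facing API, independent of how it is exposed.
package checkpoint

import "context"

// Service is what other replicas (or an internal module) would call.
type Service interface {
	// LatestStable returns the view and state hash of the newest checkpoint
	// this replica considers stable (backed by 2f+1 matching messages).
	LatestStable(ctx context.Context) (view uint64, stateHash []byte, err error)
	// FetchState returns the application state at a stable checkpoint, which
	// the caller verifies against the agreed hash before applying it.
	FetchState(ctx context.Context, view uint64) ([]byte, error)
}
```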

johningve commented 2 years ago

I think you might be able to implement this by overriding the acceptor and command queue modules. The command queue sends a "checkpoint" command every k views or if the previous checkpoint was unsuccessful. The Proposed() method of the acceptor may be used to confirm that a QC was created for the checkpoint command.
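A loose sketch of that suggestion, using simplified stand-ins for the command queue and acceptor module interfaces (the real relab/hotstuff signatures may differ) and a made-up "checkpoint" command encoding:

```go
// Sketch: inject a checkpoint command via a wrapped command queue, and use the
// acceptor's Proposed() callback to learn that the command made it into a proposal.
package checkpoint

import (
	"context"
	"strings"
	"sync"
)

type Command string

// Simplified stand-ins for the modules mentioned above.
type CommandQueue interface {
	Get(ctx context.Context) (cmd Command, ok bool)
}
type Acceptor interface {
	Accept(cmd Command) bool
	Proposed(cmd Command)
}

// CheckpointingQueue wraps an existing command queue and emits a "checkpoint"
// command roughly every k proposals; while a checkpoint command is still
// outstanding (not yet seen in a proposal), it is re-emitted instead of a
// client command, which covers the "previous checkpoint was unsuccessful" case.
type CheckpointingQueue struct {
	inner   CommandQueue
	k       uint64
	mut     sync.Mutex
	count   uint64 // proposals since the last checkpoint command
	pending bool   // a checkpoint command is outstanding
}

func NewCheckpointingQueue(inner CommandQueue, k uint64) *CheckpointingQueue {
	return &CheckpointingQueue{inner: inner, k: k}
}

func (q *CheckpointingQueue) Get(ctx context.Context) (Command, bool) {
	q.mut.Lock()
	defer q.mut.Unlock()
	if q.pending || q.count >= q.k {
		q.count = 0
		q.pending = true
		return Command("checkpoint"), true
	}
	q.count++
	return q.inner.Get(ctx)
}

// Proposed is forwarded from the acceptor: once a proposal containing the
// checkpoint command is observed, the checkpoint is no longer pending.
func (q *CheckpointingQueue) Proposed(cmd Command) {
	if strings.HasPrefix(string(cmd), "checkpoint") {
		q.mut.Lock()
		q.pending = false
		q.mut.Unlock()
	}
}

// CheckpointAcceptor wraps an existing acceptor and notifies the queue when a
// checkpoint command has made it into a proposal.
type CheckpointAcceptor struct {
	inner Acceptor
	queue *CheckpointingQueue
}

func (a *CheckpointAcceptor) Accept(cmd Command) bool { return a.inner.Accept(cmd) }

func (a *CheckpointAcceptor) Proposed(cmd Command) {
	a.queue.Proposed(cmd)
	a.inner.Proposed(cmd)
}
```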