Problem Overview
The current implementation of the db checkpoint feature has a synchronization bug:
While the db checkpoint is taken in the background, nothing is aligned with the checkpoint sequence number, e.g. the block number, the bft metadata, the pending reserved pages, and more.
Once client requests are submitted at a resolution other than 150, a wide set of issues starts to appear. For example:
The recovered replica won't start from a stable checkpoint; instead, it starts from the point where the db checkpoint was taken.
If the checkpoint was taken in the middle of another execution phase, we won't have the pending reserved pages needed to recover correctly.
We trim the blocks at the point where the db checkpoint was taken, but we don't update the bft metadata accordingly.
Below is an example of part of these issues:
On replica 0, block 302 was created at sequence number 304.
However, on recovery, the recovered replica has the same block, but created at sequence number 305:
In this PR we propose a wide change that fixes the problem.
Until now, the decision of whether to create a db checkpoint was made by the primary only. If the primary decided that it was time to create a db checkpoint, it sent a bft command whose execution creates a DB checkpoint.
(Note that this approach, regardless of the above bugs, is not safe against DoS attacks: a malicious primary can order the replicas to continuously create db checkpoints.)
Here we propose a different solution: the decision to create a db checkpoint is based on a deterministic local event (such as how much time has passed since the last created db checkpoint).
This way, once the decision is made, a db checkpoint creation callback is registered on the stable sequence number event. When the replica reaches this stable sequence number, it starts creating the db checkpoint asynchronously, but now the block number is aligned with the sequence number because the checkpoint is taken right after that sequence number's execution.
To make the above feasible, (1) we cannot rely on local timeouts (instead, we consider only the time received via consensus), and (2) the db checkpoint metadata (such as the sequence number and timestamp) has to be shared among all replicas (via reserved pages).
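The scheduling logic described above can be sketched roughly as follows. This is a minimal, hypothetical sketch: `DbCheckpointScheduler` and every name in it are illustrative, not the actual codebase's API. It shows the two key points of the design: the decision is driven by consensus time (deterministic across replicas), while the creation itself is deferred to the next stable sequence number so that all metadata is aligned with it.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical sketch, not the real API: a checkpoint becomes "pending"
// based on consensus time, and is created only at a stable sequence number.
class DbCheckpointScheduler {
 public:
  using SeqNum = uint64_t;
  using Timestamp = uint64_t;  // consensus time, never local wall-clock

  DbCheckpointScheduler(Timestamp interval,
                        std::function<void(SeqNum)> createCheckpoint)
      : interval_(interval), createCheckpoint_(std::move(createCheckpoint)) {}

  // Called on execution with the time carried by consensus. Deterministic:
  // every replica observes the same timestamps, so every replica flips
  // pending_ at the same point in the request sequence.
  void onConsensusTime(Timestamp now) {
    if (now - lastCheckpointTime_ >= interval_) pending_ = true;
  }

  // Called when a sequence number becomes stable. If a checkpoint is due,
  // it is created right after this sequence number's execution, so the
  // block number and bft metadata are aligned with seqNum.
  void onStableSeqNum(SeqNum seqNum, Timestamp now) {
    if (!pending_) return;
    pending_ = false;
    lastCheckpointTime_ = now;  // in the real system this metadata would be
                                // shared across replicas via reserved pages
    createCheckpoint_(seqNum);  // asynchronous in the real implementation
  }

 private:
  Timestamp interval_;
  Timestamp lastCheckpointTime_ = 0;
  bool pending_ = false;
  std::function<void(SeqNum)> createCheckpoint_;
};
```

Because `pending_` is derived only from consensus time and the shared checkpoint metadata, no replica can be ordered into extra checkpoints by a malicious primary, and every honest replica creates its checkpoint at the same sequence number.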
Testing Done
CI + Changing an existing test to verify the changes