spacewalkhq / raft-rs

An understandable, fast and scalable Raft Consensus implementation
MIT License

Log Snapshots #19

Open vaibhawvipul opened 1 month ago

vaibhawvipul commented 1 month ago
dierbei commented 1 month ago

Trying to understand parts of the current implementation:

With file storage, for example, the node writes the current log entries to the snapshot file and then truncates the log.

When a new snapshot file is created, it again contains the contents of the first snapshot plus the new log entries, so the file keeps growing.

Is there a better way? Thanks for sharing.
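To make the growth problem concrete, here is a minimal sketch of the snapshot-and-truncate behaviour described above, with in-memory vectors standing in for the snapshot and log files. The `LogEntry` and `Node` names are illustrative, not the crate's actual API.

```rust
#[derive(Clone, Debug, PartialEq)]
struct LogEntry {
    index: u64,
    data: Vec<u8>,
}

struct Node {
    snapshot: Vec<LogEntry>, // stands in for the snapshot file on disk
    log: Vec<LogEntry>,      // stands in for the log file
}

impl Node {
    // Append the current log entries to the snapshot, then truncate the log.
    // Since nothing is ever removed from the snapshot, it only grows.
    fn take_snapshot(&mut self) {
        self.snapshot.append(&mut self.log); // drains `self.log`
    }
}
```

Every call moves the whole live log into the snapshot, which is exactly why the snapshot file grows without bound.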

rsundar commented 1 month ago

Here is my idea so far,

We want a snapshot that is quick to read, verify, and update while keeping its size somewhat reasonable. The snapshot will grow with the log, but if the growth rate is controlled, a node can keep its hardware requirements reasonable.

My idea is as below:

  1. Define a cooldown period where a log entry is assumed to be finalized.
  2. Finalized log entries are written to the snapshot, and the log entry is removed from the log file.
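The cooldown idea above could be sketched roughly as follows; `TimedEntry`, `compact`, and the cooldown parameter are hypothetical names, not anything from the codebase. Entries older than the cooldown are treated as finalized and moved out of the log into the snapshot.

```rust
use std::time::{Duration, Instant};

struct TimedEntry {
    appended_at: Instant, // when the entry was appended locally
    data: Vec<u8>,
}

// Move every entry older than `cooldown` from the log into the snapshot.
fn compact(log: &mut Vec<TimedEntry>, snapshot: &mut Vec<Vec<u8>>, cooldown: Duration) {
    let now = Instant::now();
    // Entries are appended in order, so finalized entries form a prefix:
    // find the first entry still inside the cooldown window.
    let split = log
        .iter()
        .position(|e| now.duration_since(e.appended_at) < cooldown)
        .unwrap_or(log.len());
    for e in log.drain(..split) {
        snapshot.push(e.data); // finalized entry leaves the log file
    }
}
```

Note that in real Raft, "finalized" would more naturally mean "committed and applied" rather than simply old, so the cooldown is a heuristic stand-in for the commit index.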

Questions I have:

Should the snapshot be synchronised across different nodes or not?

dierbei commented 1 month ago

> Questions I have:
>
> Should the snapshot be synchronised across different nodes or not?

My idea is that each node keeps its own snapshot.

If a node's data turns out to be wrong, it first clears its self-generated data (snapshot + log entries), then sends a request to the leader, and the leader sends its data (snapshot + log entries) back to that node.
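That recovery flow might look like the sketch below, where `fetch_from_leader` is a hypothetical stand-in for the RPC that returns the leader's snapshot and log.

```rust
#[derive(Clone, Debug, PartialEq)]
struct State {
    snapshot: Vec<u8>,
    log: Vec<u8>,
}

// Recover a node whose local data is suspect by discarding it and
// replacing it wholesale with the leader's copy.
fn recover(local: &mut State, fetch_from_leader: impl Fn() -> State) {
    // 1. Clear the self-generated (possibly corrupt) data.
    local.snapshot.clear();
    local.log.clear();
    // 2. Replace it with the leader's snapshot + log. (The overwrite makes
    //    the clears redundant here, but mirrors the two steps described.)
    *local = fetch_from_leader();
}
```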

dierbei commented 1 month ago

> My idea is as below:
>
>   1. Define a cooldown period where a log entry is assumed to be finalized.
>   2. Finalized log entries are written to the snapshot, and the log entry is removed from the log file.

Yes, I have also thought about taking a snapshot at regular intervals.

Another idea is to take a snapshot whenever the log reaches a certain number of entries, for example every thousand entries.
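The count-based trigger is simple to sketch; the names here are illustrative.

```rust
// Snapshot once the log holds at least `threshold` entries; returns
// whether a snapshot was actually taken.
fn maybe_snapshot(log: &mut Vec<u64>, snapshot: &mut Vec<u64>, threshold: usize) -> bool {
    if log.len() >= threshold {
        snapshot.extend(log.drain(..)); // move everything, then the log is empty
        true
    } else {
        false
    }
}
```

This check would typically run after each append, so the log never exceeds the threshold by more than one entry.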

dierbei commented 1 month ago

> We want a snapshot that is quick to read, verify, and update while keeping its size somewhat reasonable. The snapshot will grow with the log, but if the growth rate is controlled, a node can keep its hardware requirements reasonable.

My current idea is to support two ways of storing it; I'm not sure whether both are feasible.

  1. File storage: the snapshot file stores all data as of the current point in time (everything since the program started).

  2. An embedded database (I don't know whether this is feasible; I haven't seen an implementation).
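One way to keep both options open is to make the node generic over a storage trait. This is a sketch under assumed names; the `FileStore` and `KvStore` below are in-memory stand-ins for a real file and a real embedded database.

```rust
use std::collections::BTreeMap;

trait SnapshotStore {
    fn save(&mut self, data: &[u8]);
    fn load(&self) -> Option<Vec<u8>>;
}

// Option 1: a single file holding the full state at snapshot time.
struct FileStore {
    contents: Option<Vec<u8>>, // stands in for the file on disk
}

impl SnapshotStore for FileStore {
    fn save(&mut self, data: &[u8]) {
        self.contents = Some(data.to_vec()); // full rewrite every time
    }
    fn load(&self) -> Option<Vec<u8>> {
        self.contents.clone()
    }
}

// Option 2: an embedded key-value store (modeled with a BTreeMap);
// keyed writes make incremental snapshots possible.
struct KvStore {
    map: BTreeMap<u64, Vec<u8>>,
    next: u64,
}

impl SnapshotStore for KvStore {
    fn save(&mut self, data: &[u8]) {
        self.map.insert(self.next, data.to_vec());
        self.next += 1;
    }
    fn load(&self) -> Option<Vec<u8>> {
        self.map.values().last().cloned() // latest snapshot wins
    }
}
```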

rsundar commented 3 weeks ago

> My idea is as below:
>
>   1. Define a cooldown period where a log entry is assumed to be finalized.
>   2. Finalized log entries are written to the snapshot, and the log entry is removed from the log file.
>
> Yes, I have also thought about taking a snapshot at regular intervals.
>
> Another idea is to take a snapshot whenever the log reaches a certain number of entries, for example every thousand entries.

I like this idea a lot. One method could be to keep checkpoints based on finalized log entries, so that if a node falls behind, it can simply request the finalized data from a neighbouring node, starting at the checkpoint where it fell behind.

Taking ideas from public blockchains, I would like to implement this for the snapshots:

https://dl.acm.org/doi/10.1145/3651890.3672219

I would store the snapshot as a key-value pair in something like a local RocksDB instance.

The key is a monotonically increasing checkpoint id (could be a UTC timestamp) and the value is the actual contents of the log entries that have been 'finalized'. The rateless invertible Bloom filter would be built on coded symbols encoding the difference between the local and remote copies of the set, allowing for set reconciliation between the local node and the remote node.
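The checkpoint layout could look like this sketch, with a `BTreeMap` standing in for the local RocksDB instance (RocksDB's ordered keys support the same range scan). All names here are hypothetical.

```rust
use std::collections::BTreeMap;

struct CheckpointStore {
    // checkpoint id -> finalized log entries for that checkpoint
    db: BTreeMap<u64, Vec<Vec<u8>>>,
}

impl CheckpointStore {
    fn put(&mut self, checkpoint_id: u64, finalized: Vec<Vec<u8>>) {
        self.db.insert(checkpoint_id, finalized);
    }

    // Everything from `from_id` onward, for a node catching up after
    // falling behind at that checkpoint.
    fn since(&self, from_id: u64) -> Vec<(u64, Vec<Vec<u8>>)> {
        self.db.range(from_id..).map(|(k, v)| (*k, v.clone())).collect()
    }
}
```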

Essentially the flow would be:

  1. Open a network connection with the current leader
  2. Check the set difference by requesting a stream of coded symbols
  3. The local node calculates the difference and updates its own set
  4. Close the connection.
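The steps above can be sketched in simplified form. The real proposal streams coded symbols of a rateless invertible Bloom filter; here the set difference is computed directly on `HashSet`s so the overall reconciliation flow is visible without the coding machinery.

```rust
use std::collections::HashSet;

// Simplified stand-in for steps 2-3: find what the leader has that the
// local node is missing, and fold it into the local set.
fn reconcile(local: &mut HashSet<u64>, remote: &HashSet<u64>) -> Vec<u64> {
    let mut missing: Vec<u64> = remote.difference(local).cloned().collect();
    missing.sort(); // deterministic order for the caller
    for id in &missing {
        local.insert(*id); // step 3: local node updates its own set
    }
    missing
}
```

The point of the rateless IBLT is that this difference can be found while transferring data proportional to the size of the difference, not the size of the sets.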
dierbei commented 3 weeks ago

> I would store the snapshot as a key-value pair in something like a local RocksDB instance.

Sounds good, using a database would allow for incremental synchronization.

I've been planning to go ahead and implement file snapshots, but I've been busy lately.