ndragazis / tinykv

Simple key value store based on Seastar
Apache License 2.0
0 stars 0 forks source link

[storage] Store data on disk #4

Open ndragazis opened 1 week ago

ndragazis commented 1 week ago

I need to define how the key-value store shall persist data on disk. I also need to explore the API that seastar provides for disk IO operations.

Things to consider:

ndragazis commented 1 week ago

Popular Strategies

Reading from Martin Kleppmann's book "Designing Data Intensive Applications" (Chapter 3), there are two categories for storage engines:

The log-structured storage engines store the data in log-structured files on disk and construct in-memory indexes to accelerate lookups.

The page-oriented storage engines store data in sorted order in fixed-size pages on disk and maintain a persistent B-tree for fast lookups.

The main difference is that page-structured storage engines perform updates in-place, whereas log-structured storage engines perform append-only operations and remove stale values in later stages (log merging, log compaction).

ndragazis commented 1 week ago

I will follow the LSM-tree approach, which is closer to Scylla.

High-level Plan:

ndragazis commented 1 week ago

Memtable Flushing

When the memtable reaches some size, the storage engine will be doing the following:

  1. Create a new mutable memtable and WAL to serve writes.
  2. Mark the old memtable as immutable to only serve reads.
  3. Start an asynchronous operation to flush the old memtable to the disk as an SSTable.
  4. When flushing finishes, delete the old memtable and WAL.

What happens if the new memtable gets full before the old one is fully flushed? One strategy is to support having more than one immutable memtables, see [1]. However, it is crucial to flush immutable memtables in-order since memtables are searched before SSTables.

[1] https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide

ndragazis commented 1 week ago

File Naming Strategy

Let's assume we allow the system to have multiple immutable memtables and WALs being flushed in parallel. What names should I use for the WAL files?

Requirements:

Same problem applies to SSTables as well.


Cassandra used to solve this using natural numbers, and later they switched into Time UUIDs.

source: https://cassandra.apache.org/_/blog/Apache-Cassandra-4.1-New-SSTable-Identifiers.html


As a first iteration, I will suffix WAL and SSTable filenames with a counter. So, the file structure should look like this:

.tinykv/
├── sstable_1
├── sstable_2
├── wal
└── wal_1
ndragazis commented 6 days ago

Deleting Keys

Deleting keys in LSM-trees is not trivial.

If a key lives only in the mutable memtable, then we can just remove it from the memtable and we are done. However, it could be that it also appears in an immutable memtable or an SSTable.

The solution is to record deleted keys via a deletion marker, also called a "tombstone".