Open ndragazis opened 1 week ago
Reading from Martin Kleppmann's book "Designing Data Intensive Applications" (Chapter 3), there are two categories for storage engines:
The log-structured storage engines store the data in log-structured files on disk and construct in-memory indexes to accelerate lookups.
The page-oriented storage engines store data in sorted order in fixed-size pages on disk and maintain a persistent B-tree for fast lookups.
The main difference is that page-structured storage engines perform updates in-place, whereas log-structured storage engines perform append-only operations and remove stale values in later stages (log merging, log compaction).
I will follow the LSM-tree approach, which is closer to Scylla.
High-level Plan:
When the memtable reaches some size, the storage engine will be doing the following:
What happens if the new memtable gets full before the old one is fully flushed? One strategy is to support having more than one immutable memtables, see [1]. However, it is crucial to flush immutable memtables in-order since memtables are searched before SSTables.
[1] https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
Let's assume we allow the system to have multiple immutable memtables and WALs being flushed in parallel. What names should I use for the WAL files?
Requirements:
Same problem applies to SSTables as well.
Cassandra used to solve this using natural numbers, and later they switched into Time UUIDs.
source: https://cassandra.apache.org/_/blog/Apache-Cassandra-4.1-New-SSTable-Identifiers.html
As a first iteration, I will suffix WAL and SSTable filenames with a counter. So, the file structure should look like this:
.tinykv/
├── sstable_1
├── sstable_2
├── wal
└── wal_1
Deleting keys in LSM-trees is not trivial.
If a key lives only in the mutable memtable, then we can just remove it from the memtable and we are done. However, it could be that it also appears in an immutable memtable or an SSTable.
The solution is to record deleted keys via a deletion marker, also called a "tombstone".
I need to define how the key-value store shall persist data on disk. I also need to explore the API that seastar provides for disk IO operations.
Things to consider: