From: https://dev.to/creativcoder/what-is-a-lsm-tree-3d75
Over time, the number of SSTable files on disk grows as flushes keep happening due to the Memtable getting full. When reading a given key, the LSM Tree may, in the worst case, have to look into multiple SSTable files, resulting in a lot of random I/O every time a new file is accessed. For example, in Cassandra, stored values can have many columns. An update to key k at time t1 might add 4 column values (k -> (v1, v2, v3, v4)). At a later time t2, the same key k might update the 4th column value and delete the 2nd (k -> (v1, v3, v4_1)). When the same key k is read at a later time t3, Cassandra needs to aggregate the updates from both changes to build the current value for k. If these changes for the given key k are spread across multiple SSTable files, a read needs to perform a lot of random I/O to retrieve the up-to-date value. This deteriorates read performance, so there needs to be a way to combine these updates into a single file where possible. This is done by a phase called compaction.
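To make that read path concrete, here is a minimal sketch of the aggregation, assuming a hypothetical `SSTable` interface and `Fragment` type (none of these names come from this codebase), with SSTables consulted from newest to oldest:

```go
// Hypothetical sketch: reading key k when its updates are spread across
// several SSTable files. The SSTable interface and Fragment type are
// illustrative only, not this repo's actual API.
package lsm

// Fragment is a partial value for a key as written at some point in time.
type Fragment struct {
	Columns   map[string]string // column -> value
	Tombstone bool              // true if the key was deleted at this point
}

type SSTable interface {
	// Get returns the fragment stored for key, if any.
	Get(key string) (Fragment, bool)
}

// Read walks the SSTables from newest to oldest and merges fragments,
// so the newest value of each column wins.
func Read(key string, newestFirst []SSTable) (map[string]string, bool) {
	merged := map[string]string{}
	found := false
	for _, sst := range newestFirst {
		frag, ok := sst.Get(key)
		if !ok {
			continue // this file has no entry for the key
		}
		if frag.Tombstone {
			break // key was deleted here; anything older no longer counts
		}
		for col, val := range frag.Columns {
			if _, seen := merged[col]; !seen {
				merged[col] = val // a newer fragment already set this column
			}
		}
		found = true
	}
	return merged, found
}
```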
Compaction is a critical phase of the LSM Tree data management lifecycle, allowing it to be used for extended periods of time in practical systems. Think of it as a garbage collector for the LSM Tree: it keeps the disk usage and read/write performance of the LSM Tree in check. The core algorithm used by compaction is the k-way merge sort algorithm adapted to SSTables. I wrote a post on this if you are unfamiliar with the algorithm. In summary, we keep pulling the next key-value pair from the k given lists of entries (based on whatever the sort order is) and write them one by one to a new SSTable file. During the process, older entries for a key are either dropped or merged with newer entries, and keys whose most recent entry is a tombstone are deleted.
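A minimal sketch of that merge, assuming the runs are available in memory as slices of a hypothetical `Entry` type (newer runs later in the slice) rather than streamed from on-disk SSTables, could look like this:

```go
// Hedged sketch of compaction's k-way merge over sorted runs. Entry and the
// "later runs are newer" convention are assumptions for illustration; a real
// implementation would stream entries from disk, typically via a min-heap of
// per-file iterators instead of the linear scan used here.
package lsm

type Entry struct {
	Key     string
	Value   []byte
	Deleted bool // tombstone marker
}

// MergeRuns merges k sorted runs into one sorted run, keeping only the
// newest entry per key and dropping keys whose newest entry is a tombstone.
// runs[i+1] is assumed to be newer than runs[i].
func MergeRuns(runs [][]Entry) []Entry {
	cursors := make([]int, len(runs)) // one cursor per run
	var out []Entry
	for {
		// Find the smallest key among the runs; on ties, the newest run wins.
		best := -1
		for i := range runs {
			if cursors[i] >= len(runs[i]) {
				continue
			}
			k := runs[i][cursors[i]].Key
			if best == -1 || k <= runs[best][cursors[best]].Key {
				best = i
			}
		}
		if best == -1 {
			break // all runs exhausted
		}
		winner := runs[best][cursors[best]]
		// Advance every run past this key, discarding older duplicates.
		for i := range runs {
			for cursors[i] < len(runs[i]) && runs[i][cursors[i]].Key == winner.Key {
				cursors[i]++
			}
		}
		if !winner.Deleted {
			out = append(out, winner)
		}
	}
	return out
}
```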
Apart from reducing multiple I/Os, another problem compaction helps solve is that multiple old entries for a given key, or deleted tombstone entries, end up taking space on disk. This happens when a delete or an update is performed on an existing key at a later time. It then becomes essential to reclaim disk space for outdated and deleted keys, and compaction does this by discarding older or deleted entries when creating a new SSTable file.
In terms of implementation, a dedicated background thread is used to perform compaction. Compaction can either be done manually by calling an API or automatically as a background task when certain conditions are met. The compaction process can be synchronous or asynchronous depending on what consistency guarantees are configured for the LSM Tree store. The read/write performance of an LSM Tree is greatly impacted by how compaction happens. As an optimization, there may be multiple dedicated compactor threads running at a given point in time (as in RocksDB) to compact SSTable files. The compaction thread is also responsible for updating the Sparse Index after any merge, so that it points to the new offsets of keys in the new SSTable file.

Every time the size of the files in level 0 reaches the configured threshold, one or more of them are taken and merged with the files of the next level (level 1) whose keys overlap. The same happens from level 1 to level 2, and so on, when those levels reach their capacity. The metric used to decide when a level is full can differ based on the compaction strategy. Different compaction strategies are used depending on the requirements, and they determine the shape of the SSTables at each level.
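As a rough sketch of how that background trigger could be wired up, with a hypothetical `Compactable` interface standing in for whatever the store actually exposes (the threshold and tick interval are arbitrary placeholders, not values from this repo):

```go
// Hedged sketch of a background compaction trigger for level 0 -> level 1.
// Every name on the Compactable interface is a hypothetical stand-in.
package lsm

import "time"

type Compactable interface {
	LevelSize(level int) int64                      // total bytes of SSTables in a level
	PickLevel0Files() []string                      // level-0 files to compact
	Overlapping(level int, files []string) []string // files in `level` with overlapping keys
	MergeInto(level int, files []string) string     // k-way merge into a new SSTable, return its path
	UpdateSparseIndex(newFile string)               // repoint the sparse index at the new file's offsets
}

const level0Threshold = 4 << 20 // compact once level 0 exceeds ~4 MiB (arbitrary)

// StartCompactor periodically checks level sizes and compacts level 0 into
// level 1 when the threshold is crossed. Higher levels would follow the same
// pattern with their own thresholds.
func StartCompactor(db Compactable, stop <-chan struct{}) {
	ticker := time.NewTicker(10 * time.Second)
	go func() {
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				if db.LevelSize(0) < level0Threshold {
					continue
				}
				// Merge the level-0 files with every level-1 file whose
				// key range overlaps, then update the sparse index.
				srcs := db.PickLevel0Files()
				srcs = append(srcs, db.Overlapping(1, srcs)...)
				newFile := db.MergeInto(1, srcs)
				db.UpdateSparseIndex(newFile)
			}
		}
	}()
}
```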
This will also involve performing a k-way merge of sorted lists (one per SST file).
Suggest setting up tests specifically for this merge and the compaction algorithm, to start.
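For example, a table-driven test along these lines could pin down the newer-wins and tombstone behavior of the merge (again, `MergeRuns` and `Entry` refer to the hypothetical sketch above, not existing code in this repo):

```go
package lsm

import (
	"reflect"
	"testing"
)

// TestMergeRuns checks that the newest value wins for duplicate keys and
// that keys whose newest entry is a tombstone are dropped from the output.
func TestMergeRuns(t *testing.T) {
	older := []Entry{
		{Key: "a", Value: []byte("1")},
		{Key: "b", Value: []byte("2")},
		{Key: "c", Value: []byte("3")},
	}
	newer := []Entry{
		{Key: "a", Value: []byte("10")}, // overwrites a
		{Key: "b", Deleted: true},       // tombstone hides b
	}
	got := MergeRuns([][]Entry{older, newer}) // later runs are newer
	want := []Entry{
		{Key: "a", Value: []byte("10")},
		{Key: "c", Value: []byte("3")},
	}
	if !reflect.DeepEqual(got, want) {
		t.Fatalf("got %v, want %v", got, want)
	}
}
```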
The compactor is in place now and there is a separate ticket to manage compaction strategies.
Need to develop a compaction algorithm and then have a background thread periodically compact the SST files.