Refactor hummock version representation in meta to enable finer-grain compaction strategy

hzxa21 commented 5 months ago

Recently we have had efficiency issues related to compaction strategy. Here are some examples on top of my head:

16965
Compaction tasks are not parallelized when there are large writes into creating state tables during backfill (tracked in #15628)
Write stall in low write throughput state table when there are high write throughput state tables in the same compaction group LSM (tracked in #15628)
...

There are two mechanisms that can potential resolve the above issue but each of them has its own drawback under the current implementation:

Compaction group split: move a state table from one LSM tree to a separate LSM tree
- Split operation is permanent, meaning that it may create small files and increase IOPS when the write throughput of a split state table is spiky. For example, we don't want to permanently generate small SSTs for a low throughput state table after backfilling is done. One can argue that we can implement group merge but it is complicated and not always feasible. For example, you cannot merge table with table_id = 2 back into a compaction group with table_ids = [1, 3] without trigger a heavy full compaction.
SST cut by table/vnode partition: align compaction output SSTs within one LSM tree with pre-defined boundary
- Compaction strategy is based on compaction group, meaning that LSM configuration, score calculation and picker strategy is per compaction group. This makes it hard to address the limitation like #16965.

In short, 1 is more general but can cause permanent overhead while 2 is more flexible but limited under the scope of one single LSM.

Therefore, currently we are exploring an idea to achieve best of both: If the hummock version representation in meta is more flexible instead of grouped by compaction group, we can adopt finer-grain compaction strategy without permanent overhead. The high-level idea is to generalize mechanism 2 to allow compaction strategy (level selector and picker) to dynamically work on different portion of hummock version within the same compaction group independently. Mechanism 1 will then be treated as the last resort and will be triggered less frequently.

The idea is pre-mature and require further discussion on how it works.

hzxa21 commented 5 months ago

cc @Li0k @Little-Wallace

Little-Wallace commented 5 months ago

I try to separate each state-table in one group as independent LSM trees in this PR: https://github.com/risingwavelabs/risingwave/pull/16919 and make the performance similar to separate groups. But I think I need to refactor hummock version structure to reduce the complex of selector.

risingwavelabs / risingwave

Refactor hummock version representation in meta to enable finer-grain compaction strategy #16971

16965