risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0
7.1k stars 586 forks source link

Investigate and avoid small SSTs in bottom level #13876

Open hzxa21 opened 12 months ago

hzxa21 commented 12 months ago

Under a certain kind of workload, even after compaction, there can be many small SSTs written to the bottom level, each of which only contains one vnode. This can potentially cause several issues:

In practice we have seen more than 10000 small SSTs in level 5 and here are the sst stats: l5..sst.stat.txt l4.sst.stat.txt

Some ideas:

  1. Pick a not so small compaction task in best effort for non-L0 compaction. If all tasks are small, proceed with the largest one. This may also prevent smalll SST trivial move. Note that this may break correctness if we still lock pending file instead of pending key range for compaction (See #11653)
  2. Add a separate picker to merge small files in bottom levels.

More ideas are welcomed.

Note that we should also reduce the time complexity of the picker algorithm (#10109) but it is independent of this issue.

hzxa21 commented 12 months ago

cc @Li0k @Little-Wallace

Li0k commented 11 months ago

We suspect that this problem comes from trivial-move. When the data has been trivial-move from l0 to base level without going through compact, it may cause a large number of small files to be located at the high level (size of level is no enough to trigger the level compaction)

Hence, two pr had been proposed

  1. avoid small file generated by vnode partition
  2. avoid small file trivial_move when using TrivialMovePicker
Little-Wallace commented 10 months ago

I think this problem has been partial solved by https://github.com/risingwavelabs/risingwave/pull/12534 because we won't move small sst files to base level in split group.

Li0k commented 6 months ago

I have been concerned about the problem of small files in sub level, and small files usually come from two issues:

  1. trivial-move
  2. split sst by some rule.

For issue 2,we are exploring two options to solve it

  1. https://github.com/risingwavelabs/risingwave/pull/16585 (Adjusting the target_file_size)
  2. https://github.com/risingwavelabs/risingwave/pull/15547 (More reliable alignment algorithm)

However, trivial-move sst is a more difficult problem in cg2/cg3. The algorithm of pick_sub_level will have more algorithmic complexity as sst stacks up at the last sub-level. Therefore, I suggest an improved strategy, when there are too many sub-level partitions in the last sub-level (a large number of SSTs have been accumulated), we should allow multiple SSTs to be selected as the base range in each loop. In Sort: when there are too many ssts in last sub-level, no longer use single sst as the base range, but should consider multiple consecutive ssts and use size to limit it.

@hzxa21 @Little-Wallace @zwang28