Investigate and avoid small SSTs in bottom level

hzxa21 commented 12 months ago

Under a certain kind of workload, even after compaction, there can be many small SSTs written to the bottom level, each of which only contains one vnode. This can potentially cause several issues:

It can increase the time spent in compaction picker and cause compaction to be slow or stuck. (#10109)
It can cause excessive requests generated for batch and streaming query
It can generate more metadata than expected in meta store

In practice we have seen more than 10000 small SSTs in level 5 and here are the sst stats: l5..sst.stat.txt l4.sst.stat.txt

Some ideas:

Pick a not so small compaction task in best effort for non-L0 compaction. If all tasks are small, proceed with the largest one. This may also prevent smalll SST trivial move. Note that this may break correctness if we still lock pending file instead of pending key range for compaction (See #11653)
Add a separate picker to merge small files in bottom levels.

More ideas are welcomed.

Note that we should also reduce the time complexity of the picker algorithm (#10109) but it is independent of this issue.

hzxa21 commented 12 months ago

cc @Li0k @Little-Wallace

Li0k commented 11 months ago

We suspect that this problem comes from trivial-move. When the data has been trivial-move from l0 to base level without going through compact, it may cause a large number of small files to be located at the high level (size of level is no enough to trigger the level compaction)

Hence, two pr had been proposed

avoid small file generated by vnode partition
avoid small file trivial_move when using TrivialMovePicker

Little-Wallace commented 10 months ago

I think this problem has been partial solved by https://github.com/risingwavelabs/risingwave/pull/12534 because we won't move small sst files to base level in split group.

Li0k commented 6 months ago

I have been concerned about the problem of small files in sub level, and small files usually come from two issues:

trivial-move
split sst by some rule.

For issue 2，we are exploring two options to solve it

https://github.com/risingwavelabs/risingwave/pull/16585 (Adjusting the target_file_size)
https://github.com/risingwavelabs/risingwave/pull/15547 (More reliable alignment algorithm)

However, trivial-move sst is a more difficult problem in cg2/cg3. The algorithm of pick_sub_level will have more algorithmic complexity as sst stacks up at the last sub-level. Therefore, I suggest an improved strategy, when there are too many sub-level partitions in the last sub-level (a large number of SSTs have been accumulated), we should allow multiple SSTs to be selected as the base range in each loop. In Sort: when there are too many ssts in last sub-level, no longer use single sst as the base range, but should consider multiple consecutive ssts and use size to limit it.

@hzxa21 @Little-Wallace @zwang28

risingwavelabs / risingwave

Investigate and avoid small SSTs in bottom level #13876