risingwavelabs / risingwave


compaction: try to optimize write-amplification for bottommost-level #4639

Open Little-Wallace opened 2 years ago

Little-Wallace commented 2 years ago

Is your feature request related to a problem? Please describe.

As title.

We can see that most of the compaction traffic goes to the bottommost level.


Little-Wallace commented 2 years ago

cc @Li0k

Little-Wallace commented 2 years ago
[screenshot: per-level compaction throughput]

We can see that compaction jobs in level 6 consume most of the throughput, and the flamegraph below shows that LZ4 compression costs most of the CPU. Since we use dynamic compression and the compression config covers no more than 4 levels, we only compress data in level 6, while level 4 and level 5 do not compress any data.

[flamegraph: CPU profile dominated by LZ4 compression]
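To make the dynamic-compression behavior concrete, here is a minimal sketch of mapping a per-level compression list onto the deepest levels. The function name and config shape are hypothetical (this is not RisingWave's actual compaction-config API), assuming a setup where only the deepest configured levels are compressed:

    /// Hypothetical helper: map a compression config onto the deepest levels.
    /// With max_level = 6 and algos = ["none", "none", "lz4"], levels 4 and 5
    /// stay uncompressed and only level 6 pays the LZ4 cost seen above.
    fn compression_for_level<'a>(level: usize, max_level: usize, algos: &[&'a str]) -> &'a str {
        // The config covers the last `algos.len()` levels; anything shallower
        // is stored uncompressed.
        let first_covered = max_level + 1 - algos.len();
        if level < first_covered {
            "none"
        } else {
            algos[level - first_covered]
        }
    }

    fn main() {
        let algos = ["none", "none", "lz4"];
        assert_eq!(compression_for_level(4, 6, &algos), "none");
        assert_eq!(compression_for_level(5, 6, &algos), "none");
        assert_eq!(compression_for_level(6, 6, &algos), "lz4");
    }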

Little-Wallace commented 2 years ago

For compaction without compression, I found that most of the CPU cost is in the S3 SDK.

[screenshot: CPU profile of the S3 SDK path]
DCjanus commented 2 years ago

> For compaction without compression, I found that most of the CPU cost is in the S3 SDK.

In this picture, we can see that SHA-256 costs a lot of CPU. ~~Maybe we should migrate from SHA-256 to a cheaper checksum algorithm, like CRC32C.~~

I found that the cost of SHA-256 comes from authorization (SigV4 signs the SHA-256 of the request payload); ref this doc.

Maybe we should try UNSIGNED-PAYLOAD instead; ref this doc.
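For context, here is an illustrative sketch of where the hash comes from (this is not the aws-sdk code path; it uses the `sha2` and `hex` crates): under SigV4 the `x-amz-content-sha256` header normally carries the SHA-256 of the whole payload, while the literal value `UNSIGNED-PAYLOAD` is accepted by S3 over HTTPS and skips the per-byte hashing entirely:

    use sha2::{Digest, Sha256};

    // `sign_payload = true` models the default SigV4 behavior; `false` models
    // the UNSIGNED-PAYLOAD option suggested above.
    fn content_sha256_header(payload: &[u8], sign_payload: bool) -> String {
        if sign_payload {
            // O(payload) hashing: this is the cost that shows up in the
            // profile for large compaction uploads.
            hex::encode(Sha256::digest(payload))
        } else {
            // Constant-cost alternative: S3 then skips payload verification.
            "UNSIGNED-PAYLOAD".to_string()
        }
    }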

Li0k commented 2 years ago

Some bench results

Background

Variables

Test 1

[screenshots: Test 1 metrics]

Test 2

[screenshots: Test 2 metrics]

Simple Conclusion

Li0k commented 2 years ago

A longer test (40 min), comparing the branch against main:

branch (18:30 ~ 19:10)

main (19:40 ~ 20:20)

[screenshots: branch vs. main metrics]
Li0k commented 2 years ago
    // Limit the number of pending compact tasks per target level: skip this
    // target level if it already has too many tasks queued.
    let target_pending_task_count =
        level_handlers[target_level].pending_tasks_ids().len();

    if target_pending_task_count >= 4 {
        tracing::info!(
            "pick_compaction pending_task deny select_level {} target_level {} target_pending_task_count {}",
            select_level, target_level, target_pending_task_count
        );

        continue;
    }

    // Estimate the write amplification of this task: total bytes written
    // (input plus overlapping target data) relative to the input size, as a
    // percentage. `select_size` is assumed non-zero here, since a compact
    // task always has at least one input file.
    let select_size: u64 = ret.input_levels[0]
        .table_infos
        .iter()
        .map(|table_info| table_info.get_file_size())
        .sum();
    let target_size: u64 = ret.input_levels[1]
        .table_infos
        .iter()
        .map(|table_info| table_info.get_file_size())
        .sum();

    let write_amplification = (select_size + target_size) * 100 / select_size;

    tracing::info!(
        "pick_compaction select_level {} target_level {} select_size {} target_size {} write_amplification {}",
        select_level, target_level, select_size, target_size, write_amplification
    );

    // Deny the task when the overlapping data in the target level is more
    // than twice the input size (write amplification above 300%).
    if write_amplification > 300 {
        tracing::info!(
            "pick_compaction write_amplification deny select_level {} target_level {} select_size {} target_size {} write_amplification {}",
            select_level, target_level, select_size, target_size, write_amplification
        );
        continue;
    }
  1. limit the compact task count per level
  2. limit the write amplification per compact task
  3. workload: tpch-Q20 (3 CN / 1 meta / 1 compactor, ~800k QPS)
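A worked example of the write-amplification gate, using the same formula as the snippet above (sizes are hypothetical, in arbitrary units):

    // WA in percent, relative to the input ("select") size.
    fn write_amplification(select_size: u64, target_size: u64) -> u64 {
        (select_size + target_size) * 100 / select_size
    }

    fn main() {
        // Input 10 vs. 25 units of overlapping bottommost data:
        // (10 + 25) * 100 / 10 = 350 > 300, so the task is denied.
        assert_eq!(write_amplification(10, 25), 350);
        // Input 10 vs. 15: WA = 250 <= 300, so the task is allowed.
        assert_eq!(write_amplification(10, 15), 250);
    }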

Bench results

Conclusion

  1. After adding the limits to compaction, more resources are left for L0, so L0 will not be blocked.
  2. Bottommost-level writes are reduced.
  3. Global write throughput is not reduced.
  4. CPU usage is smoother.