risingwavelabs / risingwave


compaction: try to optimize write-amplification for bottommost-level #4639

Open Little-Wallace opened 2 years ago

Little-Wallace commented 2 years ago

Is your feature request related to a problem? Please describe.

As title.

We can see that most of the compaction traffic goes to the bottommost level.


Little-Wallace commented 2 years ago

cc @Li0k

Little-Wallace commented 2 years ago
[screenshot: per-level compaction throughput]

We can see that compaction jobs in level 6 consume most of the throughput, and the flamegraph below shows that LZ4 compression costs most of the CPU. Since we use dynamic compression and the compression config covers no more than 4 levels, we only compress data in level 6, while level 4 and level 5 do not compress any data.

[flamegraph: CPU profile dominated by LZ4 compression]
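To make the dynamic-compression behavior concrete, here is a minimal sketch of mapping a per-level compression list onto the deepest levels. The function name and config shape are hypothetical (this is not RisingWave's actual compaction-config API), assuming a setup where only the deepest configured levels are compressed:

    /// Hypothetical helper: map a compression config onto the deepest levels.
    /// With max_level = 6 and algos = ["none", "none", "lz4"], levels 4 and 5
    /// stay uncompressed and only level 6 pays the LZ4 cost seen above.
    fn compression_for_level<'a>(level: usize, max_level: usize, algos: &[&'a str]) -> &'a str {
        // The config covers the last `algos.len()` levels; anything shallower
        // is stored uncompressed.
        let first_covered = max_level + 1 - algos.len();
        if level < first_covered {
            "none"
        } else {
            algos[level - first_covered]
        }
    }

    fn main() {
        let algos = ["none", "none", "lz4"];
        assert_eq!(compression_for_level(4, 6, &algos), "none");
        assert_eq!(compression_for_level(5, 6, &algos), "none");
        assert_eq!(compression_for_level(6, 6, &algos), "lz4");
    }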

Little-Wallace commented 2 years ago

For compaction without compression, I found that most of the CPU cost is in the S3 SDK.

[screenshot: CPU profile of the S3 SDK path]
DCjanus commented 2 years ago

> For compaction without compression, I found that most of the CPU cost is in the S3 SDK.

In this picture, we can see that SHA-256 costs a lot of CPU. ~~Maybe we should migrate from SHA-256 to a cheaper checksum algorithm, like CRC32C.~~

I found that the cost of SHA-256 comes from authorization (SigV4 signs the SHA-256 of the request payload); ref this doc.

Maybe we should try UNSIGNED-PAYLOAD instead; ref this doc.
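For context, here is an illustrative sketch of where the hash comes from (this is not the aws-sdk code path; it uses the `sha2` and `hex` crates): under SigV4 the `x-amz-content-sha256` header normally carries the SHA-256 of the whole payload, while the literal value `UNSIGNED-PAYLOAD` is accepted by S3 over HTTPS and skips the per-byte hashing entirely:

    use sha2::{Digest, Sha256};

    // `sign_payload = true` models the default SigV4 behavior; `false` models
    // the UNSIGNED-PAYLOAD option suggested above.
    fn content_sha256_header(payload: &[u8], sign_payload: bool) -> String {
        if sign_payload {
            // O(payload) hashing: this is the cost that shows up in the
            // profile for large compaction uploads.
            hex::encode(Sha256::digest(payload))
        } else {
            // Constant-cost alternative: S3 then skips payload verification.
            "UNSIGNED-PAYLOAD".to_string()
        }
    }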

Li0k commented 2 years ago

Some bench results

Background

Variables

Test 1

[screenshots: Test 1 metrics]

Test 2

[screenshots: Test 2 metrics]

Simple Conclusion

Li0k commented 2 years ago

A longer test (40 min), comparing the branch against main:

branch (18:30 ~ 19:10)

main (19:40 ~ 20:20)

[screenshots: branch vs. main metrics]
Li0k commented 2 years ago
    // Limit the number of pending compact tasks per target level: skip this
    // target level if it already has too many tasks queued.
    let target_pending_task_count =
        level_handlers[target_level].pending_tasks_ids().len();

    if target_pending_task_count >= 4 {
        tracing::info!(
            "pick_compaction pending_task deny select_level {} target_level {} target_pending_task_count {}",
            select_level, target_level, target_pending_task_count
        );

        continue;
    }

    // Estimate the write amplification of this task: total bytes written
    // (input plus overlapping target data) relative to the input size, as a
    // percentage. `select_size` is assumed non-zero here, since a compact
    // task always has at least one input file.
    let select_size: u64 = ret.input_levels[0]
        .table_infos
        .iter()
        .map(|table_info| table_info.get_file_size())
        .sum();
    let target_size: u64 = ret.input_levels[1]
        .table_infos
        .iter()
        .map(|table_info| table_info.get_file_size())
        .sum();

    let write_amplification = (select_size + target_size) * 100 / select_size;

    tracing::info!(
        "pick_compaction select_level {} target_level {} select_size {} target_size {} write_amplification {}",
        select_level, target_level, select_size, target_size, write_amplification
    );

    // Deny the task when the overlapping data in the target level is more
    // than twice the input size (write amplification above 300%).
    if write_amplification > 300 {
        tracing::info!(
            "pick_compaction write_amplification deny select_level {} target_level {} select_size {} target_size {} write_amplification {}",
            select_level, target_level, select_size, target_size, write_amplification
        );
        continue;
    }
  1. limit the compact task count per level
  2. limit the write amplification per compact task
  3. workload: tpch-Q20 (3 CN / 1 meta / 1 compactor, ~800k QPS)
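A worked example of the write-amplification gate, using the same formula as the snippet above (sizes are hypothetical, in arbitrary units):

    // WA in percent, relative to the input ("select") size.
    fn write_amplification(select_size: u64, target_size: u64) -> u64 {
        (select_size + target_size) * 100 / select_size
    }

    fn main() {
        // Input 10 vs. 25 units of overlapping bottommost data:
        // (10 + 25) * 100 / 10 = 350 > 300, so the task is denied.
        assert_eq!(write_amplification(10, 25), 350);
        // Input 10 vs. 15: WA = 250 <= 300, so the task is allowed.
        assert_eq!(write_amplification(10, 15), 250);
    }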

Bench results

Conclusion

  1. After adding the limits to compaction, more resources are left for L0, so L0 will not be blocked.
  2. Bottommost-level writes are reduced.
  3. Global write throughput is not reduced.
  4. CPU usage is smoother.