risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.

Barrier interval tuning for backfilling #14369

Open chenzl25 opened 10 months ago

chenzl25 commented 10 months ago

I conducted a simple experiment to answer two questions:

  1. What is the gap between batch query performance and streaming MV backfilling performance?
  2. What is the best barrier interval for backfilling?

The experiment shows that there is an optimal barrier interval for backfilling, which can be neither too high nor too low (assuming one checkpoint per barrier). A low barrier interval causes more checkpoints, more backfill iterator recreations, and more frequent aggregation emission. In contrast, with a high barrier interval, backfill might have to wait for the first barrier while doing nothing? I'm not quite sure. Even the best streaming time (25s) still lags behind the batch query time (11s). Could we shorten this gap?
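One way to reason about the low-interval side is a toy cost model. This is only a sketch under assumptions that are not from the experiment: it supposes each barrier interval loses a fixed `overhead_ms` (checkpointing, iterator recreation, aggregation emission) during which backfill makes no progress, and that the remaining time is spent on useful work. The function name and the values `25_000` and `190` are hypothetical placeholders, not measurements.

```python
import math


def modeled_backfill_ms(work_ms, interval_ms, overhead_ms):
    """Toy model: each barrier interval loses `overhead_ms` to per-barrier costs.

    Hypothetical model, not RisingWave's actual behavior.
    """
    if overhead_ms >= interval_ms:
        # All of every interval is consumed by overhead: backfill never finishes.
        return math.inf
    useful_fraction = 1.0 - overhead_ms / interval_ms
    return work_ms / useful_fraction


# Assumed values: 25 s of pure backfill work, 190 ms lost per barrier.
for interval_ms in (200, 300, 500, 1000, 2500):
    print(interval_ms, round(modeled_backfill_ms(25_000, interval_ms, 190)))
```

In such a model the total time grows sharply as the interval approaches the per-barrier overhead, which is at least the same shape as a steep slowdown at very small intervals. It says nothing about the high-interval side.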

Experiment:

CREATE TABLE t (a CHARACTER VARYING)

Table t has 12,000,000 rows, is about 1.9 GiB in size, and has 32 partitions (rw_parallelism).

Cluster: 1 compute node, 2c4g (2 CPU cores, 4 GB memory)

Batch:

SELECT count(*) FROM t;

Time: 11317.226 ms (00:11.317)

Streaming:

CREATE MATERIALIZED VIEW v AS SELECT count(*) FROM t;
| barrier_interval_ms | Time |
|---|---|
| 200 | 611134.660 ms (10:11.135) |
| 300 | 123931.115 ms (02:03.931) |
| 500 | 51352.134 ms (00:51.352) |
| 1000 | 34337.459 ms (00:34.337) |
| 1500 | 30336.477 ms (00:30.336) |
| 2000 | 28353.186 ms (00:28.353) |
| 2500 | 25316.983 ms (00:25.317) |
| 2800 | 28338.540 ms (00:28.339) |
| 3000 | 30350.556 ms (00:30.351) |
| 4000 | 32435.967 ms (00:32.436) |
| 5000 | 30385.330 ms (00:30.385) |
| 8000 | 32430.144 ms (00:32.430) |
| 10000 | 40395.891 ms (00:40.396) |
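
To put numbers on the gap between batch and streaming, the following sketch picks the fastest streaming run from the timings reported above and compares it with the batch query. The timings and row count are copied from the experiment; the variable names are only illustrative.

```python
# Timings copied from the experiment above (milliseconds).
BATCH_MS = 11317.226
ROWS = 12_000_000
STREAMING_MS = {
    200: 611134.660, 300: 123931.115, 500: 51352.134, 1000: 34337.459,
    1500: 30336.477, 2000: 28353.186, 2500: 25316.983, 2800: 28338.540,
    3000: 30350.556, 4000: 32435.967, 5000: 30385.330, 8000: 32430.144,
    10000: 40395.891,
}

# Barrier interval with the shortest MV creation time.
best_interval = min(STREAMING_MS, key=STREAMING_MS.get)
best_ms = STREAMING_MS[best_interval]

# How many times slower the best streaming run is than the batch query.
slowdown = best_ms / BATCH_MS

# Rows processed per second in each mode.
batch_rows_per_s = ROWS / (BATCH_MS / 1000)
stream_rows_per_s = ROWS / (best_ms / 1000)

print(f"best barrier_interval_ms: {best_interval}")
print(f"slowdown vs batch: {slowdown:.2f}x")
print(f"batch: {batch_rows_per_s:,.0f} rows/s, streaming: {stream_rows_per_s:,.0f} rows/s")
```

On this data set the best streaming run is roughly 2.2x slower than the batch query.
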

BugenZhao commented 10 months ago

> because a low barrier interval could cause more .. backfill iterators recreations

IIUC, the backfill loop cycle does not have to be synced with the barrier interval. So what about tuning the "loop frequency" of the backfill executor to check how much this factor matters? Also, considering that the barrier interval is a global parameter, I guess this could eventually be a less invasive optimization.

> 8000 32430.144 ms (00:32.430)
> 10000 40395.891 ms (00:40.396)

Since the backfill only finishes on a barrier boundary, is it possible that most of the work was already done by 24s and 30s (or even earlier) for the barrier intervals of 8s and 10s, respectively? If so, there seems to be not much difference from the 2~3s intervals. Perhaps we need a larger amount of data.
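
The boundary effect can be sketched as follows. The helper below is hypothetical and assumes the measured time is a barrier boundary plus an overhead smaller than one interval; under that assumption, the real backfill work finished somewhere in the last full interval before that boundary.

```python
import math


def work_window_ms(measured_ms, interval_ms):
    """Return (lower, upper) bounds on when the backfill work could have ended.

    Assumes completion is only observed at a barrier boundary, and that the
    measured time exceeds that boundary by less than one interval.
    """
    barriers = math.floor(measured_ms / interval_ms)  # boundary that reported completion
    return ((barriers - 1) * interval_ms, barriers * interval_ms)


# Timings copied from the experiment above.
for interval_ms, measured_ms in [(8000, 32430.144), (10000, 40395.891)]:
    low, high = work_window_ms(measured_ms, interval_ms)
    print(f"interval {interval_ms} ms: work ended between {low} and {high} ms")
```

For the 8s and 10s runs this gives windows starting at 24s and 30s, so the reported 32s and 40s may overstate the actual backfill time by up to one barrier interval.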

chenzl25 commented 10 months ago

> IIUC, the backfill loop cycle does not have to be synced with the barrier interval. So what about tuning the "loop frequency" of the backfill executor to check how much this factor matters? Also, considering that the barrier interval is a global parameter, I guess this could eventually be a less invasive optimization.

+1. If the backfill loop interval is too small, it hurts backfill throughput and eventually leads to a long MV creation time.
