risingwavelabs / risingwave


Discussion(meta): while configuring, only wait for the barriers related to the specific parts of the graph to be collected #17422

Open st1page opened 2 months ago

st1page commented 2 months ago

In RisingWave, configuration changes such as DDL statements or parallelism modifications are bound to a specific epoch, and the operation information propagates through the graph along with the corresponding barrier.

Currently, the meta node waits until all actors have received the barrier before completing the operation. I am wondering whether it would be possible to wait only for the barriers related to the specific parts of the graph that the operation touches.

There are two main motivations for this consideration:

  1. We previously discussed decoupling checkpoints, using a log store to decouple upstream and downstream checkpoints. One issue is that the log store does not handle configuration changes well. Even if it can store them, the configuration change has to wait a long time before it is actually consumed on the graph. We are now preparing to discuss using the log store during backfill, and I believe similar issues will arise there. The proposal in this issue still does not support changing the parallelism of a materialized view (MV) that is being created, or creating an MV on an MV that is being created, but it does solve the problem of changing other streaming jobs.

  2. During on-call support, I often encounter users complaining about the slow completion of DDL operations, and upon investigation, I find that the barrier latency is too high. I intuitively feel that some users are not really concerned about the end-to-end latency but are affected by the fact that we have forcibly bound the control plane's latency to the barrier latency. By only waiting for the barriers of the relevant actors, we can reduce unnecessary waiting for other slow actors.
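To make the idea concrete, here is a minimal sketch of the difference between waiting for every actor and waiting only for the actors related to the change. It is a toy model, not RisingWave's meta code: `BarrierTracker`, `relevant_actors`, `completion_time`, the actor IDs, and the arrival times are all hypothetical names and made-up numbers chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BarrierTracker:
    """Toy model of barrier collection for one epoch (hypothetical, not RisingWave code)."""

    all_actors: frozenset
    relevant_actors: frozenset  # assumed: actors touched by the configuration change
    collected: set = field(default_factory=set)

    def collect(self, actor_id):
        """Record that an actor has reported the barrier for this epoch."""
        if actor_id not in self.all_actors:
            raise ValueError(f"unknown actor {actor_id}")
        self.collected.add(actor_id)

    def fully_collected(self):
        # Behaviour described in this issue as current: wait for every actor.
        return self.collected >= self.all_actors

    def change_complete(self):
        # Proposed behaviour: wait only for the actors related to the change.
        return self.collected >= self.relevant_actors


def completion_time(arrival_ms, waited_on):
    """Time at which the last waited-on actor reports the barrier."""
    return max(arrival_ms[a] for a in waited_on)


# Made-up arrival times in milliseconds; actor 4 is an unrelated slow actor.
arrival_ms = {1: 20, 2: 35, 3: 40, 4: 9000}
tracker = BarrierTracker(frozenset(arrival_ms), frozenset({1, 2}))

# Deliver barriers in arrival order and stop once the change can be acknowledged.
for actor in sorted(arrival_ms, key=arrival_ms.get):
    tracker.collect(actor)
    if tracker.change_complete():
        break

wait_relevant = completion_time(arrival_ms, tracker.relevant_actors)
wait_all = completion_time(arrival_ms, tracker.all_actors)
```

In this toy run the change can be acknowledged as soon as actors 1 and 2 have reported, while the full collection is still blocked on the slow actor 4. The sketch ignores everything that makes the real problem hard, such as how the relevant actor set would be derived and how later barriers interact with a partially acknowledged change.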

To be honest, I'm not very familiar with the parts related to the meta node, so I can't assess the engineering difficulty and workload. Furthermore, after writing this issue, I realized that it only alleviates some problems, but many key issues remain unresolved. Therefore, it might just be a feature that would be nice to have, but not critical.

cc @fuyufjh @yezizp2012 @shanicky

BugenZhao commented 2 months ago

Linking to #15490 and #13396, which share quite similar motivations and proposals.