In RisingWave, configuration changes such as DDL statements or modifications to parallelism are bound to a specific epoch, with the operation information spreading through the graph along with the corresponding barriers.
Currently, the meta node waits until all actors have received the barrier before completing the operation. I am considering whether it might be possible to only wait for the completion of barriers related to specific parts of the graph.
There are two main motivations for this consideration:
We previously discussed decoupling checkpoints, using a log store to decouple upstream and downstream checkpoints. One issue is that the log store does not handle configuration changes well. Even if it can store them, this means the configuration change has to wait a long time before it is actually consumed on the graph. Now, we are preparing to discuss the use of the log store during backfill, and I believe there will be similar issues. This method still does not support changing the parallelism of a materialized view (MV) that is being created, or creating an MV on a MV that is being created, but it solves the problem of changing other streaming jobs.
During on-call support, I often encounter users complaining about the slow completion of DDL operations, and upon investigation, I find that the barrier latency is too high. I intuitively feel that some users are not really concerned about the end-to-end latency but are affected by the fact that we have forcibly bound the control plane's latency to the barrier latency. By only waiting for the barriers of the relevant actors, we can reduce unnecessary waiting for other slow actors.
To be honest, I'm not very familiar with the parts related to the meta node, so I can't assess the engineering difficulty and workload. Furthermore, after writing this issue, I realized that it only alleviates some problems, but many key issues remain unresolved. Therefore, it might just be a feature that would be nice to have, but not critical.
In RisingWave, configuration changes such as DDL statements or modifications to parallelism are bound to a specific epoch, with the operation information spreading through the graph along with the corresponding barriers.
Currently, the meta node waits until all actors have received the barrier before completing the operation. I am considering whether it might be possible to only wait for the completion of barriers related to specific parts of the graph.
There are two main motivations for this consideration:
We previously discussed decoupling checkpoints, using a log store to decouple upstream and downstream checkpoints. One issue is that the log store does not handle configuration changes well. Even if it can store them, this means the configuration change has to wait a long time before it is actually consumed on the graph. Now, we are preparing to discuss the use of the log store during backfill, and I believe there will be similar issues. This method still does not support changing the parallelism of a materialized view (MV) that is being created, or creating an MV on a MV that is being created, but it solves the problem of changing other streaming jobs.
During on-call support, I often encounter users complaining about the slow completion of DDL operations, and upon investigation, I find that the barrier latency is too high. I intuitively feel that some users are not really concerned about the end-to-end latency but are affected by the fact that we have forcibly bound the control plane's latency to the barrier latency. By only waiting for the barriers of the relevant actors, we can reduce unnecessary waiting for other slow actors.
To be honest, I'm not very familiar with the parts related to the meta node, so I can't assess the engineering difficulty and workload. Furthermore, after writing this issue, I realized that it only alleviates some problems, but many key issues remain unresolved. Therefore, it might just be a feature that would be nice to have, but not critical.
cc @fuyufjh @yezizp2012 @shanicky