Tracking: Automatically set parallelism for streaming jobs

shanicky commented 8 months ago

In the current implementation, users have to run risectl manually to scale the parallel units up or down after bringing a new compute node (CN) online or before taking one offline. This is cumbersome in some scenarios, so we need an automated scaling strategy to handle backend expansion and contraction.

The current design automatically controls the parallelism of a group of streaming jobs bound by NOSHUFFLE (or, in the future, Ensemble). Because the parallelism within such a group is bound, changing it through the user interface for any one of the streaming jobs cascades to the rest of the group.

Currently there are three strategies: Adaptive, Fixed and Custom.

Adaptive will automatically scale up and down to the current parallelism limit of the cluster.

Fixed will keep the parallelism fixed: it stays the same while cluster nodes go online and offline, but migrations will still be generated to balance traffic. Before the Ensemble feature goes live, we need a compatibility mode that defines the behavior when the cluster has fewer available parallel units than the fixed number.

Custom is a low-level mode prepared for the cloud team, intended for potential refined traffic control in the future. It’s not something we’re considering at the moment. No action will be taken on any fragment marked as ‘custom’ (if any exist).
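
As a rough illustration of the three strategies, here is a minimal Python sketch of how a scaling controller might pick a target parallelism. The names `Strategy`, `Policy` and `target_parallelism` are hypothetical and not taken from the RisingWave codebase. The clamping used for Fixed when the cluster is too small is only one possible compatibility behavior, assumed here for the sake of the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Strategy(Enum):
    ADAPTIVE = "adaptive"
    FIXED = "fixed"
    CUSTOM = "custom"


@dataclass(frozen=True)
class Policy:
    strategy: Strategy
    fixed_parallelism: Optional[int] = None  # only meaningful for FIXED


def target_parallelism(policy: Policy, current: int, available_units: int) -> int:
    """Return the parallelism a (hypothetical) controller would aim for."""
    if policy.strategy is Strategy.ADAPTIVE:
        # Follow the cluster: use every parallel unit that is currently available.
        return available_units
    if policy.strategy is Strategy.FIXED:
        if policy.fixed_parallelism is None:
            raise ValueError("FIXED requires fixed_parallelism")
        # ASSUMPTION: the compatibility mode clamps to what the cluster can offer.
        return min(policy.fixed_parallelism, available_units)
    # CUSTOM: the controller takes no action, so the parallelism stays as it is.
    return current
```

For example, a Fixed job with a target of 6 keeps 6 on an 8-unit cluster, while an Adaptive job on the same cluster would move to 8.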

[ ] Implement Ensemble based streaming job scheduler.
[x] #13302
[x] #13266
[x] Automated scaling strategy (also known as scaling controller).
- [x] Migration strategy for streaming jobs with fixed parallelism.
- [x] Dependency handling for streaming jobs with NOSHUFFLE.
- [ ] Traffic control.
[x] Interfaces (SQL or risectl) used to set the target degree of parallelism.
[x] Compatible with custom (i.e., low level) mode from cloud?

Question:

Regarding the parallelism transmission of NoShuffle, should modifications to the downstream tables of NoShuffle in turn affect its upstream, or should the transmission of NoShuffle always be unidirectional from top to bottom?
- Top-down transmission is reasonable and meaningful. If we also supported bottom-up transmission, a change to an intermediate node in a complex tree structure would be transmitted to its parent and sibling nodes, and could end up affecting the parallelism of the root MV in multi-layered structures. This is unstable and risky.
Due to the associative nature of the NOSHUFFLE relation (also known as ensemble), there may exist a dependency in which one fragment is set to fixed and another fragment is set to adaptive. In such a case, is there a need for an overriding priority level?
If a complex MV (with multiple levels of fragments) depends on another MV and immediately enters a hash exchange after the chain, then the parallelism of the subsequent multi-level fragments can actually be set independently. That is, this MV is not restricted by the NOSHUFFLE relation. So, does a modification of the upstream MV's parallelism need to be propagated to the entire downstream, or only to the fragments that depend on it through NoShuffle?
- According to the current design, it’s reasonable to only propagate to the chain.
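
The top-down rule with propagation limited to the chain can be sketched as a graph traversal. This is a simplified Python model, not RisingWave code: the function name `bound_fragments`, the fragment names and the dispatch labels `"no_shuffle"` and `"hash"` are assumptions for illustration. It only follows edges from upstream to downstream, and it stops at any edge that is not NoShuffle.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Maps a fragment to its outgoing edges: (downstream_fragment, dispatch_type).
Edges = Dict[str, List[Tuple[str, str]]]


def bound_fragments(edges: Edges, start: str) -> Set[str]:
    """Return the fragments whose parallelism follows `start` (top-down only)."""
    bound = {start}
    queue = deque([start])
    while queue:
        fragment = queue.popleft()
        for downstream, dispatch in edges.get(fragment, []):
            # Only NoShuffle edges bind parallelism; a hash exchange ends the chain.
            if dispatch == "no_shuffle" and downstream not in bound:
                bound.add(downstream)
                queue.append(downstream)
    return bound


# Hypothetical job graph: mv1 feeds a chain fragment through NoShuffle,
# which then goes through a hash exchange into the rest of the downstream MV.
edges: Edges = {
    "mv1": [("chain", "no_shuffle")],
    "chain": [("agg", "hash")],
    "agg": [("materialize", "no_shuffle")],
}
```

With this graph, a change on `mv1` reaches only `chain`; `agg` and `materialize` sit behind the hash exchange and can be set independently. Starting from `chain` never reaches `mv1`, which reflects the unidirectional choice.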

fuyufjh commented 7 months ago

Currently there are three strategies: Adaptive, Fixed and Custom.

This design totally makes sense to me, but I am a bit afraid that it may be too obscure for our users.

If we can agree that Custom is not that necessary, I would recommend following #12058, where @neverchanje and I proposed a syntax:

For MViews with streaming_parallelism = AUTO: Use Adaptive policy here
For MViews with streaming_parallelism = <number>: Use Fixed policy here (also corresponds to the status quo)
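
A minimal Python sketch of that mapping, assuming the setting arrives as a string or an integer. The function name `policy_from_setting` and the returned tuple shape are hypothetical, and rejecting non-positive numbers is an assumption rather than something specified in this thread.

```python
from typing import Optional, Tuple, Union


def policy_from_setting(value: Union[str, int]) -> Tuple[str, Optional[int]]:
    """Map a streaming_parallelism value to a (policy, parallelism) pair."""
    text = str(value).strip()
    if text.upper() == "AUTO":
        # AUTO selects the Adaptive policy; there is no fixed number to carry.
        return ("adaptive", None)
    try:
        number = int(text)
    except ValueError:
        raise ValueError(f"unsupported streaming_parallelism: {value!r}") from None
    if number <= 0:
        # ASSUMPTION: only positive integers are valid fixed parallelism values.
        raise ValueError("streaming_parallelism must be a positive integer")
    # Any valid number selects the Fixed policy with that parallelism.
    return ("fixed", number)
```

Here `policy_from_setting("AUTO")` yields the Adaptive policy and `policy_from_setting(4)` yields Fixed with a parallelism of 4.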

What do you think?

BTW, #13270 used a global parameter opts.enable_scale_in_when_recovery, which will break the design. We need to settle the design before releasing enable_scale_in_when_recovery.

lmatz commented 7 months ago

Linking https://github.com/risingwavelabs/risingwave/issues/12741, as I am not sure whether the table partition number we may want (equal to the streaming parallelism right now) would have some minor impact on this design.

risingwavelabs / risingwave

Tracking: Automatically set parallelism for streaming jobs #13140