risingwavelabs / risingwave

SQL stream processing, analytics, and management. We decouple storage and compute to offer instant failover, dynamic scaling, speedy bootstrapping, and efficient joins.
https://www.risingwave.com/slack
Apache License 2.0

Discussion: ensure scale can complete in time #15490

Open hzxa21 opened 3 months ago

hzxa21 commented 3 months ago

When barrier latency is high due to insufficient parallelism, users may want to scale their streaming jobs, provided resources are sufficient, to accelerate computation.

The current implementation of scaling has the following properties:

  1. Irrelevant actors won't be dropped and rebuilt.
  2. It relies on barriers (Pause, ConfigChange, Resume) to complete the scaling process.
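To make property 2 concrete, here is a rough back-of-the-envelope sketch in Python. It assumes (this is my simplification, not necessarily how the meta node actually schedules things) that the three scaling barriers are handled strictly one after another, that each one takes a full barrier latency, and that any barriers already in flight must drain first. `BarrierKind` and `scaling_duration_secs` are hypothetical names used only for illustration.

```python
from enum import Enum


class BarrierKind(Enum):
    # Hypothetical enum; the names mirror the barriers listed above.
    PAUSE = "Pause"
    CONFIG_CHANGE = "ConfigChange"
    RESUME = "Resume"


# Barriers that a scaling operation needs to push through the graph.
SCALING_BARRIERS = [BarrierKind.PAUSE, BarrierKind.CONFIG_CHANGE, BarrierKind.RESUME]


def scaling_duration_secs(barrier_latency_secs: float, in_flight_barriers: int = 0) -> float:
    """Toy estimate of how long scaling takes.

    Assumption: barriers are processed sequentially and each one costs
    `barrier_latency_secs`, including those already queued ahead of Pause.
    """
    if barrier_latency_secs < 0 or in_flight_barriers < 0:
        raise ValueError("latency and in-flight count must be non-negative")
    total_barriers = in_flight_barriers + len(SCALING_BARRIERS)
    return total_barriers * barrier_latency_secs


if __name__ == "__main__":
    # Healthy cluster: 1s barrier latency, nothing queued.
    print(scaling_duration_secs(1.0))
    # Struggling cluster: 60s barrier latency, two barriers already in flight.
    print(scaling_duration_secs(60.0, in_flight_barriers=2))
```

Under this toy model the time to scale grows linearly with barrier latency, so scaling is slowest exactly when it is needed most.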

This results in a dilemma:

This makes me think this is a flaw in the current scaling mechanism that we should fix. Some ideas after discussion with @wenym1:

  • Find the first aligned barrier in the source and transform it into a Pause barrier to trigger scaling immediately.
  • Make scaling not rely on barriers (@wenym1 can comment more on the details).

BugenZhao commented 1 week ago

I believe https://github.com/risingwavelabs/risingwave/issues/13396 can be addressed by adopting a very similar idea if we find it feasible. BTW, is it possible now to share more details?