risingwavelabs / risingwave


Considering auto scaling's behavior when there are many streaming jobs #17927

Open st1page opened 3 months ago

st1page commented 3 months ago

In a recent real-world scenario using auto scaling, we found that it takes 10 minutes to move 7000 actors at 8 parallelism to another node. During those ten minutes the cluster was unavailable.

BugenZhao commented 3 months ago
  • Since we removed the previous timeout mechanism, in this case, if the CN is lost for a short period of time, it will cause a cluster-wide rebalancing.

In that PR, the unregistration and the subsequent auto scale-in will only be triggered when the shutdown is graceful, typically by applying a new configuration with a smaller replica value to Kubernetes. I'm not quite sure which scenario you are referring to by "the CN is lost for a short period of time", but I suppose that's not the case here.

Regarding the proposed solution, I'm concerned that it's not user-friendly. Note that by deprecating the low-level scaling interface, we are aiming to simplify the scaling process and make it as seamless as possible.

We found that it takes 10 minutes to move 7000 actors at 8 parallelism

I'm actually more curious about the background of this issue. Have we traced where these 10 minutes are being spent? I suppose that rewriting the metadata should be lightweight. Instead, is it because creating the actors on the compute nodes takes quite a bit of time? If so, does it mean that the bootstrapping procedure from a full restart of the cluster will take a corresponding amount of time, which becomes the root problem to solve?
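For a rough sense of scale, here is a back-of-the-envelope calculation using only the numbers reported in this issue; it says nothing about where the time actually goes, which is exactly the open question above:

```rust
fn main() {
    // Aggregate numbers reported in this thread: ~7000 actors moved in ~10 minutes.
    // How the time splits between metadata rewriting and actor creation is unknown.
    let actors = 7_000.0_f64;
    let total_secs = 10.0 * 60.0;

    println!(
        "{:.1} actors/s, ~{:.0} ms of end-to-end budget per actor",
        actors / total_secs,
        total_secs * 1000.0 / actors,
    );
}
```

That works out to roughly 12 actors per second, or about 86 ms per actor across meta-store writes, barrier handling, and actor creation combined.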

shanicky commented 3 months ago

With a high degree of horizontal parallelism, the current table fragments mechanism can lead to significant meta store writes when updating graphs. In the future, with a SQL-based backend, the data of actors within fragments will be compressed.
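A minimal cost model of the write amplification described above (all counts and sizes are hypothetical, not measured from RisingWave): with a blob-per-job `TableFragments`-style layout, rescheduling even a few actors rewrites the whole encoded value of each affected job, whereas a row-per-actor layout only touches the moved actors.

```rust
fn main() {
    // Hypothetical cluster shape, loosely inspired by the numbers in this thread:
    // ~7000 actors in total spread over many streaming jobs.
    let jobs = 100u64;
    let actors_per_job = 70u64;
    let encoded_actor_bytes = 2_000u64; // assumed average encoded size per actor
    let moved_actors_per_job = 8u64;    // actors actually rescheduled per job

    // Blob-per-job layout: any change to a job rewrites its entire encoded value.
    let blob_bytes = jobs * actors_per_job * encoded_actor_bytes;

    // Row-per-actor layout: only the rows of the moved actors are rewritten.
    let row_bytes = jobs * moved_actors_per_job * encoded_actor_bytes;

    println!(
        "blob layout rewrites ~{:.1} MB, row layout rewrites ~{:.1} MB per reschedule",
        blob_bytes as f64 / 1e6,
        row_bytes as f64 / 1e6,
    );
}
```

Under this toy model the blob layout writes roughly an order of magnitude more metadata per reschedule; the real ratio depends on actual proto sizes and how the transactions are batched against the meta store.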

fuyufjh commented 3 months ago

We found that it takes 10 minutes to move 7000 actors at 8 parallelism to another node. In ten minutes the cluster was unavailable.

I'd like to try the SQL backend first. As far as we know now, the only plausible root cause is etcd's performance issue, as @shanicky mentioned above. In the SQL meta backend we moved to a better schema design to address the "significant meta store writes". If the issue simply disappears with the SQL backend, no problem at all.
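To make the schema-design point concrete, a hypothetical illustration (the `actor` table and `worker_id` column are invented names, not necessarily RisingWave's actual SQL meta schema): under a relational layout, moving an actor can be a narrow row update rather than a rewrite of a job-wide blob.

```rust
fn main() {
    // Hypothetical (actor_id, new_worker_id) pairs for a reschedule;
    // the IDs and the SQL below are for illustration only.
    let moves = [(1001u32, 7u32), (1002, 7), (1003, 7)];

    // Bytes written scale with the number of moved actors,
    // not with the total size of the job's fragment graph.
    for (actor_id, worker_id) in moves {
        println!("UPDATE actor SET worker_id = {worker_id} WHERE actor_id = {actor_id};");
    }
}
```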

BugenZhao commented 3 months ago

FYI, the encoded data should be already compressed with etcd backend since https://github.com/risingwavelabs/risingwave/pull/17315.
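For readers unfamiliar with what "compressed" means here, a minimal sketch of the general idea, not the actual code path of that PR (the zstd codec, compression level, and synthetic payload are assumptions for illustration):

```rust
// Cargo.toml (assumed): zstd = "0.13"
use std::io;

fn main() -> io::Result<()> {
    // Stand-in for an encoded TableFragments value: protobuf-encoded actor
    // metadata tends to be repetitive, so it compresses well.
    let encoded: Vec<u8> = (0..7_000u32)
        .flat_map(|actor_id| {
            let mut row = actor_id.to_le_bytes().to_vec();
            row.extend_from_slice(b"fragment=1;vnode_bitmap=ffffffff;");
            row
        })
        .collect();

    // Compress before writing the value to the meta store, decompress on read.
    let compressed = zstd::encode_all(&encoded[..], 3)?;
    let restored = zstd::decode_all(&compressed[..])?;
    assert_eq!(restored, encoded);

    println!(
        "{} bytes -> {} bytes ({:.1}x smaller)",
        encoded.len(),
        compressed.len(),
        encoded.len() as f64 / compressed.len() as f64
    );
    Ok(())
}
```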

st1page commented 3 months ago

Thanks for the detailed explanation. So let's wait and see the scaling speed on the SQL meta backend 🥰