risingwavelabs / risingwave


Considering auto scaling's behavior when there are many streaming jobs #17927

Open st1page opened 3 months ago

st1page commented 3 months ago

In a recent real-world scenario using auto scaling, we found that it takes 10 minutes to move 7000 actors at 8 parallelism to another node. During those ten minutes the cluster was unavailable.

BugenZhao commented 3 months ago
  • Since we removed the previous timeout mechanism, in this case, if the CN is lost for a short period of time, it will cause a cluster-wide rebalancing.

In that PR, the unregistration and the subsequent auto scale-in will only be triggered when the shutdown is graceful, typically by applying a new configuration with a smaller replica value to Kubernetes. I'm not quite sure which scenario you are referring to by "the CN is lost for a short period of time", but I suppose that's not the case here.

Regarding the proposed solution, I'm concerned that it's not user-friendly. Note that by deprecating the low-level scaling interface, we are aiming to simplify the scaling process and make it as seamless as possible.

We found that it takes 10 minutes to move 7000 actors at 8 parallelism

I'm actually more curious about the background of this issue. Have we traced where these 10 minutes are being spent? I suppose that rewriting the metadata should be lightweight. Instead, is it because creating the actors on the compute nodes takes quite a bit of time? If so, does it mean that the bootstrapping procedure from a full restart of the cluster will take a corresponding amount of time, which becomes the root problem to solve?
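For a rough sense of scale, here is a back-of-the-envelope calculation using only the numbers reported in this issue; it says nothing about where the time actually goes, which is exactly the open question above:

```rust
fn main() {
    // Aggregate numbers reported in this thread: ~7000 actors moved in ~10 minutes.
    // How the time splits between metadata rewriting and actor creation is unknown.
    let actors = 7_000.0_f64;
    let total_secs = 10.0 * 60.0;

    println!(
        "{:.1} actors/s, ~{:.0} ms of end-to-end budget per actor",
        actors / total_secs,
        total_secs * 1000.0 / actors,
    );
}
```

That works out to roughly 12 actors per second, or about 86 ms per actor across meta-store writes, barrier handling, and actor creation combined.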

shanicky commented 3 months ago

With a high degree of horizontal parallelism, the current table fragments mechanism can lead to significant meta store writes when updating graphs. In the future, with a SQL-based backend, the data of actors within fragments will be compressed.
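A minimal cost model of the write amplification described above (all counts and sizes are hypothetical, not measured from RisingWave): with a blob-per-job `TableFragments`-style layout, rescheduling even a few actors rewrites the whole encoded value of each affected job, whereas a row-per-actor layout only touches the moved actors.

```rust
fn main() {
    // Hypothetical cluster shape, loosely inspired by the numbers in this thread:
    // ~7000 actors in total spread over many streaming jobs.
    let jobs = 100u64;
    let actors_per_job = 70u64;
    let encoded_actor_bytes = 2_000u64; // assumed average encoded size per actor
    let moved_actors_per_job = 8u64;    // actors actually rescheduled per job

    // Blob-per-job layout: any change to a job rewrites its entire encoded value.
    let blob_bytes = jobs * actors_per_job * encoded_actor_bytes;

    // Row-per-actor layout: only the rows of the moved actors are rewritten.
    let row_bytes = jobs * moved_actors_per_job * encoded_actor_bytes;

    println!(
        "blob layout rewrites ~{:.1} MB, row layout rewrites ~{:.1} MB per reschedule",
        blob_bytes as f64 / 1e6,
        row_bytes as f64 / 1e6,
    );
}
```

Under this toy model the blob layout writes roughly an order of magnitude more metadata per reschedule; the real ratio depends on actual proto sizes and how the transactions are batched against the meta store.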

fuyufjh commented 3 months ago

We found that it takes 10 minutes to move 7000 actors at 8 parallelism to another node. In ten minutes the cluster was unavailable.

I'd like to try the SQL backend first. As far as we know now, the only plausible root cause is etcd's performance issue, as @shanicky mentioned above. In the SQL meta backend we moved to a better schema design to address the "significant meta store writes". If the issue simply disappears with the SQL backend, no problem at all.
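To make the schema-design point concrete, a hypothetical illustration (the `actor` table and `worker_id` column are invented names, not necessarily RisingWave's actual SQL meta schema): under a relational layout, moving an actor can be a narrow row update rather than a rewrite of a job-wide blob.

```rust
fn main() {
    // Hypothetical (actor_id, new_worker_id) pairs for a reschedule;
    // the IDs and the SQL below are for illustration only.
    let moves = [(1001u32, 7u32), (1002, 7), (1003, 7)];

    // Bytes written scale with the number of moved actors,
    // not with the total size of the job's fragment graph.
    for (actor_id, worker_id) in moves {
        println!("UPDATE actor SET worker_id = {worker_id} WHERE actor_id = {actor_id};");
    }
}
```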

BugenZhao commented 3 months ago

FYI, the encoded data should be already compressed with etcd backend since https://github.com/risingwavelabs/risingwave/pull/17315.
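For readers unfamiliar with what "compressed" means here, a minimal sketch of the general idea, not the actual code path of that PR (the zstd codec, compression level, and synthetic payload are assumptions for illustration):

```rust
// Cargo.toml (assumed): zstd = "0.13"
use std::io;

fn main() -> io::Result<()> {
    // Stand-in for an encoded TableFragments value: protobuf-encoded actor
    // metadata tends to be repetitive, so it compresses well.
    let encoded: Vec<u8> = (0..7_000u32)
        .flat_map(|actor_id| {
            let mut row = actor_id.to_le_bytes().to_vec();
            row.extend_from_slice(b"fragment=1;vnode_bitmap=ffffffff;");
            row
        })
        .collect();

    // Compress before writing the value to the meta store, decompress on read.
    let compressed = zstd::encode_all(&encoded[..], 3)?;
    let restored = zstd::decode_all(&compressed[..])?;
    assert_eq!(restored, encoded);

    println!(
        "{} bytes -> {} bytes ({:.1}x smaller)",
        encoded.len(),
        compressed.len(),
        encoded.len() as f64 / compressed.len() as f64
    );
    Ok(())
}
```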

st1page commented 3 months ago

Thanks for the detailed explanation. So let's wait and see the scaling speed on the SQL meta backend 🥰