st1page opened this issue 3 months ago · Status: Open
> Since we removed the previous timeout mechanism, in that case, if a CN is lost for even a short period of time, it will cause a cluster-wide rebalancing.
In that PR, the unregistration and subsequent automatic scale-in are only triggered when the shutdown is graceful, typically by applying a new configuration with a smaller replica value to Kubernetes. I'm not quite sure which scenario you are referring to by "the CN is lost for a short period of time", but I suppose it's not that case.

Regarding the proposed solution, I'm concerned it's not user-friendly. Note that by deprecating the low-level scaling interface, we are aiming to simplify the scaling process and make it as seamless as possible.
> We found that it takes 10 minutes to move 7000 actors at 8 parallelism
I'm actually more curious about the background of this issue. Have we traced where these 10 minutes are being spent? I suppose that rewriting the metadata should be lightweight. Instead, is it because creating the actors on the compute nodes takes quite a bit of time? If so, does it mean that the bootstrapping procedure from a full restart of the cluster will take a corresponding amount of time, which becomes the root problem to solve?
With a high degree of horizontal parallelism, the current table fragments mechanism can lead to significant meta store writes when updating graphs. In the future, with a SQL-based backend, the data of actors within fragments will be compressed.
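To make the write-amplification concern concrete, here is a back-of-envelope sketch in plain Rust. It does not use RisingWave's real metadata structures, and the per-actor sizes are assumptions; it only illustrates why storing all actors of a fragment as one value forces a reschedule to rewrite far more bytes than a row-per-actor layout would.

```rust
// Back-of-envelope model, NOT RisingWave's real metadata structures: compare how many
// bytes a reschedule has to rewrite when actors are embedded in one fragment value
// versus stored as individual rows.

struct ActorMeta {
    actor_id: u32,
    worker_id: u32,
    payload_bytes: usize, // dispatchers, vnode bitmaps, etc. make each entry non-trivial
}

struct FragmentMeta {
    fragment_id: u32,
    actors: Vec<ActorMeta>,
}

impl FragmentMeta {
    /// Approximate serialized size when the whole fragment is stored as a single value.
    fn encoded_size(&self) -> usize {
        32 + self.actors.iter().map(|a| a.payload_bytes + 16).sum::<usize>()
    }
}

fn main() {
    // Assumed shape roughly matching the report: 7000 actors in total.
    let actors_per_fragment: u32 = 8;
    let fragments: Vec<FragmentMeta> = (0..7000 / actors_per_fragment)
        .map(|fid| FragmentMeta {
            fragment_id: fid,
            actors: (0..actors_per_fragment)
                .map(|i| ActorMeta {
                    actor_id: fid * actors_per_fragment + i,
                    worker_id: 1,
                    payload_bytes: 512, // assumed average per-actor metadata size
                })
                .collect(),
        })
        .collect();

    // Moving actors to another worker: the "one value per fragment" layout must rewrite
    // every affected fragment in full, while a per-actor schema only updates the
    // scheduling column of the affected rows.
    let fragment_blob_rewrite: usize = fragments.iter().map(|f| f.encoded_size()).sum();
    let per_actor_row_update: usize = fragments.iter().map(|f| f.actors.len() * 8).sum();

    println!("rewrite whole fragment values: ~{fragment_blob_rewrite} bytes");
    println!("update per-actor rows:         ~{per_actor_row_update} bytes");
}
```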
> We found that it takes 10 minutes to move 7000 actors at 8 parallelism to another node. During those ten minutes the cluster was unavailable.
I'd like to try the SQL backend first. As far as we know now, the only possible root cause is etcd's performance issue, as @shanicky mentioned above. In the SQL meta backend we moved to a better schema design that solves the "significant meta store writes" problem. If the issue just disappears with the SQL backend, no problem at all.
FYI, the encoded data should already be compressed with the etcd backend since https://github.com/risingwavelabs/risingwave/pull/17315.
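For reference, transparent value compression in front of a meta store generally looks like the sketch below. This is only an illustration of the general idea, assuming the `zstd` crate is available; it is not the code from the PR linked above.

```rust
// Minimal illustration (not the actual change in the referenced PR): compress the
// already-encoded metadata value before it is written to the etcd backend, and
// decompress it on read.

use std::io::Result;

/// Hypothetical helper: compress an encoded metadata value before putting it into etcd.
fn compress_value(encoded: &[u8]) -> Result<Vec<u8>> {
    // Level 3 is a reasonable default trade-off between speed and ratio.
    zstd::encode_all(encoded, 3)
}

/// Hypothetical helper: decompress a value read back from etcd.
fn decompress_value(stored: &[u8]) -> Result<Vec<u8>> {
    zstd::decode_all(stored)
}

fn main() -> Result<()> {
    // A repetitive payload stands in for protobuf-encoded fragment metadata.
    let encoded = b"actor metadata ".repeat(1000);
    let stored = compress_value(&encoded)?;
    let restored = decompress_value(&stored)?;
    assert_eq!(encoded, restored);
    println!("{} bytes -> {} bytes on the wire", encoded.len(), stored.len());
    Ok(())
}
```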
Thanks for the detailed explanation. So let's wait and see the scaling speed on the SQL meta backend 🥰
In a recent real-world scenario using auto scaling, we found that it takes 10 minutes to move 7000 actors at 8 parallelism to another node. During those ten minutes the cluster was unavailable.
We might need a way for meta to evaluate the cost before initiating auto scaling, and perhaps avoid it if it would pause the cluster for too long.
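For example (all names and the throughput constant below are assumptions for illustration, not existing RisingWave code), meta could extrapolate from an observed migration rate and only proceed automatically when the estimated pause fits within a budget:

```rust
// Rough sketch of a pre-check meta could run before kicking off an automatic reschedule:
// estimate how long the migration would pause the cluster and skip (or warn) if it
// exceeds a budget. Names and the per-actor cost are illustrative assumptions.

use std::time::Duration;

struct RescheduleEstimate {
    actors_to_move: usize,
    parallelism: usize,
}

impl RescheduleEstimate {
    /// Estimate the pause duration from an observed per-actor migration cost.
    /// The data point above (7000 actors at parallelism 8 ~ 10 min) gives roughly
    /// 0.7 s per actor at parallelism 1.
    fn estimated_pause(&self) -> Duration {
        let secs_per_actor = 0.7;
        let secs = self.actors_to_move as f64 * secs_per_actor / self.parallelism as f64;
        Duration::from_secs_f64(secs)
    }

    fn should_auto_reschedule(&self, max_pause: Duration) -> bool {
        self.estimated_pause() <= max_pause
    }
}

fn main() {
    let estimate = RescheduleEstimate {
        actors_to_move: 7000,
        parallelism: 8,
    };
    let budget = Duration::from_secs(60);
    println!(
        "estimated pause: {:?}, proceed automatically: {}",
        estimate.estimated_pause(),
        estimate.should_auto_reschedule(budget)
    );
}
```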
We might discuss https://github.com/risingwavelabs/risingwave/pull/17662#pullrequestreview-2198466314 again. Since we removed the previous timeout mechanism, in that case, if a CN is lost for even a short period of time, it will cause a cluster-wide rebalancing. I can come up with these methods (a rough sketch follows the list):
- `min_cores_to_start`: the cluster will only start the streaming jobs and try to rebalance when the current number of CPU cores is larger than `min_cores_to_start`.
- `compute_nodes_list`: a list containing the compute nodes' IPs. The cluster will only start the streaming jobs and try to rebalance when all of the listed compute nodes have registered with the meta node.
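Below is a minimal sketch of how meta could gate streaming startup and rebalancing on these two conditions. The option and type names are hypothetical, not an existing RisingWave configuration.

```rust
// Hypothetical startup-gating check in meta: only start the streaming graph and
// rebalance actors once enough cores are back and/or all expected compute nodes
// have registered.

use std::collections::HashSet;
use std::net::SocketAddr;

/// Hypothetical startup-gating options in the meta configuration.
struct StartupGate {
    /// Only start streaming and rebalance once the registered cores reach this number.
    min_cores_to_start: Option<usize>,
    /// Only start streaming and rebalance once all of these compute nodes have registered.
    compute_nodes_list: Option<HashSet<SocketAddr>>,
}

/// A registered compute node as seen by meta (simplified).
struct ComputeNode {
    addr: SocketAddr,
    parallelism: usize, // number of CPU cores / streaming workers on this node
}

impl StartupGate {
    /// Decide whether meta may start the streaming graph and rebalance actors now.
    fn ready(&self, registered: &[ComputeNode]) -> bool {
        let cores: usize = registered.iter().map(|n| n.parallelism).sum();
        let cores_ok = self.min_cores_to_start.map_or(true, |min| cores >= min);

        let nodes_ok = self.compute_nodes_list.as_ref().map_or(true, |expected| {
            let seen: HashSet<SocketAddr> = registered.iter().map(|n| n.addr).collect();
            expected.is_subset(&seen)
        });

        cores_ok && nodes_ok
    }
}

fn main() {
    let gate = StartupGate {
        min_cores_to_start: Some(16),
        compute_nodes_list: None,
    };
    let registered = vec![ComputeNode {
        addr: "10.0.0.1:5688".parse().unwrap(),
        parallelism: 8,
    }];
    // Only 8 of the required 16 cores are back, so meta keeps waiting instead of
    // rebalancing everything onto the single available node.
    assert!(!gate.ready(&registered));
}
```

Either condition alone would already avoid the failure mode above: a briefly lost CN re-registers before the gate opens, so no cluster-wide rebalancing is triggered.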