weidalin opened 1 month ago
Hi @weidalin, thank you for your interest in this feature.

Actually, this feature was originally designed to support batch selection of resources through `resourceSelector`. However, during later discussion we worried that batch rescheduling is a dangerous operation that users might not dare to use, or could easily misuse, so we deliberately removed this entrance.

Simply put, there was a lack of user demand at that time. If more real users call for `resourceSelector`, we are very happy to consider supporting this ability.
I'm thinking of moving forward with the plan you mentioned. Could you provide me with some information?
> Could you briefly describe the scenarios in which you are using Karmada? And in what scenarios do users need to reschedule via WorkloadRebalancer?

We are an AI training and inference platform. During the daytime, Cluster A handles both training and inference tasks, while Cluster B handles only inference tasks. During the nighttime, when inference traffic is low, we want to free up resources in Cluster A for training only and consolidate inference tasks onto Cluster B. Therefore, at 11:00 PM and 7:00 AM, we need to perform a rescheduling to move the inference workloads.
> In your above "allow me to reschedule resources for the same team", what's the relationship between "me" and "team"? Does that mean you play a role like the cluster administrator of a platform, with many teams deploying apps on it, and in some cases you need to do the rescheduling for one whole team? If so, do you worry about the impact of batch rescheduling on users?

Yes, we are an AI training and inference platform where many teams deploy their applications. We would like to reschedule applications based on team labels. We try to minimize the impact of rescheduling on users: we ensure that the replica count in Cluster B does not scale to zero, and we carefully manage the transition so that there are always replicas available in Cluster B before and after the rescheduling.
> What are the current pain points for rebalancers? Compared with using `resourceSelector`, is it just a bit more troublesome to write the CR of WorkloadRebalancer in the array-based way, or, in your cluster administrator role, do you actually not perceive the users' resources?

The pain point is that there are too many applications within a team. If we specify the resources to be rescheduled using the current array-based approach in WorkloadRebalancer, the `.spec.workloads` list will be very long. Moreover, applications in a team may be added or removed. If we could use labels to select resources, it would be much more convenient.

> By the way, could you please let us know which company you are from (we'd like to confirm if it's already listed among our adopters)?

Thank you very much for your support. We are Zhuhai Kingsoft Office Software Co., Ltd.
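For context, the array-based CR being discussed would look roughly like this (a sketch based on our reading of the WorkloadRebalancer API released in v1.10; the resource names and the number of entries are invented for illustration):

```yaml
apiVersion: apps.karmada.io/v1alpha1
kind: WorkloadRebalancer
metadata:
  name: team-1005-rebalancer   # hypothetical name
spec:
  # Every application of the team must be enumerated by name,
  # and the list must be maintained as apps are added or removed.
  workloads:
  - apiVersion: apps/v1
    kind: Deployment
    name: app-0
    namespace: ai-app
  - apiVersion: apps/v1
    kind: Deployment
    name: app-1
    namespace: ai-app
  # ... potentially dozens more entries per team
```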
> Hi @weidalin, thank you for your interest in this feature.
>
> Actually, this feature was originally designed to support batch selection of resources through `resourceSelector`. However, during later discussion we worried that batch rescheduling is a dangerous operation that users might not dare to use, or could easily misuse, so we deliberately removed this entrance. Simply put, there was a lack of user demand at that time. If more real users call for `resourceSelector`, we are very happy to consider supporting this ability.
Hi @chaosi-zju, thank you very much for your reply. Could you please let us know if there is a branch of the WorkloadRebalancer that already supports ResourceSelectors? If so, could you share it with us? If not, what are your future development plans regarding this feature?
Additionally, when you mentioned "we were worried that batch rescheduling was a dangerous operation," could you clarify what kind of scenarios you were referring to?
> Could you please let us know if there is a branch of the WorkloadRebalancer that already supports ResourceSelectors?
Sorry, not yet.
> If not, what are your future development plans regarding this feature?

The scenario you provided is very interesting. I think we will decide as soon as possible whether to support this ability.
> Could you clarify what kind of scenarios you were referring to?

I mean the replica distribution changing dramatically.
We previously mainly supported Deployments, which are a bit different from inference/training tasks. In that case, the user does not want pods that are running fine to undergo major changes. However, the `labelSelector` way can involve many pods, which brings the risk of large changes to the overall replica distribution, and can therefore shock the system to some extent.
Hi @weidalin @so2bin, thanks for the feedback and input. I think we can iterate the WorkloadRebalancer based on your scenario.
The WorkloadRebalancer just takes responsibility for triggering the re-schedule. Given that the impact of re-scheduling might be significant, the replica distribution might vary greatly; for example, all replicas could be migrated away from the origin cluster, and service quality might be a challenge in that case. You need to make sure the load balancer across clusters is well configured, so it needs to be used very carefully. In addition, we didn't have a use case that expected batch re-scheduling before; that's the reason why we don't support it.
We might need to ask a few more questions to better understand your use case.
1. As @so2bin mentioned above, you decided to use CPP with the DynamicWeight strategy. Can you share a copy of the CPP you are using?
2. Is there a load balancer across clusters in your case? Since you are re-scheduling inference workloads, how do you handle the traffic upon replicas in two clusters?
@chaosi-zju I hope the following words help you understand our scenario, thanks.

About the CPP: below are our current design, which uses the `StaticWeight` strategy, and the next-version design, which will utilize `WorkloadRebalancer` and the `DynamicWeight` strategy.

About the traffic load balancer: we have not used the `MultiClusterService` CRD of Karmada, but use Istio's Multi-Primary Multi-Network mesh to implement traffic load balancing between the clusters. So each Deployment (AI inference service) has an Istio `DestinationRule` for load balancing, like this:
There is a custom controller that synchronizes the clusters' traffic percentages from the `Endpoints` in the worker clusters.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: starcode-triton-server-v3
  namespace: ai-app
spec:
  host: starcode-triton-server-v3.ai-app.svc.cluster.local
  trafficPolicy:
    connectionPool: ...
    loadBalancer:
      localityLbSetting:
        distribute:
        - from: <Cluster-Name-A>/*
          to:
            <Cluster-Name-A>/*: 80
            <Cluster-Name-E>/*: 20
            ...
        - from: <Cluster-Name-E>/*
          to:
            <Cluster-Name-A>/*: 80
            <Cluster-Name-E>/*: 20
            ...
        - ... # more clusters can be extended here
    outlierDetection: ...
```
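The weight computation such a controller performs could be sketched as follows (an illustrative Go snippet; the real controller mentioned above is not public, and `computeDistribute` is a hypothetical helper that turns per-cluster ready-endpoint counts into the integer percentages used in `localityLbSetting.distribute`):

```go
package main

import "fmt"

// computeDistribute converts per-cluster ready-endpoint counts into integer
// traffic percentages that sum to 100, giving any rounding leftover to the
// cluster with the largest remainder.
func computeDistribute(ready map[string]int) map[string]int {
	total := 0
	for _, n := range ready {
		total += n
	}
	out := make(map[string]int, len(ready))
	if total == 0 {
		return out
	}
	assigned := 0
	bestCluster, bestRem := "", -1
	for c, n := range ready {
		out[c] = n * 100 / total // floor of the exact share
		assigned += out[c]
		if rem := n * 100 % total; rem > bestRem {
			bestRem, bestCluster = rem, c
		}
	}
	out[bestCluster] += 100 - assigned // hand the leftover to the largest remainder
	return out
}

func main() {
	fmt.Println(computeDistribute(map[string]int{"cluster-a": 8, "cluster-e": 2}))
	// prints map[cluster-a:80 cluster-e:20]
}
```

The controller would then patch the computed percentages into each app's `DestinationRule` whenever the worker clusters' `Endpoints` change.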
About the `StaticWeight` CPP: each team uses a CPP to distribute its pods to worker clusters. A demo CPP looks like this:
```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: ClusterPropagationPolicy
metadata:
  labels:
    ...
  name: team-1005-app-0-cpp-default-day-time
spec:
  conflictResolution: Abort
  placement:
    #...
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
      weightPreference:
        staticWeightList:
        - targetCluster:
            clusterNames:
            - <Cluster-Name-A>
          weight: 80
        - targetCluster:
            clusterNames:
            - <Cluster-Name-E>
          weight: 20
        - ... # more clusters
  preemption: Never
  priority: 30
  resourceSelectors:
  - apiVersion: v1
    kind: ConfigMap # we use a ConfigMap rather than a CRD
    labelSelector:
      matchLabels:
        atms-app/appid: "10471"
        atms-app/teamid: "1005"
  ...
```
How we do computational bursts: we have two timers, cron-triggered at 07:00 and 23:00 every day, which switch between the day-time CPP and the night-time CPP by deleting and creating the CPP (here we do not use CPP preemption).
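A timer-based switch like the one described above could be implemented with a plain Kubernetes CronJob, for example (a sketch only; the resource names, image, ConfigMap, and ServiceAccount below are hypothetical, and the ServiceAccount needs RBAC on `clusterpropagationpolicies.policy.karmada.io`):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cpp-night-switch       # a sibling job would run the reverse switch at "0 7 * * *"
spec:
  schedule: "0 23 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cpp-switcher
          restartPolicy: OnFailure
          containers:
          - name: switch
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - |
              # delete the day-time CPP and apply the night-time one
              kubectl delete clusterpropagationpolicy team-1005-app-0-cpp-default-day-time --ignore-not-found
              kubectl apply -f /config/team-1005-app-0-cpp-default-night-time.yaml
            volumeMounts:
            - name: cpp-config
              mountPath: /config
          volumes:
          - name: cpp-config
            configMap:
              name: night-time-cpps
```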
Yes, as you mentioned, there is a traffic risk if all replicas are migrated from one cluster to another cluster. However, in our scenario, the CPPs are managed by the platform manager, so we will take care of it and avoid this problem.
About the next-version design: we plan to use the `DynamicWeight` strategy for this purpose. To achieve this, we need two things: a `DynamicWeight` CPP, and a batch rebalance triggered by the daily timer. Fortunately, the Karmada v1.10 release includes the `WorkloadRebalancer` CRD, which would let us achieve this if `WorkloadRebalancer` supported a batch-rebalance feature.

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: ClusterPropagationPolicy
metadata:
  labels:
    ...
  name: team-1005-app-0-cpp-default-day-time
spec:
  conflictResolution: Abort
  placement:
    #...
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
      weightPreference:
        # Here we do not use `AvailableReplicas` but extend a new DynamicWeightFactor
        # for the custom-estimator use case. In the future, we may push a PR to explain it.
        dynamicWeight: EstimatorCalculatedReplicas
  preemption: Never
  priority: 30
  resourceSelectors:
  - apiVersion: v1
    kind: ConfigMap # we use a ConfigMap rather than a CRD
    labelSelector:
      matchLabels:
        atms-app/appid: "10471"
        atms-app/teamid: "1005"
  ...
```
The previous design has some drawbacks. Each deployment's propagation always uses the specified team CPP with static weight distribution. This can lead to a pending problem where a pod is pending in Cluster-A, but Cluster-B has enough available GPUs.
That is because the `StaticWeight` mode doesn't consider available resources during replica assignment.

Coincidentally, we are discussing whether to enhance `StaticWeight` to take available resources into account at https://github.com/karmada-io/karmada/issues/4805#issuecomment-2283607217. May I ask your opinion on this?
> How we do computational bursts: we have two timers, cron-triggered at 07:00 and 23:00 every day, which switch between the day-time CPP and the night-time CPP by deleting and creating the CPP (here we do not use CPP preemption).

Changing the CPP also triggers a re-schedule, so you don't need `WorkloadRebalancer` for that. In your next plan you are going to use `DynamicWeight`; changing the CPP won't trigger a re-schedule then, so you need a way to force a reschedule. Am I right?
Hi @so2bin @weidalin, thank you very much for the valuable practice described above. I am amazed at the depth of your exploration and the clear, detailed reply 👍.

I think I understand why you have this appeal. But I'm interested in one question: as you said above, one sentence is "each team use a CPP to distribute the pods to worker clusters", and another is "the CPPs are managed by the platform manager". I wonder who exactly is in charge of this CPP. Do you mean the CPP is declared by the user, but the apply/delete operation is handled by the platform manager?
> enhance the StaticWeight to let it take available resources

It seems they not only want to take available resources into account, but also want to use the custom estimator to get more accurate results for their scenarios.
> Changing the CPP also triggers a re-schedule, so you don't need `WorkloadRebalancer` for that. In your next plan you are going to use `DynamicWeight`; changing the CPP won't trigger a re-schedule then, so you need a way to force a reschedule. Am I right?
@RainbowMango Apologies for the delayed response. We have done further verification by performing a delete-and-create operation on a CPP of `DynamicWeight` type. The result shows that this operation does not trigger resource re-scheduling, because the `placement` remains unchanged:
https://github.com/karmada-io/karmada/blob/ba360c9aa7e24ffb1fffdcb14dd2b828a74fbafa/pkg/scheduler/helper.go#L51
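The behavior at the linked line boils down to a placement comparison: a binding is re-scheduled only when the placement recorded on it differs from the placement computed from the new policy. A minimal illustrative sketch (the `Placement` struct below is a simplified stand-in for Karmada's real type, not actual Karmada code):

```go
package main

import (
	"fmt"
	"reflect"
)

// Placement is a simplified stand-in for the placement spec recorded on a
// ResourceBinding and the one computed from a (Cluster)PropagationPolicy.
type Placement struct {
	ReplicaSchedulingType string
	DynamicWeight         string
}

// placementChanged mirrors the idea behind the linked helper: trigger a
// re-schedule only when the two placements differ.
func placementChanged(recorded, computed Placement) bool {
	return !reflect.DeepEqual(recorded, computed)
}

func main() {
	recorded := Placement{ReplicaSchedulingType: "Divided", DynamicWeight: "EstimatorCalculatedReplicas"}
	computed := Placement{ReplicaSchedulingType: "Divided", DynamicWeight: "EstimatorCalculatedReplicas"}
	// Deleting and re-creating a CPP with an identical placement yields the
	// same computed placement, so no re-schedule is triggered.
	fmt.Println(placementChanged(recorded, computed)) // prints false
}
```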
> I think I understand why you have this appeal. But I'm interested in one question: as you said above, one sentence is "each team use a CPP to distribute the pods to worker clusters", and another is "the CPPs are managed by the platform manager". I wonder who exactly is in charge of this CPP. Do you mean the CPP is declared by the user, but the apply/delete operation is handled by the platform manager?
@chaosi-zju We are platform maintainers and administrators, and we have full operational permission over team CPPs. These CPPs are created and managed by us. For platform users, the creation and changes of CPPs are seamless and transparent. Therefore, we also hope that the process of cluster-level computational burst scheduling is seamless and transparent to platform users.
> > enhance the StaticWeight to let it take available resources
>
> It seems they not only want to take available resources into account, but also want to use the custom estimator to get more accurate results for their scenarios.
Yes, we need to take into account a fine-grained team-level resource distribution from our custom-estimator server.
> We are platform maintainers and administrators, and we have full operational permission over team CPPs. These CPPs are created and managed by us. For platform users, the creation and changes of CPPs are seamless and transparent.
I guess each team probably has multiple applications on your platform, just out of curiosity, do you manage one CPP per team or one CPP per application?
> > We are platform maintainers and administrators, and we have full operational permission over team CPPs. These CPPs are created and managed by us. For platform users, the creation and changes of CPPs are seamless and transparent.
>
> I guess each team probably has multiple applications on your platform, just out of curiosity, do you manage one CPP per team or one CPP per application?
@RainbowMango By default, each team has one team-level CPP, and a few exceptional apps use app-level CPPs, which have higher priority.
Hi @so2bin, thank you for your above explanation~
> There is a traffic risk if all replicas are migrated from one cluster to another. However, in our scenario, the CPPs are managed by the platform manager, so we will take care of it and avoid this problem.

A new question comes up: as the platform manager, do you have any concrete way to "take care of it and avoid this problem"? And if you don't know much about the specific team apps, how do you judge that a team's apps are not affected?

If your method is common or inspiring for most other people, I think we don't need to be so concerned and can just start to push the feature you mentioned forward.
> That is because the `StaticWeight` mode doesn't consider available resources during replica assignment.
>
> Coincidentally, we are discussing whether to enhance `StaticWeight` to take available resources into account at [#4805 (comment)](https://github.com/karmada-io/karmada/issues/4805#issuecomment-2283607217). May I ask your opinion on this?
After carefully reviewing [#4805 (comment)](https://github.com/karmada-io/karmada/issues/4805#issuecomment-2283607217), I believe the two main issues and optimizations regarding static weight are as described below. Is my understanding correct?
> @whitewindmills Whether it's the static weight strategy or this AverageReplicas strategy, they are both ways of assigning replicas. Currently, the static weight strategy has two main "disadvantages":
>
> 1. It doesn't comply with the spread constraints.
> 2. It doesn't take the available replicas in the cluster into account.
>
> @RainbowMango's view on this feature is that we could try to enhance the existing staticWeight. The possible improvements could be:
>
> 1. Make static weight consider spread constraints when selecting target clusters.
> 2. Make static weight take available resources into account (i.e., if any cluster lacks sufficient resources, the scheduling fails).
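The second proposed improvement can be sketched as follows (an illustrative Go snippet under our own reading of the proposal; `staticDivide` is a hypothetical helper, not Karmada's actual implementation):

```go
package main

import (
	"fmt"
	"sort"
)

// staticDivide splits total replicas across clusters in proportion to their
// static weights (largest-remainder rounding), then applies the proposed
// enhancement: if any cluster's share exceeds its available capacity, the
// whole scheduling attempt fails instead of leaving Pods pending.
func staticDivide(replicas int, weights, available map[string]int) (map[string]int, error) {
	names := make([]string, 0, len(weights))
	totalWeight := 0
	for c, w := range weights {
		names = append(names, c)
		totalWeight += w
	}
	sort.Strings(names) // deterministic order for leftover distribution
	out := map[string]int{}
	assigned := 0
	for _, c := range names {
		out[c] = replicas * weights[c] / totalWeight
		assigned += out[c]
	}
	for i := 0; assigned < replicas; i++ { // hand out rounding leftovers
		out[names[i%len(names)]]++
		assigned++
	}
	for _, c := range names {
		if out[c] > available[c] {
			return nil, fmt.Errorf("cluster %s needs %d replicas but only %d available", c, out[c], available[c])
		}
	}
	return out, nil
}

func main() {
	got, err := staticDivide(10, map[string]int{"A": 80, "E": 20}, map[string]int{"A": 8, "E": 2})
	fmt.Println(got, err)
}
```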
We also encounter these problems when using the static weight strategy. I agree that making static weight take available resources into account (failing the scheduling if any cluster lacks sufficient resources) is a good way to avoid pending Pods, but simply letting the scheduling fail may not always be a good approach.
I would like to explain the impact of this change based on the actual usage scenarios of our AI applications. I hope it can help you evaluate this static weight change.
**Tidal cluster weight switching scenario.** Insufficient resources in any cluster will cause the scheduling to fail.

- Before the static weight optimization: some Pods in clusters with insufficient resources enter Pending, while clusters with sufficient resources successfully complete the tidal cluster weight switch.
- After the static weight optimization: because one cluster has insufficient resources, the tidal cluster weight switch is never completed.

In this scenario, the static weight optimization helps us avoid the Pending state, which is indeed an improvement. But it cannot satisfy our need to switch the weights of the tidal clusters.
In the tidal cluster weight switching scenario, it is more suitable for us to use dynamic weight and WorkloadRebalancer in combination. So we hope that WorkloadRebalancer can support LabelSelector.
**Scaling application replicas scenario.** If cluster A has insufficient resources while cluster B has sufficient resources, the replicas destined for cluster B cannot be scheduled either (because any cluster with insufficient resources fails the whole schedule). This is quite different from our current usage habits.

- Before the static weight optimization: cluster A has insufficient resources and its replicas enter Pending, but cluster B has sufficient resources to scale, so emergency scaling needs can still be met.
- After the static weight optimization: cluster B has sufficient resources, but cluster A does not, which causes the scheduling for both A and B to fail. If I understand the optimization correctly, the scaling will never succeed unless the StaticWeight cluster ratio in the CPP is changed. In this scenario, scaling requires modifying both the replica count on the resource template and the cluster weight ratio on the CPP, which introduces complexity into emergency scaling.
Therefore, I think it is simpler and more efficient to maintain the current static weight behavior (not taking the available replicas in the cluster into account).
I hope this feedback is helpful to you.
Thank you very much for the two scenarios you provided!
> Cluster B has sufficient resources, but cluster A does not, which causes the scheduling for both A and B to fail. If I understand the optimization correctly, the scaling will never succeed unless the StaticWeight cluster ratio in the CPP is changed.
I don't understand this sentence well, hi @whitewindmills, is this consistent with what you proposed?
I can't say for sure yet since it's still pending, but I'd like to share my opinion: usually we are mainly concerned about whether the selected clusters can accommodate the replicas to be allocated. Whether the final result exactly matches the proportions of the static weight setting is not so important; we are likely to make an approximate assignment.
Just put this issue into the Karmada backlog; I think we can discuss it at one of the community meetings to see how to move this forward.

@weidalin @so2bin I'm not sure if the time slot suits you well. Please find a time that works for you and add an agenda item to the meeting notes. (Note: by joining the Google group you will be able to edit the meeting notes. Join the mailing list: https://groups.google.com/forum/#!forum/karmada)
> Just put this issue into the Karmada backlog; I think we can discuss it at one of the community meetings to see how to move this forward.
>
> @weidalin @so2bin I'm not sure if the time slot suits you well. Please find a time that works for you and add an agenda item to the meeting notes.
Hello, we have added an agenda to the Meeting Notes of the 2024-09-24 meeting.
Please provide an in-depth description of the question you have:
The current WorkloadRebalancer (#4698) provides a great entry point for rescheduling workloads, allowing the use of `.spec.workloads` to specify the resources to reschedule in an array-based way. I would like to ask whether there are any plans for WorkloadRebalancer to support `resourceSelectors`, similar to what PropagationPolicy supports?
For example:
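A sketch of what such a CR might look like (the `resourceSelectors` field below is hypothetical, modeled on PropagationPolicy's selector; it does not exist in the current API, and the names and labels are invented):

```yaml
apiVersion: apps.karmada.io/v1alpha1
kind: WorkloadRebalancer
metadata:
  name: team-1005-rebalancer
spec:
  # hypothetical field, mirroring PropagationPolicy's resourceSelectors
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    labelSelector:
      matchLabels:
        atms-app/teamid: "1005"
```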
This would allow me to reschedule resources for the same team based on labels, making workload rescheduling more efficient.
What do you think about this question?:
I believe this feature would make the WorkloadRebalancer even more flexible, allowing dynamic resource selection through label selectors, similar to what PropagationPolicy offers.
Environment:
- Karmada version: v1.10.4
- Kubernetes version: v1.25.6
@chaosi-zju