If there are any plans for WorkloadRebalancer to support resourceSelectors, similar to what is supported in PropagationPolicy

karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration

https://karmada.io

Apache License 2.0

4.44k stars 880 forks source link

If there are any plans for WorkloadRebalancer to support resourceSelectors, similar to what is supported in PropagationPolicy #5527

Open weidalin opened 1 month ago

weidalin commented 1 month ago

Please provide an in-depth description of the question you have:

The current WorkloadRebalancer (#4698) provides a great entry point for rescheduling workloads, allowing the use of .spec.workloads to specify the resources for rescheduling and supporting array-based scheduling. I would like to ask if there are any plans for WorkloadRebalancer to support resourceSelectors, similar to what is supported in PropagationPolicy?

For example:

apiVersion: apps.karmada.io/v1alpha1
kind: WorkloadRebalancer
metadata:
  name: demo
spec:
  workloads:
  - apiVersion: v1
    kind: ConfigMap
    labelSelector:
      matchLabels:
        atms-app/cluster-policy: karmada
        atms-app/creator: ATMS
        atms-app/teamid: "1005"

This would allow me to reschedule resources for the same team based on labels, making workload rescheduling more efficient.

What do you think about this question?:

I believe this feature would make the WorkloadRebalancer even more flexible, allowing dynamic resource selection through label selectors, similar to what PropagationPolicy offers.

Environment:

Karmada version: v1.10.4 Kubernetes version: v1.25.6

@chaosi-zju

chaosi-zju commented 1 month ago

Hi @weidalin, thank you for your interest in this feature.

Actually, this feature was originally designed to support batch selecting of resources through resourceSelector. However, during the later discussion, we were worried that batch rescheduling was a dangerous operation that users maybe not dare to use or could easily misuse, so we deliberately cancelled this entrance.

Simply put, there was a lack of user's support at that time. If there are more real users calling for resourceSelector, we are very happy to consider supporting this ability.

chaosi-zju commented 1 month ago

I'm thinking of moving forward with the plan you mentioned, could you provide me with some information.

Could you briefly describe the scenarios in which you are using Karmada? And what scenarios do users need to use the rescheduling by WorkloadRebalancer ?
In your above "allow me to reschedule resources for the same team", what's the relationship between "me" and "team"? Is that mean you are the role like cluster adminstrator of a platform, and many team deploy apps on your platform, in some case you need to do the rescheduling for one whole team? If so, do you worry about the impact of batch rescheduling on users?
What are the current pain points for rebalancers? Compared with using resourceSelector, is it just a bit more troublesome to write the CR of WorkloadRebalancer using array-based way, or as your cluster adminstrator role, you actually not perceive the user's resources?
By the way, could you please let us know which company you are from (we'd like to confirm if it's already listed among our adopters)

weidalin commented 1 month ago

I'm thinking of moving forward with the plan you mentioned, could you provide me with some information.

Could you briefly describe the scenarios in which you are using Karmada? And what scenarios do users need to use the rescheduling by WorkloadRebalancer ? We are an AI training and inference platform. During the daytime, Cluster A handles both training and inference tasks, while Cluster B handles only inference tasks. During the nighttime, when inference traffic is low, we want to free up resources in Cluster A for training only, and consolidate inference tasks to Cluster B. Therefore, at 11:00 PM and 7:00 AM, we need to perform a rescheduling to move the inference workloads.

In your above "allow me to reschedule resources for the same team", what's the relationship between "me" and "team"? Is that mean you are the role like cluster adminstrator of a platform, and many team deploy apps on your platform, in some case you need to do the rescheduling for one whole team? If so, do you worry about the impact of batch rescheduling on users? Yes, we are an AI training and inference platform where many teams deploy their applications. We would like to reschedule applications based on team labels. We try to minimize the impact of rescheduling on users, as we ensure that the replica count in Cluster B does not scale to zero. We carefully manage the transition so that there are always replicas available in Cluster B before and after the rescheduling.

What are the current pain points for rebalancers? Compared with using resourceSelector, is it just a bit more troublesome to write the CR of WorkloadRebalancer using array-based way, or as your cluster adminstrator role, you actually not perceive the user's resources? The pain point is that there are too many applications within a team. If we specify the resources to be rescheduled using the current array-based approach in WorkloadRebalancer, the .spec.workloads list will be very long. Moreover, applications in a team may be added or removed. If we can use labels to select resources, it would be much more convenient.

By the way, could you please let us know which company you are from (we'd like to confirm if it's already listed among our adopters) Thank you very much for your support. We are Zhuhai Kingsoft Office Software Co., Ltd.

weidalin commented 1 month ago

Hi @weidalin, thank you for your interest in this feature.

Actually, this feature was originally designed to support batch selecting of resources through resourceSelector. However, during the later discussion, we were worried that batch rescheduling was a dangerous operation that users maybe not dare to use or could easily misuse, so we deliberately cancelled this entrance.

Simply put, there was a lack of user's support at that time. If there are more real users calling for resourceSelector, we are very happy to consider supporting this ability.

Hi @chaosi-zju thank you very much for your reply. Could you please let us know if there is a branch of the WorkloadRebalancer that already supports ResourceSelectors? If so, could you share it with us? If not, what are your future development plans regarding this feature?

Additionally, when you mentioned "we were worried that batch rescheduling was a dangerous operation," could you clarify what kind of scenarios you were referring to?

chaosi-zju commented 1 month ago

Could you please let us know if there is a branch of the WorkloadRebalancer that already supports ResourceSelectors?

Sorry, not yet.

If not, what are your future development plans regarding this feature?

The scene you provided is very interesting. I think we will decide as soon as possible whether to support this ability or not.

could you clarify what kind of scenarios you were referring to?

I mean replicas distribution changes dramatically.

We previously mainly supported deployment, which is a bit different from inference training tasks. In that case, the user does not want the pods which are running fine to undergo major changes, however, labelselector way can involve in many pods, which brings risk of great changes to the overall replicas distribution, thereby it will shock the system to some extent.

so2bin commented 1 month ago

Our scenario involves achieving real-time training and inference elasticity or computational bursts between day time and night time, so we need to implement batch cross-cluster re-scheduling for inference deployments.
Now we have decided to use the CPP with DynamicWeight strategy to distribute the replicas to worker clusters.
The risks you mentioned are present, and we need to exercise some control during the actual use of batch-rebalance to avoid system issues caused by large-scale simultaneous scheduling of pods. @chaosi-zju

RainbowMango commented 1 month ago

Hi @weidalin @so2bin, thanks for the feedback and input. I think we can iterate the WorkloadRebalancer based on your scenario.

The WorkloadRebalancer just takes responsibility for triggering the re-schedule, given the impact of re-scheduling might be significant, the replica distribution might vary greatly, for example, all replicas could be migrated from the origin cluster, the service quality might be a challenge in that case, you need to make sure the load-balancer across clusters well-configured, so it needs to be used very carefully. In addition, we don't have such a use case that expects batch re-schedule before, that's the reason why we don't support that.

We might need to ask a few more questions to better understand your use case.

As @so2bin mentioned above, you decided to use CPP with DynamicWeight strategy, can you share a copy of CPP you are using?

Is there a load balancer across clusters in your case? Since you are re-scheduling inference workloads, how do you handle the traffic upon replicas in two clusters?

so2bin commented 1 month ago

@chaosi-zju I hope the following words may help you to understand our scenario, thanks.

Backgroud

We are using Karmada as a multi-site high availability solution for online multi-cluster AI inference platform. Now we have 3 GPU clusters online, and there are many development temas deploy AI services on our platform.
I will now explain our platform's 2-phase design from the perspective of two clusters: the current online version which uses the StaticWeight strategy, and the next version design, which will utilize WorkloadRebalancer and DynamicWeight strategy.

Current online

design

About the traffic load-balancer: we have not use the MultiClusterService CRD of Karmada, but used the istio's Multi-Primary Multi-Networks mesh network to implement the traffic load-balance between the clusters. So each deployment(AI inference service) will has a istio DestinationRule for load balance, like this:

There is a custom-controller to synchronize the clusters' traficc percent from the Endpoints in worker clusters.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: starcode-triton-server-v3
namespace: ai-app
spec:
host: starcode-triton-server-v3.ai-app.svc.cluster.local
trafficPolicy:
connectionPool:
...
loadBalancer:
localityLbSetting:
distribute:
- from: <Cluster-Name-A>/*
to:
<Cluster-Name-A>/*: 80
<Cluster-Name-E>/*: 20
...
- from: <Cluster-Name-E>/*
to:
<Cluster-Name-A>/*: 80
<Cluster-Name-E>/*: 20
...
- ... # there can extend more clusters here
outlierDetection:
...

StaticWeight CPP: each team use a CPP to distribute the pods to worker clusters, the CPP demo like this:

apiVersion: policy.karmada.io/v1alpha1
kind: ClusterPropagationPolicy
metadata:
labels:
...
name: team-1005-app-0-cpp-default-day-time
spec:
conflictResolution: Abort
placement:
#...
replicaScheduling:
  replicaDivisionPreference: Weighted
  replicaSchedulingType: Divided
  weightPreference:
    staticWeightList:
    - targetCluster:
        clusterNames:
        - <Custer-Name-A>
      weight: 80
    - targetCluster:
        clusterNames:
        - <Custer-Name-E>
      weight: 20
    - ... # more clusters
preemption: Never
priority: 30
resourceSelectors:
- apiVersion: v1
kind: ConfigMap  # we use cm but CRD
labelSelector:
  matchLabels:
    atms-app/appid: "10471"
    atms-app/teamid: "1005"
    ...

How to do computational bursts: we have two timer which cronly triggerd at 07:00, 23:00 every day, it will switch between the day-time CPP and the nigh-time CPP with delete and create CPP (here we donot use CPP preemption).

Yes, as you mentioned, there is a traffic risk if all replicas are migrated from one cluster to another cluster. However, in our scenario, the CPPs are managed by the platform manager, so we will take care of it and avoid this problem.

Next version

design

The previous design has some drawbacks. Each deployment's propagation always uses the specified team CPP with static weight distribution. This can lead to a pending problem where a pod is pending in Cluster-A, but Cluster-B has enough available GPUs.
Therefore, we plan to implement the DynamicWeight strategy for this purpose. To achieve this, we need to make two extensions:
1. We need to extend a custom karmada-estimator to collect team-level resources within clusters and use this data to assist the karmada-scheduler in calculating the ClusterWeight (replicas) between clusters.
2. we need to implement the re-scheduler with the DynamicWeight CPP for the daily timer. Fortunately, the Karmada v1.10 release includes a WorkloadRebalancer CRD, which allows us to get it if WorkloadRebalancer supports batch-rebalance feature.

So the new team's CPP may be like this:

apiVersion: policy.karmada.io/v1alpha1
kind: ClusterPropagationPolicy
metadata:
labels:
...
name: team-1005-app-0-cpp-default-day-time
spec:
conflictResolution: Abort
placement:
#...
replicaScheduling:
  replicaDivisionPreference: Weighted
  replicaSchedulingType: Divided
  weightPreference:
    # Here we do not use `AvailableReplicas` but extend a new DynamicWeightFactor for the custom-estimator use case. 
    # In the future, we may push a PR to explain it.
    dynamicWeight:  EstimatorCalculatedReplicas 
preemption: Never
priority: 30
resourceSelectors:
- apiVersion: v1
kind: ConfigMap  # we use cm but CRD
labelSelector:
  matchLabels:
    atms-app/appid: "10471"
    atms-app/teamid: "1005"
    ...

RainbowMango commented 1 month ago

The previous design has some drawbacks. Each deployment's propagation always uses the specified team CPP with static weight distribution. This can lead to a pending problem where a pod is pending in Cluster-A, but Cluster-B has enough available GPUs.

That is due to the StaticWeight mode doesn't consider available resources during the replica assignment. Coincidentally, we are discussing if we need to enhance the StaticWeight to let it take available resources at https://github.com/karmada-io/karmada/issues/4805#issuecomment-2283607217. May I ask what's your opinion on this?

How to do computational bursts: we have two timer which cronly triggerd at 07:00, 23:00 every day, it will switch between the day-time CPP and the nigh-time CPP with delete and create CPP (here we donot use CPP preemption).

Change the CPP also involves re-schedule, so you don't need WorkloadRebalancer. In your next plan, you are going to use DynamicWeight, changing the CPP won't get re-scheduled, so you need a way to force reschedule, am I right?

chaosi-zju commented 1 month ago

Hi @so2bin @weidalin , thank you very much for the above valuable practice you provided. I am amazed at the depth of your exploration and the clear and detailed reply 👍 .

I think I understand why you have this appeal. But I'm interested in a question: as you said above, one sentence is "each team use a CPP to distribute the pods to worker clusters", another is "the CPPs are managed by the platform manager", I wonder who is exactly in charge of this cpp. Do you mean the CPP is declared by user, but the operation of apply/delete is in charged by platform manager?

chaosi-zju commented 1 month ago

enhance the StaticWeight to let it take available resources

it seems they not only wants to take available resources account, but also use the custom estimator to get more accurate results for their scenarios.

so2bin commented 1 month ago

Change the CPP also involves re-schedule, so you don't need WorkloadRebalancer. In your next plan, you are going to use DynamicWeight, changing the CPP won't get re-scheduled, so you need a way to force reschedule, am I right?

@RainbowMango Apologies for the delayed response. We have conducted further verification by performing a delete and create operation on the CPP with DynamicWeight type. The result shows that this operation does not trigger resource re-scheduling, as the placement remains unchanged: https://github.com/karmada-io/karmada/blob/ba360c9aa7e24ffb1fffdcb14dd2b828a74fbafa/pkg/scheduler/helper.go#L51

so2bin commented 1 month ago

Hi @so2bin @weidalin , thank you very much for the above valuable practice you provided. I am amazed at the depth of your exploration and the clear and detailed reply 👍 .

I think I understand why you have this appeal. But I'm interested in a question: as you said above, one sentence is "each team use a CPP to distribute the pods to worker clusters", another is "the CPPs are managed by the platform manager", I wonder who is exactly in charge of this cpp. Do you mean the CPP is declared by user, but the operation of apply/delete is in charged by platform manager?

@chaosi-zju We are platform maintainers and administrators, and we have full operational permission over team CPPs. These CPPs are created and managed by us. For platform users, the creation and changes of CPPs are seamless and transparent. Therefore, we also hope that the process of cluster-level computational burst scheduling is seamless and transparent to platform users.

so2bin commented 1 month ago

enhance the StaticWeight to let it take available resources

it seems they not only wants to take available resources account, but also use the custom estimator to get more accurate results for their scenarios.

Yes, we need to take into account a fine team-level resource distribution from custom-custimator server.

RainbowMango commented 1 month ago

We are platform maintainers and administrators, and we have full operational permission over team CPPs. These CPPs are created and managed by us. For platform users, the creation and changes of CPPs are seamless and transparent.

I guess each team probably has multiple applications on your platform, just out of curiosity, do you manage one CPP per team or one CPP per application?

so2bin commented 1 month ago

We are platform maintainers and administrators, and we have full operational permission over team CPPs. These CPPs are created and managed by us. For platform users, the creation and changes of CPPs are seamless and transparent.

I guess each team probably has multiple applications on your platform, just out of curiosity, do you manage one CPP per team or one CPP per application?

@RainbowMango By default, each team has one team-level CPP, and a few exceptional app use app-level CPPs, which have higher priority.

chaosi-zju commented 1 month ago

Hi @so2bin, thank you for your above explanation~

there is a traffic risk if all replicas are migrated from one cluster to another cluster. However, in our scenario, the CPPs are managed by the platform manager, so we will take care of it and avoid this problem.

A new question comes, as the role of platform manager, is there any detail way to "take care of it and avoid this problem"? And, if you don't know much about the specific team apps, how do you judge that the team's apps are not affected?

If your method is common or inspiring for most other people, I think we don't need to concern so much and just start to push forward the feature you mentioned.

weidalin commented 1 month ago

That is due to the StaticWeight mode doesn't consider available resources during the replica assignment. Coincidentally, we are discussing if we need to enhance the StaticWeight to let it take available resources at [[#4805 (comment)](https://github.com/karmada-io/karmada/issues/4805#issuecomment-2283607217)](https://github.com/karmada-io/karmada/issues/4805#issuecomment-2283607217). May I ask what's your opinion on this?

After carefully reviewing [[#4805 (comment)](https://github.com/karmada-io/karmada/issues/4805#issuecomment-2283607217)](https://github.com/karmada-io/karmada/issues/4805#issuecomment-2283607217), I believe the two main issues and optimizations regarding static weight are as described below. Is my understanding correct?

@whitewindmills Whether it's the static weight strategy or this AverageReplicas strategy, they are both ways of assigning replicas. Currently, the static weights strategy has two main "disadvantages":

1 It doesn't comply with the spread constraints. 2 It doesn't take the available replicas in the cluster into account.

@RainbowMango's view on this feature is that we could try to enhance the existing staticWeight. The possible improvements could be:

1 Make static weight consider spread constraints when selecting target clusters. 2 Make static weight take available resources into account (i.e., if any cluster lacks sufficient resources, the scheduling fails).

We also encounter these problems when using a static weighting strategy. I agree that making static weight take available resources into account (if any cluster lacks sufficient resources, the scheduling fails) is a good approach to avoid Pods Pending, but simply letting scheduling fail may not be a good approach.

I would like to explain the impact of this change based on the actual usage scenarios of our AI applications. I hope it can help you evaluate this static weight change.

Tidal Cluster Weight Switching Scenario Insufficient resources in any cluster will cause this scheduling to fail.
- Before the static weight optimization : Some Pods in clusters with insufficient resources will enter Pending, while clusters with sufficient resources will successfully complete the Tidal cluster weight switch.
- After the static weight optimization : Because any cluster has insufficient resources, the tidal cluster weight switch has not been completed.

In this scenario, static weight optimization can help us avoid entering the Pending state, which is indeed an optimization method. But it cannot satisfy our need to switch the weight of the tidal cluster.

In the tidal cluster weight switching scenario, it is more suitable for us to use dynamic weight and WorkloadRebalancer in combination. So we hope that WorkloadRebalancer can support LabelSelector.

Scaling Application Replicas Scenario If cluster A has insufficient resources while cluster B has sufficient resources, the replicas of cluster B cannot be scheduled because (if any cluster with insufficient resources, fails the schedule). This is quite different from our current usage habits.
- Before the static weight optimization : Cluster A has insufficient resources, and the replica of Cluster A enters the Pending state. However, Cluster B has sufficient resources to scaling, it can meet the emergency scaling needs.
- After the static weight optimization : Cluster B has sufficient resources, but cluster A has insufficient resources, which causes the scheduling of clusters A and B to fail. According to the above optimization, if I understand correctly, the scaling will never succeed unless the StaticWeight cluster ratio in CPP is changed. In this scenario, scaling requires modifying both the number of replicas on the resource template and the cluster weight ratio on the CPP. This introduces complexity into emergency scaling scenarios.

Therefore, I think it is simpler and more efficient to maintain the current static weight behavior (doesn't take the available replicas in the cluster into account).

I hope this feedback is helpful to you.

chaosi-zju commented 1 month ago

thank you very much for your two provided scenarios!

Cluster B has sufficient resources, but cluster A has insufficient resources, which causes the scheduling of clusters A and B to fail. According to the above optimization, if I understand correctly, the scaling will never succeed unless the StaticWeight cluster ratio in CPP is changed.

I don't understand this sentence well, hi @whitewindmills, is this consistent with what you proposed?

whitewindmills commented 1 month ago

I can't say for sure yet cause it's pending now, but I'd like to share my options. usually we are mainly concerned about whether the selected cluster can accommodate the replicas to be allocated. as for whether the final result is exactly in line with the proportion of the static weight setting, it is not so important. we are likely to make an approximate assignment.

RainbowMango commented 1 month ago

Just put this issue into the Karmada backlog, I think we can discuss it at one of the Community meetings to see how to move this forward.

@weidalin @so2bin I'm not sure if the time slot meets you well. Please find a time that is suitable for you and add an agenda to the Meeting Notes. (Note: By joining the google groups you will be able to edit the meeting notes. Join google group mailing list: https://groups.google.com/forum/#!forum/karmada)

weidalin commented 1 month ago

Just put this issue into the Karmada backlog, I think we can discuss it at one of the Community meetings to see how to move this forward.

@weidalin @so2bin I'm not sure if the time slot meets you well. Please find a time that is suitable for you and add an agenda to the Meeting Notes. (Note: By joining the google groups you will be able to edit the meeting notes. Join google group mailing list: https://groups.google.com/forum/#!forum/karmada)

Hello, we have added an agenda to the Meeting Notes of the 2024-09-24 meeting.