karmada-io / karmada


Are there any plans for WorkloadRebalancer to support resourceSelectors, similar to what is supported in PropagationPolicy? #5527

Open weidalin opened 1 month ago

weidalin commented 1 month ago

Please provide an in-depth description of the question you have:

The current WorkloadRebalancer (#4698) provides a great entry point for rescheduling workloads, allowing the resources to be rescheduled to be specified through the array-based .spec.workloads field. I would like to ask whether there are any plans for WorkloadRebalancer to support resourceSelectors, similar to what is supported in PropagationPolicy?

For example:

apiVersion: apps.karmada.io/v1alpha1
kind: WorkloadRebalancer
metadata:
  name: demo
spec:
  workloads:
  - apiVersion: v1
    kind: ConfigMap
    labelSelector:
      matchLabels:
        atms-app/cluster-policy: karmada
        atms-app/creator: ATMS
        atms-app/teamid: "1005"

This would allow me to reschedule resources for the same team based on labels, making workload rescheduling more efficient.
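
For comparison, the array-based form that WorkloadRebalancer supports today looks roughly like the sketch below (the workload names are made up for illustration); every workload has to be listed one by one:

apiVersion: apps.karmada.io/v1alpha1
kind: WorkloadRebalancer
metadata:
  name: demo
spec:
  workloads:
  - apiVersion: apps/v1
    kind: Deployment
    name: team-1005-app-a        # hypothetical workload name
    namespace: default
  - apiVersion: apps/v1
    kind: Deployment
    name: team-1005-app-b        # hypothetical workload name
    namespace: default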

What do you think about this question?:

I believe this feature would make the WorkloadRebalancer even more flexible, allowing dynamic resource selection through label selectors, similar to what PropagationPolicy offers.

Environment:

Karmada version: v1.10.4
Kubernetes version: v1.25.6

@chaosi-zju

chaosi-zju commented 1 month ago

Hi @weidalin, thank you for your interest in this feature.

Actually, this feature was originally designed to support batch selection of resources through resourceSelector. However, during later discussion, we were worried that batch rescheduling is a dangerous operation that users might not dare to use, or could easily misuse, so we deliberately dropped this entry point.

Simply put, there was a lack of user demand at that time. If more real users call for resourceSelector, we are very happy to consider supporting this capability.

chaosi-zju commented 1 month ago

I'm thinking of moving forward with the plan you mentioned. Could you provide me with some information?

weidalin commented 1 month ago

I'm thinking of moving forward with the plan you mentioned. Could you provide me with some information?

  • Could you briefly describe the scenarios in which you are using Karmada? And in what scenarios do users need rescheduling via WorkloadRebalancer?
    We are an AI training and inference platform. During the daytime, Cluster A handles both training and inference tasks, while Cluster B handles only inference tasks. During the nighttime, when inference traffic is low, we want to free up resources in Cluster A for training only and consolidate inference tasks onto Cluster B. Therefore, at 11:00 PM and 7:00 AM, we need to perform a rescheduling to move the inference workloads.

  • In your above "allow me to reschedule resources for the same team", what's the relationship between "me" and "team"? Do you mean you play a role like the cluster administrator of a platform, many teams deploy apps on your platform, and in some cases you need to reschedule a whole team? If so, do you worry about the impact of batch rescheduling on users?
    Yes, we are an AI training and inference platform where many teams deploy their applications. We would like to reschedule applications based on team labels. We try to minimize the impact of rescheduling on users by ensuring that the replica count in Cluster B does not scale to zero; we carefully manage the transition so that replicas are always available in Cluster B before and after the rescheduling.

  • What are the current pain points with the rebalancer? Compared with using resourceSelector, is it just a bit more troublesome to write the WorkloadRebalancer CR in the array-based way, or, in your role as cluster administrator, do you actually not have visibility into users' resources?
    The pain point is that there are too many applications within a team. If we specify the resources to be rescheduled using the current array-based approach in WorkloadRebalancer, the .spec.workloads list becomes very long. Moreover, applications in a team may be added or removed. If we could use labels to select resources, it would be much more convenient (see the label sketch after this list).

  • By the way, could you please let us know which company you are from? (We'd like to confirm whether it's already listed among our adopters.)
    Thank you very much for your support. We are Zhuhai Kingsoft Office Software Co., Ltd.
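
For reference, the workloads deployed for a team on our platform carry team labels roughly like the following; the Deployment below is hypothetical and only illustrates the label convention the proposed labelSelector would match:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: team-1005-inference-demo   # hypothetical name
  labels:
    atms-app/cluster-policy: karmada
    atms-app/creator: ATMS
    atms-app/teamid: "1005"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: team-1005-inference-demo
  template:
    metadata:
      labels:
        app: team-1005-inference-demo
    spec:
      containers:
      - name: server
        image: registry.example.com/inference-server:latest   # placeholder image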

weidalin commented 1 month ago

Hi @weidalin, thank you for your interest in this feature.

Actually, this feature was originally designed to support batch selection of resources through resourceSelector. However, during later discussion, we were worried that batch rescheduling is a dangerous operation that users might not dare to use, or could easily misuse, so we deliberately dropped this entry point.

Simply put, there was a lack of user demand at that time. If more real users call for resourceSelector, we are very happy to consider supporting this capability.

Hi @chaosi-zju, thank you very much for your reply. Could you please let us know if there is a branch of the WorkloadRebalancer that already supports ResourceSelectors? If so, could you share it with us? If not, what are your future development plans regarding this feature?

Additionally, when you mentioned "we were worried that batch rescheduling was a dangerous operation," could you clarify what kind of scenarios you were referring to?

chaosi-zju commented 1 month ago

Could you please let us know if there is a branch of the WorkloadRebalancer that already supports ResourceSelectors?

Sorry, not yet.

If not, what are your future development plans regarding this feature?

The scenario you provided is very interesting. I think we will decide as soon as possible whether or not to support this capability.

could you clarify what kind of scenarios you were referring to?

I mean the replica distribution changing dramatically.

We previously mainly supported Deployments, which are a bit different from training and inference tasks. In that case, the user does not want pods that are running fine to undergo major changes; however, the labelSelector way can involve many pods, which risks large changes to the overall replica distribution and could shock the system to some extent.

so2bin commented 1 month ago
RainbowMango commented 1 month ago

Hi @weidalin @so2bin, thanks for the feedback and input. I think we can iterate the WorkloadRebalancer based on your scenario.

The WorkloadRebalancer only takes responsibility for triggering the re-schedule. Given that the impact of re-scheduling might be significant, the replica distribution might vary greatly; for example, all replicas could be migrated away from the origin cluster, and service quality might become a challenge in that case. You need to make sure the load balancer across clusters is well configured, so it needs to be used very carefully. In addition, we didn't have a use case that expected batch re-scheduling before, which is the reason we don't support it.

We might need to ask a few more questions to better understand your use case.

As @so2bin mentioned above, you decided to use a CPP with the DynamicWeight strategy. Can you share a copy of the CPP you are using?

Is there a load balancer across clusters in your case? Since you are re-scheduling inference workloads, how do you handle the traffic across the replicas in the two clusters?

so2bin commented 1 month ago

@chaosi-zju I hope the following helps you understand our scenario. Thanks.

Background

Current online design: (architecture diagram not shown)

Yes, as you mentioned, there is a traffic risk if all replicas are migrated from one cluster to another cluster. However, in our scenario, the CPPs are managed by the platform manager, so we will take care of it and avoid this problem.

Next version design: (architecture diagram not shown)
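
Roughly, the next-version team CPP would select team workloads and use the DynamicWeight strategy. The sketch below is only illustrative; the policy name, cluster names, and labels are not our real configuration:

apiVersion: policy.karmada.io/v1alpha1
kind: ClusterPropagationPolicy
metadata:
  name: team-1005-cpp              # hypothetical name
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    labelSelector:
      matchLabels:
        atms-app/teamid: "1005"    # team label, as in the example above
  placement:
    clusterAffinity:
      clusterNames:
      - cluster-a                  # illustrative member clusters
      - cluster-b
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        dynamicWeight: AvailableReplicas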

RainbowMango commented 1 month ago

The previous design has some drawbacks. Each deployment's propagation always uses the specified team CPP with static weight distribution. This can lead to a pending problem where a pod is pending in Cluster-A, but Cluster-B has enough available GPUs.

That is because the StaticWeight mode doesn't consider available resources during replica assignment. Coincidentally, we are discussing whether we need to enhance StaticWeight to take available resources into account at https://github.com/karmada-io/karmada/issues/4805#issuecomment-2283607217. May I ask what's your opinion on this?

How to do computational bursts: we have two timers which are cron-triggered at 07:00 and 23:00 every day; they switch between the day-time CPP and the night-time CPP by deleting and creating the CPP (here we do not use CPP preemption).

Changing the CPP also triggers a re-schedule, so you don't need WorkloadRebalancer. In your next plan you are going to use DynamicWeight, and changing the CPP won't trigger a re-schedule, so you need a way to force rescheduling, am I right?

chaosi-zju commented 1 month ago

Hi @so2bin @weidalin, thank you very much for the valuable practice you shared above. I am amazed at the depth of your exploration and your clear, detailed reply 👍.

I think I understand why you have this appeal. But I'm interested in one question: as you said above, one sentence is "each team use a CPP to distribute the pods to worker clusters", and another is "the CPPs are managed by the platform manager". I wonder who exactly is in charge of this CPP. Do you mean the CPP is declared by the user, but the apply/delete operations are handled by the platform manager?

chaosi-zju commented 1 month ago

enhance the StaticWeight to let it take available resources

It seems they not only want to take available resources into account, but also want to use the custom estimator to get more accurate results for their scenario.

so2bin commented 1 month ago

Changing the CPP also triggers a re-schedule, so you don't need WorkloadRebalancer. In your next plan you are going to use DynamicWeight, and changing the CPP won't trigger a re-schedule, so you need a way to force rescheduling, am I right?

@RainbowMango Apologies for the delayed response. We have conducted further verification by performing a delete and create operation on the CPP with DynamicWeight type. The result shows that this operation does not trigger resource re-scheduling, as the placement remains unchanged: https://github.com/karmada-io/karmada/blob/ba360c9aa7e24ffb1fffdcb14dd2b828a74fbafa/pkg/scheduler/helper.go#L51

so2bin commented 1 month ago

Hi @so2bin @weidalin, thank you very much for the valuable practice you shared above. I am amazed at the depth of your exploration and your clear, detailed reply 👍.

I think I understand why you have this appeal. But I'm interested in one question: as you said above, one sentence is "each team use a CPP to distribute the pods to worker clusters", and another is "the CPPs are managed by the platform manager". I wonder who exactly is in charge of this CPP. Do you mean the CPP is declared by the user, but the apply/delete operations are handled by the platform manager?

@chaosi-zju We are platform maintainers and administrators, and we have full operational permission over team CPPs. These CPPs are created and managed by us. For platform users, the creation and changes of CPPs are seamless and transparent. Therefore, we also hope that the process of cluster-level computational burst scheduling is seamless and transparent to platform users.

so2bin commented 1 month ago

enhance the StaticWeight to let it take available resources

It seems they not only want to take available resources into account, but also want to use the custom estimator to get more accurate results for their scenario.

Yes, we need to take into account fine-grained, team-level resource distribution from a custom estimator server.

RainbowMango commented 1 month ago

We are platform maintainers and administrators, and we have full operational permission over team CPPs. These CPPs are created and managed by us. For platform users, the creation and changes of CPPs are seamless and transparent.

I guess each team probably has multiple applications on your platform. Just out of curiosity, do you manage one CPP per team or one CPP per application?

so2bin commented 1 month ago

We are platform maintainers and administrators, and we have full operational permission over team CPPs. These CPPs are created and managed by us. For platform users, the creation and changes of CPPs are seamless and transparent.

I guess each team probably has multiple applications on your platform. Just out of curiosity, do you manage one CPP per team or one CPP per application?

@RainbowMango By default, each team has one team-level CPP, and a few exceptional apps use app-level CPPs, which have higher priority.

chaosi-zju commented 1 month ago

Hi @so2bin, thank you for your above explanation~

there is a traffic risk if all replicas are migrated from one cluster to another cluster. However, in our scenario, the CPPs are managed by the platform manager, so we will take care of it and avoid this problem.

A new question comes up: as the platform manager, what concrete way do you have to "take care of it and avoid this problem"? And if you don't know much about the specific team apps, how do you judge that the team's apps are not affected?

If your method is common or inspiring for most other people, I think we don't need to be so concerned and can just start pushing forward the feature you mentioned.

weidalin commented 1 month ago

That is because the StaticWeight mode doesn't consider available resources during replica assignment. Coincidentally, we are discussing whether we need to enhance StaticWeight to take available resources into account at https://github.com/karmada-io/karmada/issues/4805#issuecomment-2283607217. May I ask what's your opinion on this?

After carefully reviewing https://github.com/karmada-io/karmada/issues/4805#issuecomment-2283607217, I believe the two main issues and optimizations regarding static weight are as described below. Is my understanding correct?

@whitewindmills Whether it's the static weight strategy or this AverageReplicas strategy, they are both ways of assigning replicas. Currently, the static weight strategy has two main "disadvantages":

1. It doesn't comply with the spread constraints.
2. It doesn't take the available replicas in the cluster into account.

@RainbowMango's view on this feature is that we could try to enhance the existing staticWeight. The possible improvements could be:

1. Make static weight consider spread constraints when selecting target clusters.
2. Make static weight take available resources into account (i.e., if any cluster lacks sufficient resources, the scheduling fails).

We also encounter these problems when using the static weight strategy. I agree that making static weight take available resources into account (failing the schedule if any cluster lacks sufficient resources) helps avoid pending Pods, but simply letting the scheduling fail may not be a good approach.

I would like to explain the impact of this change based on the actual usage scenarios of our AI applications. I hope it can help you evaluate this static weight change.

  1. Tidal Cluster Weight Switching Scenario: insufficient resources in any cluster will cause the scheduling to fail. (A sketch of the day-time and night-time placements we switch between is at the end of this comment.)

    • Before the static weight optimization: some Pods in clusters with insufficient resources will enter Pending, while clusters with sufficient resources will successfully complete the tidal cluster weight switch.

    • After the static weight optimization: if any cluster has insufficient resources, the tidal cluster weight switch cannot be completed at all.

In this scenario, the static weight optimization helps us avoid Pods entering the Pending state, which is indeed an improvement. But it cannot satisfy our need to switch the tidal cluster weights.

In the tidal cluster weight switching scenario, it is more suitable for us to use dynamic weight and WorkloadRebalancer in combination. So we hope that WorkloadRebalancer can support LabelSelector.

  2. Scaling Application Replicas Scenario: if cluster A has insufficient resources while cluster B has sufficient resources, the replicas for cluster B cannot be scheduled either (because scheduling fails if any cluster has insufficient resources). This is quite different from our current usage habits.

    • Before the static weight optimization: Cluster A has insufficient resources, so Cluster A's replicas enter the Pending state. However, Cluster B has sufficient resources to scale, so it can meet the emergency scaling needs.

    • After the static weight optimization: Cluster B has sufficient resources, but Cluster A has insufficient resources, which causes the scheduling for both clusters A and B to fail. If I understand the optimization correctly, the scale-up will never succeed unless the StaticWeight cluster ratio in the CPP is changed. In this scenario, scaling requires modifying both the number of replicas in the resource template and the cluster weight ratio in the CPP, which introduces complexity into emergency scaling.

Therefore, I think it is simpler and more efficient to keep the current static weight behavior (not taking the available resources in the cluster into account).

I hope this feedback is helpful to you.
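
To make the tidal weight switching concrete, the two placements we switch between look roughly like the snippets below; cluster names and weights are illustrative rather than our real configuration.

Day-time placement (inference runs in both clusters):

  placement:
    clusterAffinity:
      clusterNames:
      - cluster-a
      - cluster-b
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        staticWeightList:
        - targetCluster:
            clusterNames:
            - cluster-a
          weight: 1
        - targetCluster:
            clusterNames:
            - cluster-b
          weight: 1

Night-time placement (inference consolidated onto Cluster B):

  placement:
    clusterAffinity:
      clusterNames:
      - cluster-b
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        staticWeightList:
        - targetCluster:
            clusterNames:
            - cluster-b
          weight: 1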

chaosi-zju commented 1 month ago

Thank you very much for the two scenarios you provided!

Cluster B has sufficient resources, but Cluster A has insufficient resources, which causes the scheduling for both clusters A and B to fail. If I understand the optimization correctly, the scale-up will never succeed unless the StaticWeight cluster ratio in the CPP is changed.

I don't understand this sentence well. Hi @whitewindmills, is this consistent with what you proposed?

whitewindmills commented 1 month ago

I can't say for sure yet because it's still pending, but I'd like to share my opinion. Usually we are mainly concerned about whether the selected clusters can accommodate the replicas to be allocated. As for whether the final result is exactly in line with the proportions of the static weight setting, that is not so important; we are likely to make an approximate assignment.

RainbowMango commented 1 month ago

I've just put this issue into the Karmada backlog. I think we can discuss it at one of the community meetings to see how to move this forward.

@weidalin @so2bin I'm not sure if the time slot works well for you. Please find a time that suits you and add an agenda item to the Meeting Notes. (Note: by joining the Google group you will be able to edit the meeting notes. Join the Google group mailing list: https://groups.google.com/forum/#!forum/karmada)

weidalin commented 1 month ago

I've just put this issue into the Karmada backlog. I think we can discuss it at one of the community meetings to see how to move this forward.

@weidalin @so2bin I'm not sure if the time slot works well for you. Please find a time that suits you and add an agenda item to the Meeting Notes. (Note: by joining the Google group you will be able to edit the meeting notes. Join the Google group mailing list: https://groups.google.com/forum/#!forum/karmada)

Hello, we have added an agenda to the Meeting Notes of the 2024-09-24 meeting.