kubernetes-sigs / scheduler-plugins

Repository for out-of-tree scheduler plugins based on scheduler framework.
Apache License 2.0

[proposal] add a scheduler plugin that supports pod expansion and shrinking according to the order of a defined logical node set #475

Open fjding opened 1 year ago

fjding commented 1 year ago

Kubernetes supports pod-deletion-cost since v1.21. In my cloud scenario, users have demands like this:

1. Define multiple logical node sets; a Deployment workload can schedule pods according to the node set order, and shrink in the opposite order.
2. At the same time, support a maximum number of schedulable pods per node set.

BTW, I have implemented this feature and want to contribute it to the community. I hope everyone can discuss it together.

KunWuLuan commented 1 year ago

My company also has a similar plugin. We can find a time to have a discussion.

fjding commented 1 year ago

My company also has a similar plugin. We can find a time to have a discussion.

Hi, we can collaborate on this proposal

fjding commented 1 year ago

@Huang-Wei @ffromani @seanmalloy @denkensk Could you take a look at this proposal? We can discuss whether we need to create a KEP.

ffromani commented 1 year ago

@Huang-Wei @ffromani @seanmalloy @denkensk Could you take a look at this proposal? We can discuss whether we need to create a KEP.

I'll have a look later this week (beginning April 3 2023)

Huang-Wei commented 1 year ago

It will help us understand the motivation(s) if you can elaborate on the real-world use cases.

1. Define multiple logical node sets; a Deployment workload can schedule pods according to the node set order, and shrink in the opposite order.

What do you mean by "node set order"? Is that a priority field of the NodeSet CR?

How are a Deployment's replicas expected to be scheduled onto the matching NodeSets? And is the scheduling directive a hard or soft constraint?

2. At the same time, support a maximum number of schedulable pods per node set.

Where is this max number defined?

fjding commented 1 year ago

It will help us understand the motivation(s) if you can elaborate on the real-world use cases.

1. Define multiple logical node sets; a Deployment workload can schedule pods according to the node set order, and shrink in the opposite order.

What do you mean by "node set order"? Is that a priority field of the NodeSet CR?

How are a Deployment's replicas expected to be scheduled onto the matching NodeSets? And is the scheduling directive a hard or soft constraint?

2. At the same time, support a maximum number of schedulable pods per node set.

Where is this max number defined?

Hi, thank you for your attention! The motivation: in cloud scenarios, some users prefer to use ECS first; when ECS is insufficient, they consider using elastic containers such as Alibaba Cloud's ECI, because the cost of ECS is lower than the cost of ECI. @KunWuLuan, can you add your usage scenarios?

We will define a CRD named ResourcePolicy; an example CR instance is shown in the attached image.

Because ecs-pool is ranked before eci-pool, pods will be scheduled to ecs-pool first. If the number of pods scheduled into ecs-pool exceeds 100, pods will be scheduled to eci-pool.
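A minimal sketch of what such a CR might look like, based on the description above and on the ResourcePolicy example later in this thread; the apiVersion, labels, and the exact placement of maxReplicas are assumptions for illustration, not the actual spec:

apiVersion: scheduling.x-k8s.io/v1alpha1   # assumed group/version
kind: ResourcePolicy
metadata:
  name: ecs-first
  namespace: demo
spec:
  selector:
    app: my-app             # pods selected by this policy
  strategy: prefer
  units:                    # ordered: earlier units are filled first
  - resource: ecs           # the "ecs-pool": scheduled first
    maxReplicas: 100        # assumed field; pods beyond 100 overflow to the next unit
    nodeSelector:
      pool: ecs
  - resource: eci           # the "eci-pool": used once ecs-pool reaches its cap
    nodeSelector:
      pool: eci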

KunWuLuan commented 1 year ago

In our company's scenario, customers deploy both spot instances and pay-as-you-go instances simultaneously. Customers want their business to run on spot instances first to save costs, and when spot instance resources are insufficient, it runs on pay-as-you-go instances. Moreover, during business peak periods, when neither type of instance has resources, the business Pods are scheduled to ECI nodes. In this case, they deploy a ResourcePolicy as follows:

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: xxx
  namespace: xxx
spec:
  selector:
    key1: value1
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      type: spot
  - resource: ecs
    nodeSelector:
      type: pay-as-you-go
  - resource: eci

Huang-Wei commented 1 year ago

It seems @KunWuLuan is talking about the Alibaba cloud's feature described here: https://www.alibabacloud.com/help/en/container-service-for-kubernetes/latest/configure-priority-based-resource-scheduling. And @fjding is talking about a similar in-house implementation? (the design of maxReplicas is a bit strange though).

I'm open to host an abstracted version in scheduler-plugins.

BTW, not sure how you guys implement the node-pool-based preference in the scoring phase. My feeling is that to support it efficiently, we may need to bring some missing machinery to the scheduler framework; you can check my comment in one of the sig meetings: https://youtu.be/UhZBkFamoAg?t=1694

cc @denkensk

denkensk commented 1 year ago

It seems @KunWuLuan is talking about the Alibaba cloud's feature described here: https://www.alibabacloud.com/help/en/container-service-for-kubernetes/latest/configure-priority-based-resource-scheduling. And @fjding is talking about a similar in-house implementation? (the design of maxReplicas is a bit strange though).

Hmmm, I know it. Actually, I am the author of this feature in Alibaba Cloud 😄. It took me a long time to come up with the name ResourcePolicy 😄 @fjding Did you reference this implementation before?

denkensk commented 1 year ago

If the number of pods scheduled into ecs-pool exceeds 100, pods will be scheduled to eci-pool.

Can you introduce your scenario for this? And why do you need to schedule 100 pods to ecs-pool first? @fjding

denkensk commented 1 year ago

BTW, not sure how you guys implement the node-pool-based preference in the scoring phase. My feeling is that to support it efficiently, we may need to bring some missing machinery to the scheduler framework; you can check my comment in one of the sig meetings: https://youtu.be/UhZBkFamoAg?t=1694

Your comment is very useful for a real production environment. I also care about efficiency and memory usage if we need to keep some history or state. @Huang-Wei

fjding commented 1 year ago

It seems @KunWuLuan is talking about the Alibaba cloud's feature described here: https://www.alibabacloud.com/help/en/container-service-for-kubernetes/latest/configure-priority-based-resource-scheduling. And @fjding is talking about a similar in-house implementation? (the design of maxReplicas is a bit strange though).

I'm open to host an abstracted version in scheduler-plugins.

BTW, not sure how you guys implement the node-pool-based preference in the scoring phase. My feeling is that to support it efficiently, we may need to bring some missing machinery to the scheduler framework; you can check my comment in one of the sig meetings: https://youtu.be/UhZBkFamoAg?t=1694

cc @denkensk The proposal I provided is being used on ByteDance's Volcano Engine, and the design was inspired by Alibaba Cloud's implementation. However, I personally think that maxReplicas is very useful, as in the following scenario (shown in the attached image): users expect a Deployment's Pods to be distributed across different AZs (Availability Zones) in a certain proportion.

A cluster has multiple AZs (Availability Zones), and each AZ has a VK (virtual kubelet). Users expect a Deployment's Pods to be distributed across the AZs in a certain proportion.
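A hedged sketch of how maxReplicas could express such a proportion, reusing the unit/nodeSelector shape from the example above; the apiVersion, resource types, zone labels, and cap values are illustrative assumptions:

apiVersion: scheduling.x-k8s.io/v1alpha1   # assumed group/version
kind: ResourcePolicy
metadata:
  name: multi-az-proportion
spec:
  selector:
    app: my-app
  strategy: prefer
  units:                                   # roughly a 2:1:1 split across three AZs
  - resource: eci                          # each AZ served by its own VK / elastic pool (assumed)
    maxReplicas: 50
    nodeSelector:
      topology.kubernetes.io/zone: az-a
  - resource: eci
    maxReplicas: 25
    nodeSelector:
      topology.kubernetes.io/zone: az-b
  - resource: eci
    maxReplicas: 25
    nodeSelector:
      topology.kubernetes.io/zone: az-c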

fjding commented 1 year ago

It seems @KunWuLuan is talking about the Alibaba cloud's feature described here: https://www.alibabacloud.com/help/en/container-service-for-kubernetes/latest/configure-priority-based-resource-scheduling. And @fjding is talking about a similar in-house implementation? (the design of maxReplicas is a bit strange though).

Hmmm, I know it. Actually, I am the author of this feature in Alibaba Cloud 😄. It took me a long time to come up with the name ResourcePolicy 😄 @fjding Did you reference this implementation before?

@denkensk Yes, the design was inspired by Alibaba Cloud's implementation; at the same time, some other functions were added.

fjding commented 1 year ago

If the number of pods scheduled into ecs-pool exceeds 100, pods will be scheduled to eci-pool.

Can you introduce your scenario for this? And why do you need to schedule 100 pods to ecs-pool first? @fjding

As in the example I gave above, multi-AZ deployment is a good case; OpenKruise also provides some cases (link).

denkensk commented 1 year ago

A cluster has multiple AZs (Availability Zones), and each AZ has a VK (virtual kubelet). Users expect a Deployment's Pods to be distributed across the AZs in a certain proportion. https://www.volcengine.com/docs/6460/177068

Thanks for your explanation, @fjding. I'm glad these ideas can be applied to your scenario, and that scheduler-plugins can be used in ByteDance's Volcano Engine.

denkensk commented 1 year ago

And I think we also need to clarify the core requirements. If you want to deploy the pods across different AZs, why use Max rather than Must? Because in my experience, users always want to make sure the proportion is required rather than preferred. @fjding

@KunWuLuan Do you have feedback from other users or more needs for a "resource policy"? We can discuss it here and make a more generic design together.

fjding commented 1 year ago

@denkensk Users often use multi-AZ scenarios for disaster recovery purposes. In elastic container scenarios, such as ByteDance's VCI, users cannot accurately predict the upper limit of VCI capacity. Therefore, they cannot block a pod from launching just because resources in one AZ are unavailable.

fjding commented 1 year ago

And I think we also need to clarify the core requirements. If you want to deploy the pods across different AZs, why use Max rather than Must? Because in my experience, users always want to make sure the proportion is required rather than preferred. @fjding

@KunWuLuan Do you have feedback from other users or more needs for a "resource policy"? We can discuss it here and make a more generic design together.

BTW, the strategy can be required, and maxReplicas can meet the "Must" scenario you mentioned.

KunWuLuan commented 1 year ago

Do you have feedback from other users or more needs for a "resource policy"?

In my cloud scenario, our users use ResourcePolicy to run a fixed number of Pods on ECS nodes (like maxReplicas in this design) and schedule the Pods that are scaled out during peak periods to Spot instances or ECI.
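A hedged sketch of that pattern, again reusing the unit shape from the earlier example; the apiVersion, cap value, and node labels are illustrative assumptions:

apiVersion: scheduling.x-k8s.io/v1alpha1   # assumed group/version
kind: ResourcePolicy
metadata:
  name: fixed-ecs-then-elastic
spec:
  selector:
    app: web
  strategy: prefer
  units:
  - resource: ecs            # the fixed baseline stays on pay-as-you-go ECS
    maxReplicas: 10          # assumed cap, i.e. "a fixed number of Pods"
    nodeSelector:
      type: pay-as-you-go
  - resource: ecs            # pods scaled out during peaks spill over to Spot
    nodeSelector:
      type: spot
  - resource: eci            # and finally to ECI when ECS resources run out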

fjding commented 1 year ago

@Huang-Wei @denkensk @ffromani After the above discussion, do you have any other questions? Can we now propose a complete KEP? cc @KunWuLuan

fjding commented 1 year ago

@Huang-Wei Hi, are there any other issues with this proposal? If not, can we proceed with writing a KEP document?

Huang-Wei commented 1 year ago

Sure, please go ahead and raise a KEP. We can continue the discussion in the KEP. Just keep in mind this repo focuses more on the scheduling portion, and may leave the discussion of CRD spec details out of scope.

fjding commented 1 year ago

Sure, please go ahead and raise a KEP. We can continue the discussion in the KEP. Just keep in mind this repo focuses more on the scheduling portion, and may leave the discussion of CRD spec details out of scope.

Thanks. @KunWuLuan, we can do it together now.

KunWuLuan commented 1 year ago

@fjding Hi, I have submitted a draft for this feature.

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.


Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

KunWuLuan commented 9 months ago

/remove-lifecycle stale

KunWuLuan commented 8 months ago

This CRD is widely used in both my company and fjding's; in the proposal we have selected the features our customers commonly need. So we think the CRD described in the proposal is a stable version and will not be updated frequently. Maybe we can host this CRD in scheduler-plugins instead of somewhere else. WDYT? @fjding cc @ffromani @Huang-Wei

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.


Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

KunWuLuan commented 5 months ago

/remove-lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.


Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

KunWuLuan commented 2 months ago

/remove-lifecycle stale

KunWuLuan commented 5 days ago

/assign