kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

Add MaxWaitingTime in case of job starvation of lower priority #754

Open kerthcet opened 1 year ago

kerthcet commented 1 year ago

We enqueue jobs mainly by priority, which can cause starvation of lower-priority jobs. We can provide a mechanism to avoid this.

What would you like to be added:

A new field MaxWaitingTime to avoid job starvation.

Why is this needed:

Completion requirements:

This enhancement requires the following artifacts:

The artifacts should be linked in subsequent comments.

alculquicondor commented 1 year ago

what would happen when the workload hits the maxWaitingTime?

kerthcet commented 1 year ago

It will be popped out for scheduling regardless of its priority.

alculquicondor commented 1 year ago

You mean it will go to the head of the ClusterQueue?

It seems a bit aggressive. Another way this can be modeled is that the priority actually goes up. Another important question is at which object this is controlled. Workload? ClusterQueue? global?

And how to prevent abuse?

trasc commented 1 year ago

Only moving the workload to the head of the queue will be pointless when preemption is enabled; it will just be evicted at the next scheduling cycle.

tenzen-y commented 1 year ago

It seems a bit aggressive. Another way this can be modeled is that the priority actually goes up.

I like this.

Another important question is at which object this is controlled. Workload? ClusterQueue? global?

Maybe, ClusterQueue? If Workload controls this, it means batch users can modify this.

kerthcet commented 1 year ago

It seems a bit aggressive.

Yes, a bit aggressive, but straightforward if some workloads have a deadline to run. But I'm not quite sure about this; I think we can wait for more feedback from the community.

Another way this can be modeled is that the priority actually goes up.

But by how much should it go up each time?

Another important question is at which object this is controlled. Workload? ClusterQueue? global?

localQueue or clusterQueue both make sense to me.

And how to prevent abuse?

I think it's hard to answer. This is configurable, but considering that a localQueue should be created by an admin, we can be optimistic about this?

kerthcet commented 1 year ago

Only moving the workload to the head of the queue will be pointless when preemption is enabled; it will just be evicted at the next scheduling cycle.

~We're not talking about preemption, just in case low priority jobs are starved by continuous high priority jobs.~

If resources are insufficient, this is a problem.

Gekko0114 commented 1 year ago

Hi, this looks interesting, can I work on this issue? If it is OK, I will clarify how to implement it later.

kerthcet commented 1 year ago

Yes, you can take it if you like, but we haven't reached an agreement on the implementation, like a maxWaitingTime, or increasing the priority per scheduling cycle, or something else. And we should also consider preemption.

Gekko0114 commented 1 year ago

Sure, thanks! I will read this thread in detail and clarify it. Once we reach an agreement, I will assign it to myself.

alculquicondor commented 1 year ago

I would also consider adding an interface somewhere that allows us to implement different policies for ordering/priorities.

If you have a concrete use case, that would help us understand what the API can look like.

Gekko0114 commented 1 year ago

Hi,

I did some research this weekend and summarized my proposal. Could you take a look?


I am considering adding the following fields to clusterQueue.spec:

- labels: users can change the priority of specific jobs by setting labels.
- maxWaitingTime: the timeout after which the timeout action is triggered.
- strategy: the action for handling timeouts. You can choose the "First" or "Prioritize" option (it might be good to add a "Custom" option and provide an interface so developers can customize the logic, but I'm not considering that now).
- addingPriority: if "Prioritize" is chosen, set addingPriority. This value is added to each job's priority when the timeout happens.

Use cases: scenarios like raising the priority only for prod jobs, or executing recommendation batches within 3 hours, etc.

What do you think? As discussed, since it is set on the clusterQueue, developers wouldn't be able to abuse it.

spec:
  timeoutStrategy:
    labels:
      environment: prod
      system: recommendation
    maxWaitingTime: 1h
    strategy: Prioritize
    addingPriority: 10

alculquicondor commented 1 year ago

Maybe we should consider adding a separate object, WorkloadPriorityClass. Then workloads refer to a priority class that has a base priority and a "dynamicPolicy".

  basePriority: 100
  dynamicPolicy:
    delay: 1h
    action: PriorityDelta
    value: 10
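
To make that concrete, here is a rough sketch of how a full object and a workload referencing it might look. Everything below is hypothetical at this point: the kind, the dynamicPolicy fields, and the priority-class label are just for illustration.

apiVersion: kueue.x-k8s.io/v1alpha1   # hypothetical
kind: WorkloadPriorityClass
metadata:
  name: batch-standard
basePriority: 100
dynamicPolicy:
  delay: 1h              # wait this long before each priority bump
  action: PriorityDelta
  value: 10              # priority added per bump
---
apiVersion: batch/v1
kind: Job
metadata:
  name: sample
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/priority-class: batch-standard   # hypothetical reference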

Open questions:

- what could another action be?
- how to keep record of the last time the priority was increased? Do we actually update the priority in the object or we just track in memory?

Definitely worth a KEP: https://github.com/kubernetes-sigs/kueue/tree/main/keps

Gekko0114 commented 1 year ago

Hi @alculquicondor,

Thanks for your response.

I have a question. Why do you think it's better to add a separate object WorkloadPriorityClass rather than adding a new field to clusterQueue? Is it because it would be easier for batch users to manage priority by choosing a PriorityClass for each job?

what could another action be?

I think we should consider implementing another action: moving the workload to the head of the ClusterQueue if it times out, which was the initial idea @kerthcet mentioned.

how to keep record of the last time the priority was increased? Do we actually update the priority in the object or we just track in memory?

I don't have a strong opinion on this, but I think it's better to update the priority in the object. If so, admins can easily change the priority manually.

Definitely worth a KEP:

I agree. I will create it.

alculquicondor commented 1 year ago

Why do you think it's better to add a separate object WorkloadPriorityClass rather than adding a new field to clusterQueue? Is it because it would be easier for batch users to manage priority by choosing a PriorityClass for each job?

Yes, I don't think it's realistic that all jobs in a ClusterQueue would have the same behavior.

I think we should consider implementing another action: moving the workload to the head of the ClusterQueue if it times out, which was the initial idea @kerthcet mentioned.

I'm not too sure. Because this workload could be immediately preempted in the next iteration. It's better if its priority changes.

I don't have a strong opinion on this, but I think it's better to update the priority in the object. If so, admins can easily change the priority manually.

Yes, but also to my point above about not being easily preempted. Another thing to think about is how do we know when the priority was bumped for the last time. In memory is not enough, because the controller could be restarted.
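
For example (purely illustrative, the annotation key below does not exist), the last bump time could be persisted on the Workload object itself so a restarted controller can pick it up:

metadata:
  annotations:
    kueue.x-k8s.io/last-priority-bump: "2023-08-01T10:00:00Z"   # hypothetical annotation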

Can you look around in other queueing systems if they have dynamic priority and how it works?

Gekko0114 commented 1 year ago

Thanks for your explanation! @alculquicondor

I'm not too sure. Because this workload could be immediately preempted in the next iteration. It's better if its priority changes. Yes, but also to my point above about not being easily preempted.

I have one question about preemption. If we update the priority in the object and a pod starts running, my understanding is that the pod would not be easily preempted since it would have a higher priority. Is that correct?

Another thing to think about is how do we know when the priority was bumped for the last time. In memory is not enough, because the controller could be restarted.

I agree. It's important.

Can you look around in other queueing systems if they have dynamic priority and how it works?

Sure. I will look around and rethink the design based on your explanation.

alculquicondor commented 1 year ago

If we update the priority in the object and a pod starts running, my understanding is that the pod would not be easily preempted since it would have a higher priority. Is that correct?

Exactly. That's probably desired? Which means we should probably add a maxPriority field in the priorityClass. So we have base, step, delayForStep, and max fields (exact names TBD).

Gekko0114 commented 1 year ago

Sounds good!

I wasn't quite clear on the intention behind step and maxPriority.

So, if I understand correctly, base represents the priority when the job is initially created, delayForStep indicates the time (e.g., 1 hour) to wait before changing the priority after a timeout, step determines how much priority to add when a timeout occurs, and maxPriority is the priority used to override and prevent preemption of actively running pods, is that correct?

alculquicondor commented 1 year ago

Almost there. So after 1h, we have p = base + step. What happens after 2h? My thinking is that p = base + step * 2. Then we need a maxPriority, as we don't want it to go to infinity. By default we can probably cap it at MAX_INT32.
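
Spelled out (just a sketch of the intended math; the function and field names below are made up for illustration, not a proposed API):

package main

import (
	"fmt"
	"time"
)

// effectivePriority sketches the escalation described above:
// p = base + step * floor(elapsed / delayForStep), capped at maxPriority.
func effectivePriority(base, step, maxPriority int32, delayForStep, elapsed time.Duration) int32 {
	if delayForStep <= 0 || elapsed < delayForStep {
		return base
	}
	bumps := int32(elapsed / delayForStep) // integer division acts as floor
	p := base + step*bumps
	if p > maxPriority || p < base { // the second check guards against int32 overflow
		return maxPriority
	}
	return p
}

func main() {
	// base=100, step=10, delayForStep=1h, elapsed=2h -> 100 + 10*2 = 120
	fmt.Println(effectivePriority(100, 10, 2147483647, time.Hour, 2*time.Hour))
}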

Gekko0114 commented 1 year ago

I see. Thanks.

Then I will create the KEP (and also investigate other queueing systems). If there's anything else we should discuss, please leave comments.

/assign

alculquicondor commented 1 year ago

I created a simplified version of this feature request #973

kerthcet commented 9 months ago

I think we can leverage the workload priority to mitigate this issue. The general idea is that when the waiting time ends, we raise the workload priority to a higher value and the job is re-enqueued. In the next scheduling cycle, the job will not be preempted, as designed.

cc @B1F030

alculquicondor commented 9 months ago

More generally, we will need a policy for dynamic priority. Another use case that popped up is decreasing the priority based on evictions: https://github.com/kubernetes-sigs/kueue/pull/1311#issuecomment-1870531528

kerthcet commented 9 months ago

For this case, maybe we can be more straightforward, something like:

  1. Set the maxWaitingTime in localQueue/clusterQueue/Job
  2. When the deadline is hit, raise the priority so the workload moves to the top of the queue

Intuitive, and it keeps the job queueing easy to control; a rough sketch is below.
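
For example, with a hypothetical maxWaitingTime field on the ClusterQueue (this field does not exist today):

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}
  queueingStrategy: BestEffortFIFO
  maxWaitingTime: 4h   # hypothetical: after 4h pending, the workload is moved to the head of the queue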

B1F030 commented 9 months ago

Here I drew a simple design of this mechanism:

[design diagram]

The prototype is: there is a basePriority, and when the workload hits the maxWaitingTime, we simply change its priority to MAX_INT32.

And based on the prototype, here is another design: first we use the workload's priority as the basePriority; after delayTime (for example, 30m), we increase its priority by step (100); then after another delayForStep (10m), we add step again. When the workload hits the maxWaitingTime, its priority is raised straight to maxPriority (MAX_INT32).

But the second design may be a little complex, and I think we should not expose step and delayForStep to users in the alpha version. We can define them properly and provide a default policy that is easy to use.
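
Spelled out with all knobs exposed, the second design could look roughly like this (all field names below are illustrative only; nothing here is implemented):

dynamicPriority:
  basePriority: 100        # taken from the workload's own priority
  delayTime: 30m           # wait before the first bump
  step: 100                # priority added per bump
  delayForStep: 10m        # wait between subsequent bumps
  maxWaitingTime: 2h       # when hit, jump straight to maxPriority
  maxPriority: 2147483647  # MAX_INT32 by default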

What do you think?

alculquicondor commented 9 months ago

Let's take a step back... do you have a real requirement from an end user that you can share? What is the behavior they expect, regardless of what the API looks like?

To me, the idea of a Job getting the highest priority after some timeout sounds odd. Is the job really the most important of all? Additionally, this implies that the job is potentially never going to be preempted. Is this acceptable? Have you looked into older systems for inspiration?

kerthcet commented 9 months ago

The goal is to prevent job starvation, which is also part of fair sharing: a low-priority job will stay pending in the queue forever if higher-priority jobs keep enqueueing.

Volcano has a similar design, https://github.com/volcano-sh/volcano/blob/master/docs/design/sla-plugin.md; however, the mechanism there is that when the max waiting time is hit, the job is admitted and resources are reserved for it until it is ready.

Some of our users are using Volcano, and we hope to provide smooth migrations for them. At the very least, I think this demand still sounds reasonable; I hope I'm not the only one. But we can think more about the API design.

this implies that the job is potentially never going to be preempted

This is tricky... but we are, in a way, a preemption-based system. There seems to be no better way; I mean this design: https://github.com/kubernetes-sigs/kueue/issues/754#issuecomment-1874899525.

kerthcet commented 9 months ago

But the second design may be a little complex, and I think we should not expose step and delayForStep to users in the alpha version. We can define them properly and provide a default policy that is easy to use.

What do you think?

Complex +1... let's discuss the rationale first.

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

kerthcet commented 6 months ago

/lifecycle froze

kerthcet commented 6 months ago

/lifecycle frozen