volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0
4.19k stars · 962 forks

Job pending even though the allocatable resource is enough #1297

Closed · nicklhy closed 3 years ago

nicklhy commented 3 years ago

Is this a BUG REPORT or FEATURE REQUEST?: /kind bug

What happened: There are two nodes in my cluster, A and B. Their current idle resources are 53 CPUs + 476 GB memory and 31 CPUs + 96 GB memory, respectively. Now I want to submit a distributed job with two tasks: X needs 30 CPUs + 1 GB memory, and Y needs 50 CPUs + 1 GB memory. In theory, task X can be scheduled on node B and task Y on node A. However, if I submit this job repeatedly (without any config change), it sometimes succeeds and sometimes fails.

The volcano-scheduler log shows that it first tried to bind task X to node A, after which there was no room left for task Y. It did not consider the other option: binding task X to node B.
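To make the arithmetic explicit, here is a standalone check (illustrative Go, using only the CPU numbers above, since memory is not the binding constraint in this case):

```go
package main

import "fmt"

func main() {
	const idleA, idleB = 53, 31 // idle CPUs on nodes A and B
	const needX, needY = 30, 50 // CPU requests of tasks X and Y

	// The order the scheduler happened to choose: bind X to A first.
	fmt.Println("X->A, then Y fits?", idleA-needX >= needY || idleB >= needY) // false: 23 and 31 are both < 50

	// The other order: bind X to B first.
	fmt.Println("X->B, then Y fits?", idleA >= needY || idleB-needX >= needY) // true: A still has 53
}
```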

What you expected to happen:

This job should always be scheduled successfully.

How to reproduce it (as minimally and precisely as possible):

Submit a job like the one above multiple times.

Anything else we need to know?:

I looked into Volcano's source code for the job scheduling policy here:

https://github.com/volcano-sh/volcano/blob/da8df888c412814d3574c501bc7c0b31a828b68e/pkg/scheduler/actions/allocate/allocate.go#L198-L203

It seems Volcano tries to schedule each task one by one, greedily. I suspect this ordering is the cause of the problem.
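To illustrate the effect of that loop, here is a toy re-implementation of its shape (illustrative Go, not Volcano's actual code): each task is committed to its best node immediately, and a later task's failure never causes an earlier bind to be revisited. With the numbers from this report, it fails whenever X is placed on the most-idle node first.

```go
package main

import "fmt"

// task and node are toy stand-ins for Volcano's task/node structures.
type task struct {
	name string
	cpu  int
}
type node struct {
	name string
	cpu  int
}

// allocateGreedy mimics the shape of the allocate action: each task is bound
// to its best (here: most-idle) feasible node in turn, with no backtracking.
func allocateGreedy(tasks []task, nodes []node) (map[string]string, bool) {
	binding := map[string]string{}
	for _, t := range tasks {
		best := -1
		for i := range nodes {
			if nodes[i].cpu >= t.cpu && (best == -1 || nodes[i].cpu > nodes[best].cpu) {
				best = i
			}
		}
		if best == -1 {
			return nil, false // no node fits this task; the whole gang stays pending
		}
		nodes[best].cpu -= t.cpu
		binding[t.name] = nodes[best].name
	}
	return binding, true
}

func main() {
	tasks := []task{{"X", 30}, {"Y", 50}} // considering X first reproduces the failure
	nodes := []node{{"A", 53}, {"B", 31}}
	b, ok := allocateGreedy(tasks, nodes)
	fmt.Println(ok, b) // false map[]: X lands on A (most idle), leaving no room for Y
}
```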

Environment:

ningfdx commented 3 years ago

+1

Thor-wl commented 3 years ago

Well, perhaps it only seems reasonable because you are thinking from a human point of view, not a scheduler's. When you solve the problem by hand, you already have the confirmed input set, job X and job Y, so you can search for the best way to match them to nodes. But a scheduler like Volcano deals with all jobs in turn, like a workflow: it can only find the best match and allocation for the current job given the current condition of the cluster, regardless of the jobs to be scheduled in the future. It cannot predict what kinds of jobs will come in the future.

nicklhy commented 3 years ago

@Thor-wl Thanks for your reply, but I think you misunderstood the problem. I am not complaining that Volcano fails to schedule two different jobs; I am complaining about the scheduling policy for a single job with multiple pods, which is exactly the core problem (also known as gang scheduling of a PodGroup?) that Volcano aims to solve.

I think it is better to schedule a job that includes multiple pods globally, rather than with a one-by-one greedy method.

PS: I used the term "task" because Volcano uses this word in the CRD: https://github.com/volcano-sh/volcano/blob/da8df888c412814d3574c501bc7c0b31a828b68e/pkg/apis/batch/v1alpha1/job.go#L64
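For illustration, here is a minimal sketch (toy Go, not Volcano code) of what scheduling the pod group globally could look like: backtracking over task-to-node assignments, which finds a feasible gang placement whenever one exists, at exponential worst-case cost.

```go
package main

import "fmt"

type task struct {
	name string
	cpu  int
}
type node struct {
	name string
	cpu  int
}

// assign tries every task-to-node placement recursively and undoes a
// placement when it leads to a dead end, so it succeeds exactly when some
// complete placement of the gang exists.
func assign(tasks []task, nodes []node, binding map[string]string) bool {
	if len(tasks) == 0 {
		return true // every task placed
	}
	t := tasks[0]
	for i := range nodes {
		if nodes[i].cpu < t.cpu {
			continue
		}
		nodes[i].cpu -= t.cpu // tentatively place t on nodes[i]
		binding[t.name] = nodes[i].name
		if assign(tasks[1:], nodes, binding) {
			return true
		}
		nodes[i].cpu += t.cpu // backtrack and try the next node
		delete(binding, t.name)
	}
	return false
}

func main() {
	binding := map[string]string{}
	ok := assign(
		[]task{{"X", 30}, {"Y", 50}},
		[]node{{"A", 53}, {"B", 31}},
		binding,
	)
	fmt.Println(ok, binding) // true map[X:B Y:A]
}
```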

ningfdx commented 3 years ago

> Well, perhaps it only seems reasonable because you are thinking from a human point of view, not a scheduler's. […] It cannot predict what kinds of jobs will come in the future.

Yes, I understand the reasoning behind the one-by-one design, but there is a difference between one job and many jobs. The scheduler should not try to predict the next job, so the design works fine when a job contains only one task. But when a batch job contains two or more tasks, the outcome becomes unpredictable; we do not know what will happen. I think Volcano could use an algorithm such as DP or recursion to improve this case.

Thor-wl commented 3 years ago

> @Thor-wl Thanks for your reply, but I think you misunderstood the problem. I am not complaining that Volcano fails to schedule two different jobs; I am complaining about the scheduling policy for a single job with multiple pods. […]
>
> I think it is better to schedule a job that includes multiple pods globally, rather than with a one-by-one greedy method.

Oh, yes, I'm sorry for misunderstanding the question. That is indeed an optimization opportunity when scheduling a job with multiple pods. Do you have any good ideas or algorithms for it?

Thor-wl commented 3 years ago

> I think Volcano could use an algorithm such as DP or recursion to improve this case.

That's interesting! Can you submit a design doc for that?

ningfdx commented 3 years ago

@Thor-wl
I think DP is a suitable algorithm. This is a standard 0-1 knapsack problem.
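For reference, the classic single-dimension 0-1 knapsack DP looks like this (a textbook sketch in Go, not Volcano code). Note that real scheduling involves several resource dimensions (CPU, memory, GPU) and several nodes, so it does not reduce to this table directly; that is the NP-hardness point raised below.

```go
package main

import "fmt"

// knapsack01 returns the maximum total value achievable with capacity W,
// where each item may be used at most once, via the standard O(n*W) DP.
func knapsack01(weights, values []int, W int) int {
	dp := make([]int, W+1) // dp[w] = best value achievable with capacity w
	for i := range weights {
		for w := W; w >= weights[i]; w-- { // downward, so each item is used at most once
			if v := dp[w-weights[i]] + values[i]; v > dp[w] {
				dp[w] = v
			}
		}
	}
	return dp[W]
}

func main() {
	// Toy instance: pack tasks (value = CPU demand) into one node's 53 idle cores.
	fmt.Println(knapsack01([]int{30, 50, 20}, []int{30, 50, 20}, 53)) // 50
}
```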

ningfdx commented 3 years ago

>> I think Volcano could use an algorithm such as DP or recursion to improve this case.
>
> That's interesting! Can you submit a design doc for that?

Sure, I need some time to prepare that.

Thor-wl commented 3 years ago

>>> I think Volcano could use an algorithm such as DP or recursion to improve this case.
>>
>> That's interesting! Can you submit a design doc for that?
>
> Sure, I need some time to prepare that.

Great! I'm looking forward to that.

k82cn commented 3 years ago

That's an NP-hard problem for multi-dimensional resources; anyway, it would be great to have some algorithm to improve it.

ningfdx commented 3 years ago

> That's an NP-hard problem for multi-dimensional resources;

Yes, you are correct: it is NP-hard, more complicated than the multiple-knapsack problem. Over the past few days I tried to solve it with a DP algorithm, but failed.

Volcano also has other logic such as node ordering and priorities. If we want to solve this problem in polynomial time, we cannot afford to try every possibility.

So we can redefine the problem: keep the current policy as step 1, and when a batch job fails to schedule, apply a greedy policy (a greedy algorithm that finds a feasible, though not necessarily optimal, solution) as step 2. That means "schedule failed" = both step 1 and step 2 failed. We could implement this greedy algorithm as plugin logic. @k82cn @Thor-wl, do you think this approach is feasible?
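One way to read the proposed step 2 (my sketch under assumptions, not the eventual design): a first-fit-decreasing pass that places the most demanding tasks first, each on the node with the most remaining capacity. It finds a feasible placement for the case in this issue, but it remains a heuristic and can still miss placements that exhaustive search would find.

```go
package main

import (
	"fmt"
	"sort"
)

type task struct {
	name string
	cpu  int
}
type node struct {
	name string
	cpu  int
}

// greedyFallback places tasks in decreasing order of demand, each on the
// node with the most remaining capacity (first-fit decreasing).
func greedyFallback(tasks []task, nodes []node) (map[string]string, bool) {
	sort.Slice(tasks, func(i, j int) bool { return tasks[i].cpu > tasks[j].cpu })
	binding := map[string]string{}
	for _, t := range tasks {
		best := -1
		for i := range nodes {
			if nodes[i].cpu >= t.cpu && (best == -1 || nodes[i].cpu > nodes[best].cpu) {
				best = i
			}
		}
		if best == -1 {
			return nil, false // step 2 failed as well; the job stays pending
		}
		nodes[best].cpu -= t.cpu
		binding[t.name] = nodes[best].name
	}
	return binding, true
}

func main() {
	b, ok := greedyFallback(
		[]task{{"X", 30}, {"Y", 50}},
		[]node{{"A", 53}, {"B", 31}},
	)
	fmt.Println(ok, b) // true map[X:B Y:A]: Y (50) goes to A first, then X fits on B
}
```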

k82cn commented 3 years ago

that's ok to me :)

ningfdx commented 3 years ago

OK, I'll write the code following the policy above.

k82cn commented 3 years ago

> OK, I'll write the code following the policy above.

Would you help to draft a design doc for that? That'll help with the discussion :)

ningfdx commented 3 years ago

>> OK, I'll write the code following the policy above.
>
> Would you help to draft a design doc for that? That'll help with the discussion :)

OK, no problem.

Thor-wl commented 3 years ago

Any progress on this feature and design? @ningfdx

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] commented 3 years ago

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

ZDWWWWW commented 7 months ago

I also need that QAQ