volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

Deadlocked volcano jobs #3373

Open johnhe-dev opened 8 months ago

johnhe-dev commented 8 months ago

What happened: We are seeing deadlocked jobs in Volcano queuing. Let's say the cluster has 100 nodes available. Two Volcano jobs are submitted simultaneously. Each job requires 100 Pods, and each Pod must run on a separate node. minAvailable for both jobs is 100.

We observe that both jobs are in the Running state, but each job's ['status']['taskStatusCount']['x']['phase']['Running'] count is less than 100. That means each job has been allocated only a subset of the 100 available nodes. Since minAvailable for both jobs is 100, neither job can ever have all of its minAvailable Pods allocated.

What you expected to happen: Ideally, all 100 nodes would be allocated to one job's PodGroup while the second job stays pending. That would keep cluster resource utilization high.

How to reproduce it (as minimally and precisely as possible): Provision a K8s cluster with 100 available nodes. Create two Volcano jobs simultaneously. Each job requires 100 Pods, each Pod must run on a separate node, and minAvailable for both jobs is 100.
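The reproduction steps above could be expressed as a Volcano Job manifest like the following sketch. All names, labels, and the container image are placeholders, not taken from the reporter's setup; the pod anti-affinity on `kubernetes.io/hostname` is one way to force one Pod per node, and a second manifest identical except for the name and label would be submitted at the same time:

```yaml
# Hypothetical reproduction manifest (names/image are placeholders).
# Submit two of these (job-a, job-b) simultaneously on a cluster
# with exactly 100 free nodes.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-a
spec:
  schedulerName: volcano
  minAvailable: 100          # gang constraint: run only if all 100 Pods fit
  tasks:
    - replicas: 100
      name: worker
      template:
        metadata:
          labels:
            app: job-a
        spec:
          affinity:
            podAntiAffinity: # force at most one of this job's Pods per node
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchLabels:
                      app: job-a
                  topologyKey: kubernetes.io/hostname
          containers:
            - name: worker
              image: busybox
              command: ["sleep", "infinity"]
          restartPolicy: Never
```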

Anything else we need to know?:

The volcano-scheduler.conf is as follows:

```
actions: "enqueue, allocate, backfill"
tiers:
```

Environment:

Monokaix commented 8 months ago

Do you mean the Pods have node affinity set?

Vacant2333 commented 8 months ago

Can you try disabling the overcommit plugin?
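Disabling overcommit means removing it from the `tiers` section of volcano-scheduler.conf. The original issue truncated that section, so the plugin list below is an assumption modeled on a typical default configuration, shown with overcommit omitted and the gang plugin kept in the first tier:

```yaml
# Sketch of a volcano-scheduler.conf without the overcommit plugin.
# The tier contents are assumed (the issue's actual tiers were not shown).
actions: "enqueue, allocate, backfill"
tiers:
  - plugins:
      - name: priority
      - name: gang        # enforces minAvailable (gang scheduling)
      - name: conformance
      # overcommit intentionally omitted here
  - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
```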

noobzzw commented 7 months ago

Hi, do you only have 100 nodes available in your cluster? Or are there other resources available, but the job has affinity or a node selector set? Volcano does not consider affinity or node selectors when moving a PodGroup into the Inqueue state, so if there are other resources in the cluster but your job restricts placement with affinity or a node selector, the situation described above may occur.

johnhe-dev commented 7 months ago

@noobzzw, thanks for commenting on the issue. 1/ Yes, in the error scenario there are only 100 nodes available; the other nodes in the cluster are already running other jobs. 2/ Yes, I use node/pod affinity to ensure that exactly one Pod runs on each node.

To state the issue more clearly: this is about gang scheduling. There are two independent jobs, A and B, in the queue. Each job requests 100 nodes, and minAvailable for each is set to 100, meaning a job can run if and only if it is allocated all 100 nodes. I expected the gang scheduler to respect the minAvailable requirement and allocate all 100 available nodes to one job. But this is not happening: Volcano allocates nodes to both jobs, each job gets only a subset of the free nodes, and both get stuck in the queue.

Maybe my understanding about gang scheduling is wrong.