Open johnhe-dev opened 8 months ago
Do you mean the Pods set node affinity?
Can you try disabling the overcommit plugin?
Hi, do you only have 100 nodes available in your cluster? Are there any other resources available, but the job has been set with affinity or node selector? Volcano does not consider affinity or node selector when in the "inqueue" state, so if there are other resources in the cluster but your job has set affinity or node selector, the situation described above may occur.
@noobzzw, thanks for commenting on the issue. 1/ Yes, in the error scenario, there are only 100 nodes available. Other nodes in the cluster are already running other jobs. 2/ Yes, I use node/pod affinity to ensure that each Pod runs exclusively on its own node.
To state it more clearly, the issue is about gang scheduling. There are two independent jobs, A and B, in the queue. Each job requests 100 nodes. The MinAvailable for each job is set to 100, which means a job can run if and only if it is allocated 100 nodes. I was expecting the gang scheduler to respect the MinAvailable requirement and allocate all 100 available nodes to one job. But this is not happening. Volcano allocates nodes to both jobs. Each job gets a subset of the free nodes and gets stuck in the queue.
Maybe my understanding of gang scheduling is wrong.
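To make the difference concrete, here is a minimal sketch of the two behaviours described above. It is a toy model, not Volcano's actual allocation logic: the `Job` class and both allocator functions are hypothetical names made up for illustration, and nodes are treated as interchangeable slots.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    min_available: int  # pods that must all be placed before the job can progress
    allocated: int = 0


def allocate_interleaved(jobs, free_nodes):
    """Hand out nodes one at a time, round-robin, with no gang check."""
    while free_nodes > 0:
        progressed = False
        for job in jobs:
            if free_nodes > 0 and job.allocated < job.min_available:
                job.allocated += 1
                free_nodes -= 1
                progressed = True
        if not progressed:
            break
    return free_nodes


def allocate_gang(jobs, free_nodes):
    """Admit a job only if its whole min_available fits in the remaining nodes."""
    for job in jobs:
        if job.min_available <= free_nodes:
            job.allocated = job.min_available
            free_nodes -= job.min_available
    return free_nodes


def demo():
    """Run both strategies for two 100-pod jobs on 100 free nodes."""
    a, b = Job("A", 100), Job("B", 100)
    allocate_interleaved([a, b], 100)
    interleaved = (a.allocated, b.allocated)

    a, b = Job("A", 100), Job("B", 100)
    allocate_gang([a, b], 100)
    gang = (a.allocated, b.allocated)
    return interleaved, gang


if __name__ == "__main__":
    print(demo())
```

With the interleaved strategy neither job reaches its MinAvailable of 100, which is the stuck state I am seeing. With the all-or-nothing strategy job A gets all 100 nodes and job B stays pending, which is what I expected.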
What happened: We are seeing deadlocked jobs in the Volcano queue. Let's say the cluster has 100 nodes available. Two Volcano jobs are submitted simultaneously. Each job requires 100 Pods. Each Pod must run on a separate node. The MinAvailable Pods for both jobs is 100.
We observe that both jobs are in the Running state, but each job's ['status']['taskStatusCount']['x']['phase']['Running'] is less than 100. This means each job gets only a subset of the 100 available nodes. As the MinAvailable Pods for both jobs is 100, neither of the two jobs can get all of its MinAvailable Pods allocated.
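For reference, the check can be sketched roughly as follows. It assumes the Job object has already been fetched as a plain dict (for example via the Kubernetes Python client). The helper names and the sample dict are hypothetical. Only the `status.taskStatusCount.<task>.phase.Running` path and `spec.minAvailable` come from the Job object itself.

```python
def count_running(job: dict) -> int:
    """Sum the Running pod count over every task in status.taskStatusCount."""
    task_counts = job.get("status", {}).get("taskStatusCount") or {}
    return sum(t.get("phase", {}).get("Running", 0) for t in task_counts.values())


def is_gang_starved(job: dict) -> bool:
    """True when fewer pods are Running than spec.minAvailable requires."""
    min_available = job.get("spec", {}).get("minAvailable", 0)
    return count_running(job) < min_available


# Hypothetical sample shaped like the status described above.
sample_job = {
    "spec": {"minAvailable": 100},
    "status": {
        "taskStatusCount": {
            "x": {"phase": {"Running": 47, "Pending": 53}},
        }
    },
}

if __name__ == "__main__":
    print(count_running(sample_job), is_gang_starved(sample_job))
```

A job that reports the Running state while `is_gang_starved` returns true is in the partial-allocation situation described here.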
What you expected to happen: Ideally, we want all 100 nodes allocated to one job's PodGroup while the second job stays pending. That would keep cluster resource utilization high.
How to reproduce it (as minimally and precisely as possible): Provision a K8s cluster with 100 available nodes. Create two Volcano jobs simultaneously. Each job requires 100 Pods. Each Pod must run on a separate node. The MinAvailable Pods for both jobs is 100.
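A rough sketch of how the two job manifests for this reproduction could be generated. This is not my exact manifest. The job names, the `gang-repro` label, the `busybox` image, and the use of required pod anti-affinity on `kubernetes.io/hostname` to get one Pod per node are all assumptions for illustration.

```python
import json

# Hypothetical label shared by both jobs so that no two pods share a node.
SHARED_LABELS = {"gang-repro": "true"}


def build_job(name: str, replicas: int = 100) -> dict:
    """Build a Volcano Job whose minAvailable equals its replica count."""
    anti_affinity = {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": SHARED_LABELS},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": "volcano",
            "minAvailable": replicas,  # gang size: all pods or nothing
            "tasks": [
                {
                    "name": "worker",
                    "replicas": replicas,
                    "template": {
                        "metadata": {"labels": dict(SHARED_LABELS)},
                        "spec": {
                            "restartPolicy": "Never",
                            "affinity": anti_affinity,
                            "containers": [
                                {
                                    "name": "main",
                                    "image": "busybox",  # placeholder image
                                    "command": ["sleep", "3600"],
                                }
                            ],
                        },
                    },
                }
            ],
        },
    }


def build_manifest() -> dict:
    """Wrap both jobs in a List so they can be submitted together."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [build_job("job-a"), build_job("job-b")],
    }


if __name__ == "__main__":
    print(json.dumps(build_manifest(), indent=2))
```

The printed JSON can be piped to `kubectl apply -f -` so that both jobs are created in the same request.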
Anything else we need to know?:
Volcano-scheduler.conf is as follows:
actions: "enqueue, allocate, backfill"
tiers:
Environment:
kubectl version: 1.27
uname -a: