Open barrycheng05 opened 5 days ago
I think there is something more going on here; it looks like job 2 has other issues. Look at the share of the first queue: it is 2, which means the queue consumes twice as much as it deserves. Have you checked the events?
I mean that the job is in Pending status, and there is no pod scheduled and waiting for resources. I think this is more related to the controller than to the scheduler.
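To make the "share is 2" remark concrete, here is a minimal sketch of how a queue share can be read as the ratio of allocated to deserved resources. This is a simplified model, not Volcano's actual code; the function name `queue_share` and the CPU numbers are hypothetical and only illustrate the arithmetic.

```python
def queue_share(allocated, deserved):
    """Return the largest allocated/deserved ratio across resource names.

    Simplified model (an assumption, not the scheduler's real code):
    a share above 1.0 means the queue holds more than it deserves.
    """
    ratios = []
    for name, want in deserved.items():
        if want > 0:
            ratios.append(allocated.get(name, 0.0) / want)
    return max(ratios, default=0.0)


# Hypothetical numbers: the first queue holds 20 cores but deserves 10.
first_queue_share = queue_share({"cpu": 20.0}, {"cpu": 10.0})
print(first_queue_share)
```

Under this reading, a share of 2 for the first queue means it currently holds twice its deserved amount, which is the situation the reclaim action would be expected to correct.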
Is the "Event" referring to job-b?
$ kubectl describe vcjob job-b
......
Status:
Conditions:
Last Transition Time: 2024-11-26T09:39:23Z
Status: Pending
Min Available: 2
State:
Last Transition Time: 2024-11-26T09:39:23Z
Phase: Pending
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning PodGroupPending 15h vc-controller-manager PodGroup default:job-b unschedule,reason: 2/0 tasks in gang unschedulable: pod group is not ready, 2 minAvailable
When I remove job-a, job-b can be successfully created.
$ kubectl get vcjob
NAME STATUS MINAVAILABLE RUNNINGS AGE
job-a Running 2 5 15h
job-b Pending 2 15h
$ kubectl delete vcjob job-a
job.batch.volcano.sh "job-a" deleted
$ kubectl get vcjob
NAME STATUS MINAVAILABLE RUNNINGS AGE
job-b Pending 2 15h
$ kubectl get vcjob
NAME STATUS MINAVAILABLE RUNNINGS AGE
job-b Running 2 5 15h
By the way, I'm using Volcano Scheduler version 1.9.0. Thanks for your reply.
I noticed this log later, and it seems like the overcommit plugin is kicking job-b out of the queue. The expectation was that once a job enters the Pending state, it shouldn’t be considered for preemption. After I removed the overcommit plugin, job-b was able to allocate normally.
I1127 08:26:56.828424 1 enqueue.go:45] Enter Enqueue ...
I1127 08:26:56.828429 1 enqueue.go:63] Added Queue <second> for Job <default/job-b-7a171232-8367-4d99-b301-233e98264f25>
I1127 08:26:56.828438 1 enqueue.go:74] Added Job <default/job-b-7a171232-8367-4d99-b301-233e98264f25> into Queue <second>
I1127 08:26:56.828442 1 enqueue.go:63] Added Queue <first> for Job <default/job-a-b60497e4-2892-4687-929d-5284e94a8871>
I1127 08:26:56.828449 1 enqueue.go:79] Try to enqueue PodGroup to 1 Queues
I1127 08:26:56.828459 1 overcommit.go:128] Resource in cluster is overused, reject job <default/job-b-7a171232-8367-4d99-b301-233e98264f25> to be inqueue
I1127 08:26:56.828483 1 enqueue.go:104] Leaving Enqueue ...
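The "Resource in cluster is overused" rejection in the log above can be pictured with a rough sketch of an overcommit-style admission check. This is a simplified model under stated assumptions, not the plugin's real implementation: the function name `can_enqueue`, the 1.2 factor, the 45-core usage, and the 20-core job minimum are all hypothetical. Only the 48-core cluster size and the roughly 5 cores used by other Pods come from this issue.

```python
def can_enqueue(total, used, inqueue, job_min, factor=1.2):
    """Simplified overcommit-style check (an assumption, not Volcano code).

    total:   cluster capacity in cores
    used:    cores already requested by running pods
    inqueue: cores reserved by jobs already admitted to the queue
    job_min: minimum cores the candidate job needs
    factor:  overcommit factor applied to the cluster capacity
    """
    idle = total * factor - used
    return inqueue + job_min <= idle


# While job-a is running (hypothetical usage of 45 cores), job-b is rejected.
print(can_enqueue(total=48, used=45, inqueue=0, job_min=20))

# After job-a is deleted, only about 5 cores remain in use, so job-b is admitted.
print(can_enqueue(total=48, used=5, inqueue=0, job_min=20))
```

If the real check behaves anything like this, job-b would never reach the Inqueue state while job-a holds most of the cluster, so the reclaim action would have nothing to act on. That would be consistent with job-b starting once job-a is deleted or the overcommit plugin is removed.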
Please describe your problem in detail
I am trying to test the effect of queue deserved with the reclaim action, but job-b remains in the Pending state. The queue and job YAML configurations were modified based on this [Issue](https://github.com/volcano-sh/volcano/issues/3729).
Here is part of the volcano-scheduler log. Could you please help me understand why the reclaim process is not triggered? Below are the related YAML configurations. If additional information is required, I can provide it.
Thank you.
scheduler-config.yaml
queue.yaml
job-a.yaml
job-b.yaml
Current Status
The cluster has approximately 48 cores, and other running Pods are using around 5 cores.
Any other relevant information
No response