volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

Resource Reclaim Between Different Queues #3842

Open barrycheng05 opened 5 days ago

barrycheng05 commented 5 days ago

Please describe your problem in detail

I am trying to test the effect of a queue's deserved setting together with the reclaim action, but job-b remains in the Pending state.

The queue and job YAML configurations were modified based on this [Issue](https://github.com/volcano-sh/volcano/issues/3729).

Here is part of the volcano-scheduler log. Could you please help me understand why the reclaim process is not triggered?

I1126 09:42:58.038748       1 reclaim.go:40] Enter Reclaim ...
I1126 09:42:58.038752       1 reclaim.go:49] There are <2> Jobs and <3> Queues in total for scheduling.
I1126 09:42:58.038757       1 reclaim.go:67] Added Queue <first> for Job <default/job-a-c035b82a-7643-4d96-ad70-476de2489dd6>
I1126 09:42:58.038763       1 capacity.go:255] Queue <first> can not reclaim, deserved <cpu 20000.00, memory 262144000.00>, allocated <cpu 40000.00, memory 262144000.00, pods 5.00>, share <2>
I1126 09:42:58.038775       1 reclaim.go:99] Queue <first> can not reclaim by preempt others, ignore it.
I1126 09:42:58.038783       1 reclaim.go:220] Leaving Reclaim ...
I1126 09:42:58.038790       1 backfill.go:44] Enter Backfill ...
I1126 09:42:58.038796       1 backfill.go:110] Leaving Backfill ...
I1126 09:42:58.038927       1 session.go:214] Queue <first> allocated resource keeps equal, no need to update queue status <map[cpu:{{40000 -3} {<nil>}  DecimalSI} memory:{{262144000 0} {<nil>}  BinarySI} pods:{{5 0} {<nil>}  DecimalSI}]>.
I1126 09:42:58.038966       1 session.go:214] Queue <second> allocated resource keeps equal, no need to update queue status <map[cpu:{{0 -3} {<nil>}  DecimalSI} memory:{{0 0} {<nil>}  BinarySI}]>.
I1126 09:42:58.038975       1 session.go:214] Queue <default> allocated resource keeps equal, no need to update queue status <map[cpu:{{0 -3} {<nil>}  DecimalSI} memory:{{0 0} {<nil>}  BinarySI}]>.
I1126 09:42:58.038982       1 session.go:244] Close Session fc38d2d3-695d-4f12-ad1b-bd1c3b50e5c6
I1126 09:42:58.038992       1 scheduler.go:129] End scheduling ...

Below are the related YAML configurations. If additional information is required, I can provide it.

Thank you.


scheduler-config.yaml

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, reclaim, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: capacity
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system

queue.yaml

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: first
spec:
  reclaimable: true
  deserved:
    cpu: 20
    memory: 2Gi
  capability:
    cpu: 40
    memory: 2Gi
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: second
spec:
  reclaimable: true
  deserved:
    cpu: 20
    memory: 2Gi
  capability:
    cpu: 40
    memory: 2Gi

job-a.yaml

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-a
spec:
  schedulerName: volcano
  queue: first
  minAvailable: 2
  tasks:
    - replicas: 5
      name: "master"
      template:
        metadata:
          annotations:
            volcano.sh/preemptable: "true"
        spec:
          containers:
            - image: nginx:1.14.2
              name: nginx
              resources:
                requests:
                  cpu: "8"
                  memory: "50Mi"
          restartPolicy: OnFailure

job-b.yaml

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-b
spec:
  schedulerName: volcano
  queue: second
  minAvailable: 2
  tasks:
    - replicas: 5
      name: "worker"
      template:
        metadata:
          annotations:
            volcano.sh/preemptable: "true"
        spec:
          containers:
            - image: nginx:1.14.2
              name: nginx
              resources:
                requests:
                  cpu: "8"
                  memory: "50Mi"
          restartPolicy: OnFailure

Current Status

The cluster has approximately 48 cores, and other running Pods are using around 5 cores.
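To spell out the arithmetic behind this setup (a rough sketch using the approximate numbers above, not measured values):

```python
# Rough capacity arithmetic for the cluster state described above.
# All numbers are approximations taken from this report.
cluster_cpu = 48                # total cores in the cluster (approx.)
other_pods_cpu = 5              # cores used by unrelated running pods (approx.)
job_a_cpu = 5 * 8               # job-a: 5 replicas x 8-CPU requests = 40 cores
idle_cpu = cluster_cpu - other_pods_cpu - job_a_cpu  # 3 cores left

# job-b needs minAvailable (2) tasks x 8 CPU to start its gang.
job_b_min_cpu = 2 * 8           # 16 cores

print(idle_cpu)                  # 3
print(job_b_min_cpu)             # 16
print(idle_cpu < job_b_min_cpu)  # True: job-b cannot start unless reclaim frees CPU
```

So job-b can only run if reclaim evicts some of job-a's pods, which is exactly what the test is trying to exercise.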

$ kubectl apply -f queue.yaml
queue.scheduling.volcano.sh/first created
queue.scheduling.volcano.sh/second created

$ kubectl apply -f job-a.yaml
job.batch.volcano.sh/job-a created

$ kubectl apply -f job-b.yaml
job.batch.volcano.sh/job-b created

$ kubectl get po
NAME             READY   STATUS    RESTARTS   AGE
job-a-master-0   1/1     Running   0          25s
job-a-master-1   1/1     Running   0          25s
job-a-master-2   1/1     Running   0          25s
job-a-master-3   1/1     Running   0          25s
job-a-master-4   1/1     Running   0          25s

$ kubectl get vcjob
NAME    STATUS    MINAVAILABLE   RUNNINGS   AGE
job-a   Running   2              5          42s
job-b   Pending   2                         38s

Any other relevant information

No response

PigNatovsky commented 5 days ago

I think there is something more going on; it looks like job-b has other issues. Look at the share of the first queue: it's 2, which means the queue consumes twice as much as it deserves. Have you checked the events?
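The share value in the scheduler log can be reproduced with simple division. This is only a sketch of what the capacity.go log line suggests (share as the ratio of allocated to deserved resources), not the actual plugin code:

```python
# Sketch of the share computation suggested by the capacity.go log line:
#   "deserved <cpu 20000.00, ...>, allocated <cpu 40000.00, ...>, share <2>"
# Values are in milli-CPU, as printed by the scheduler. This mirrors the
# log output; it is not the real plugin implementation.
deserved_cpu_milli = 20000.0    # queue "first": deserved cpu: 20
allocated_cpu_milli = 40000.0   # queue "first": 5 pods x 8 CPU = 40
share = allocated_cpu_milli / deserved_cpu_milli
print(share)  # 2.0 -> the queue holds twice its deserved CPU

# A queue whose allocation already meets or exceeds its deserved amount has
# nothing left to reclaim for itself, hence the log line
# "Queue <first> can not reclaim by preempt others, ignore it."
can_initiate_reclaim = allocated_cpu_milli < deserved_cpu_milli
print(can_initiate_reclaim)  # False
```

Note that in this log it is queue first (the over-served queue) being evaluated as a potential reclaimer; the open question is why queue second never gets that far.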

PigNatovsky commented 5 days ago

I mean that the job is in Pending status: no pod has been scheduled and is waiting for resources. I think it's more related to the controller than to the scheduler.

barrycheng05 commented 5 days ago

By "events", do you mean the events of job-b?

$ kubectl describe vcjob job-b
......
Status:
  Conditions:
    Last Transition Time:  2024-11-26T09:39:23Z
    Status:                Pending
  Min Available:           2
  State:
    Last Transition Time:  2024-11-26T09:39:23Z
    Phase:                 Pending
Events:
  Type     Reason           Age   From                   Message
  ----     ------           ----  ----                   -------
  Warning  PodGroupPending  15h   vc-controller-manager  PodGroup default:job-b unschedule,reason: 2/0 tasks in gang unschedulable: pod group is not ready, 2 minAvailable

After I delete job-a, job-b can be scheduled successfully.

$ kubectl get vcjob
NAME    STATUS    MINAVAILABLE   RUNNINGS   AGE
job-a   Running   2              5          15h
job-b   Pending   2                         15h

$ kubectl delete vcjob job-a
job.batch.volcano.sh "job-a" deleted

$ kubectl get vcjob
NAME    STATUS    MINAVAILABLE   RUNNINGS   AGE
job-b   Pending   2                         15h

$ kubectl get vcjob
NAME    STATUS    MINAVAILABLE   RUNNINGS   AGE
job-b   Running   2              5          15h

By the way, I'm using Volcano Scheduler version 1.9.0. Thanks for your reply.

barrycheng05 commented 4 days ago

I noticed this log later, and it seems the overcommit plugin is rejecting job-b at the enqueue stage, so it never becomes a candidate for reclaim. I had expected that a Pending job would still be kept in consideration rather than be rejected outright. After I removed the overcommit plugin from the scheduler configuration, job-b was allocated normally.

I1127 08:26:56.828424       1 enqueue.go:45] Enter Enqueue ...
I1127 08:26:56.828429       1 enqueue.go:63] Added Queue <second> for Job <default/job-b-7a171232-8367-4d99-b301-233e98264f25>
I1127 08:26:56.828438       1 enqueue.go:74] Added Job <default/job-b-7a171232-8367-4d99-b301-233e98264f25> into Queue <second>
I1127 08:26:56.828442       1 enqueue.go:63] Added Queue <first> for Job <default/job-a-b60497e4-2892-4687-929d-5284e94a8871>
I1127 08:26:56.828449       1 enqueue.go:79] Try to enqueue PodGroup to 1 Queues
I1127 08:26:56.828459       1 overcommit.go:128] Resource in cluster is overused, reject job <default/job-b-7a171232-8367-4d99-b301-233e98264f25> to be inqueue
I1127 08:26:56.828483       1 enqueue.go:104] Leaving Enqueue ...
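For intuition, here is a rough sketch of the budget check that produces the "Resource in cluster is overused" rejection above. It assumes the overcommit plugin's default overcommit-factor of 1.2 and uses the approximate numbers from this report; it is a hypothetical simplification, not the plugin's actual code:

```python
# Illustrative overcommit-style enqueue check (hypothetical simplification).
# Assumption: overcommit-factor defaults to 1.2, i.e. jobs may enqueue until
# requested resources would exceed 1.2x the cluster's CPU capacity.
overcommit_factor = 1.2
cluster_cpu = 48.0                        # approx. cores in this cluster
budget = cluster_cpu * overcommit_factor  # 57.6 cores of "virtual" capacity

used_cpu = 40.0 + 5.0    # job-a (5 x 8 CPU) plus other running pods (approx.)
job_b_min_cpu = 2 * 8.0  # job-b: minAvailable 2 tasks x 8 CPU

enqueue_ok = used_cpu + job_b_min_cpu <= budget
print(enqueue_ok)  # False -> job-b is rejected at enqueue, before reclaim runs
```

Under these assumed numbers, 45 + 16 = 61 cores exceeds the 57.6-core budget, which would explain why job-b never passes the enqueue action and the reclaim action never sees it.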