volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0
4.07k stars 940 forks

reclaim action's evict can not be canceled #3673

Open lowang-bh opened 1 month ago

lowang-bh commented 1 month ago

Description

The reclaim action uses ssn.Evict, which evicts pods directly and cannot be canceled when the eviction turns out not to help.


Steps to reproduce the issue

Use a cluster with 3 CPUs in total and the following scheduler ConfigMap:

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, reclaim"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
  1. kubectl apply -f job-a.yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: job-a
    spec:
      backoffLimit: 3
      completions: 3
      parallelism: 3
      template:
        metadata:
          annotations:
            scheduling.k8s.io/group-name: job-a-pg
            volcano.sh/preemptable: "true"
        spec:
          containers:
          - image: nginx:1.14.2
            imagePullPolicy: IfNotPresent
            name: nginx
            ports:
              - containerPort: 80
            resources:
              requests:
                cpu: 1000m
                memory: 200Mi
              limits:
                cpu: 1000m
                memory: 200Mi
          restartPolicy: Never
          terminationGracePeriodSeconds: 1
          schedulerName: volcano
    ---
    apiVersion: scheduling.volcano.sh/v1beta1
    kind: PodGroup
    metadata:
      annotations:
        scheduling.k8s.io/reclaimable: "true"
      name: job-a-pg
      namespace: default
    spec:
      minMember: 1
      queue: queue-a
  2. kubectl apply -f job-b.yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: job-b
    spec:
      backoffLimit: 2
      completions: 2
      parallelism: 2
      template:
        metadata:
          annotations:
            scheduling.k8s.io/group-name: job-b-pg
            volcano.sh/preemptable: "true"
        spec:
          containers:
          - image: nginx:1.14.2
            imagePullPolicy: IfNotPresent
            name: nginx
            ports:
              - containerPort: 80
            resources:
              requests:
                cpu: 2000m
                memory: 200Mi
              limits:
                cpu: 2000m
                memory: 200Mi
          restartPolicy: Never
          terminationGracePeriodSeconds: 1
          schedulerName: volcano
    ---
    apiVersion: scheduling.volcano.sh/v1beta1
    kind: PodGroup
    metadata:
      annotations:
        scheduling.k8s.io/reclaimable: "true"
      name: job-b-pg
      namespace: default
    spec:
      minMember: 2
      queue: queue-b
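
Reading the two manifests against the 3-CPU cluster, the arithmetic behind the reproduction can be sketched as follows. This is only a back-of-the-envelope check: the helper names `gangDemandMilli` and `gangFits` are made up for illustration, and queue shares and plugin decisions are not modeled.

```go
package main

import "fmt"

// gangDemandMilli returns the CPU (in millicores) a job needs before its
// gang is satisfied: minMember pods, each requesting perPodMilli.
// Hypothetical helper, not part of Volcano.
func gangDemandMilli(minMember, perPodMilli int) int {
	return minMember * perPodMilli
}

// gangFits reports whether the gang could fit even if the whole cluster
// were free. Hypothetical helper, not part of Volcano.
func gangFits(clusterMilli, minMember, perPodMilli int) bool {
	return gangDemandMilli(minMember, perPodMilli) <= clusterMilli
}

func main() {
	const clusterMilli = 3000 // "cluster with cpu = 3"

	jobARunning := 3 * 1000 // job-a: parallelism 3, 1000m per pod

	fmt.Println("free CPU once job-a is running (m):", clusterMilli-jobARunning)
	fmt.Println("job-b gang demand (m):", gangDemandMilli(2, 2000)) // minMember 2, 2000m per pod
	fmt.Println("job-b gang fits after evicting all of job-a:", gangFits(clusterMilli, 2, 2000))
}
```

Under these numbers job-b's gang needs 4000m, more than the cluster has, so evicting job-a pods on its behalf cannot make job-b runnable.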

Describe the results you received and expected

The reclaim action should use Statement.Evict so that the eviction can be canceled with Statement.Discard.
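
The difference between a direct eviction and a transactional one can be illustrated with a toy model. The sketch below is not Volcano's code: `Task`, `Statement`, and `reclaim` are hypothetical stand-ins that only borrow the Evict/Commit/Discard names from the issue, and the "enough CPU freed" check is an assumed simplification of whatever condition the scheduler would actually test.

```go
package main

import "fmt"

// Task is a hypothetical stand-in for a running pod.
type Task struct {
	Name    string
	Running bool
}

// Statement is a toy transaction: evictions are recorded and become
// final only on Commit.
type Statement struct {
	evicted []*Task
}

// Evict tentatively marks the task as not running and records it.
func (s *Statement) Evict(t *Task) {
	t.Running = false
	s.evicted = append(s.evicted, t)
}

// Discard undoes every recorded eviction, newest first.
func (s *Statement) Discard() {
	for i := len(s.evicted) - 1; i >= 0; i-- {
		s.evicted[i].Running = true
	}
	s.evicted = nil
}

// Commit finalizes the evictions and returns the names of the tasks
// that a real scheduler would now actually evict.
func (s *Statement) Commit() []string {
	names := make([]string, 0, len(s.evicted))
	for _, t := range s.evicted {
		names = append(names, t.Name)
	}
	s.evicted = nil
	return names
}

// reclaim tentatively evicts victims until neededMilli is freed. If the
// victims cannot free enough, it discards and returns nil.
func reclaim(victims []*Task, perVictimMilli, neededMilli int) []string {
	stmt := &Statement{}
	freed := 0
	for _, v := range victims {
		if freed >= neededMilli {
			break
		}
		stmt.Evict(v)
		freed += perVictimMilli
	}
	if freed < neededMilli {
		stmt.Discard() // eviction would not help, so roll it back
		return nil
	}
	return stmt.Commit()
}

func main() {
	victims := []*Task{
		{Name: "job-a-0", Running: true},
		{Name: "job-a-1", Running: true},
		{Name: "job-a-2", Running: true},
	}
	// The reclaimer needs 4000m but the victims can free only 3000m.
	fmt.Println("committed evictions:", reclaim(victims, 1000, 4000))
	for _, v := range victims {
		fmt.Println(v.Name, "still running:", v.Running)
	}
}
```

With a direct eviction there is no recorded list to roll back, which is the behavior this issue reports for the reclaim action.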

What version of Volcano are you using?

master

Any other relevant information

No response

Monokaix commented 1 month ago

You mean the gang can't be met? I think the gang plugin should return no victims if it's not met.

lowang-bh commented 1 month ago

We need to use statement.Evict instead of ssn.Evict. ssn doesn't support transactions.

lowang-bh commented 4 weeks ago

I think the gang plugin should return no victims if it's not met.

That applies to the victim job, which returns nil if its gang cannot be met. Here it is the reclaimer job's gang that is not met, so the eviction should be reverted.