kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Pod priority was assigned 0 even though the priorityClassName of the PodGroup had been set #592

Closed Robin7831 closed 10 months ago

Robin7831 commented 10 months ago

Hi everybody, I've been testing mpi-operator v0.4.0 recently and found that MPIJobs can be scheduled by Volcano, so I modified the mpi-operator deployment and my MPIJob spec following https://github.com/kubeflow/website/pull/3453/files. However, although the PodGroup was created successfully and its priorityClassName was assigned correctly, the pods' priority is 0.

  1. Create the PriorityClass:
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: high
    value: 1000
    globalDefault: false
    description: "A high-priority class for important Pods."
    preemptionPolicy: PreemptLowerPriority
  2. Check the PriorityClass:
    NAME                      VALUE        GLOBAL-DEFAULT   AGE
    high                      1000         false            5h28m
    system-cluster-critical   2000000000   false            510d
    system-node-critical      2000001000   false            510d
  3. Check the mpi-operator deployment:
    spec:
      containers:
      - args:
        - --gang-scheduling=volcano
        - -alsologtostderr
        - --lock-namespace=mpi-operator
        image: myharbor/common/mpi-operator:0.4.0
  4. Create the MPIJob:
    apiVersion: kubeflow.org/v2beta1
    kind: MPIJob
    metadata:
      name: mpitest-helloworld
      namespace: mpi-operator
    spec:
      slotsPerWorker: 2
      runPolicy:
        cleanPodPolicy: Running
        schedulingPolicy:
          minAvailable: 1
          minResources:
            cpu: "4"
            memory: 16Gi
          priorityClass: high
          scheduleTimeoutSeconds: 300
      mpiReplicaSpecs:
        Launcher:
          replicas: 1
          template:
            spec:
              containers:
              - image: myharbor/common/mpi-base:testv0
                imagePullPolicy: Always
                name: hellompi-launcher
                command:
                - sleep
                - infinity
        Worker:
          replicas: 2
          template:
            spec:
              containers:
              - image: myharbor/common/mpi-base:testv0
                imagePullPolicy: Always
                name: hellompi-worker
                resources:
                  requests:
                    cpu: "2"
                    memory: 8Gi
                  limits:
                    cpu: "2"
                    memory: 8Gi
  5. Check the PodGroup:
    ---
    apiVersion: scheduling.volcano.sh/v1beta1
    kind: PodGroup
    metadata:
      name: mpitest-helloworld
      namespace: mpi-operator
      ownerReferences:
      - apiVersion: kubeflow.org/v2beta1
        blockOwnerDeletion: true
        controller: true
        kind: MPIJob
        name: mpitest-helloworld
        uid: 23f17ad5-0f50-43cf-9a4d-0795a6cacee8
      resourceVersion: '115433952'
    spec:
      minMember: 1
      minResources:
        cpu: '4'
        memory: 16Gi
        nvidia.com/gpu: '1'
      priorityClassName: high
    status:
      conditions:
      - lastTransitionTime: '2023-09-07T07:04:35Z'
        reason: tasks in gang are ready to be scheduled
        status: 'True'
        transitionID: fefde024-6bc4-418a-b3b1-6b571beea9a3
        type: Scheduled
      phase: Running
      running: 2
  6. However, the priority of the pods is 0:
    [root@k8s-master1 ]# kubectl describe po -n mpi-operator mpitest-helloworld-launcher-59bqp
    Name:         mpitest-helloworld-launcher-59bqp
    Namespace:    mpi-operator
    Priority:     0
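
To make the mismatch explicit, the two objects can be compared directly (a quick check assuming the resource names above; podgroups.scheduling.volcano.sh is the Volcano CRD's full name):

    # The PodGroup did receive the class from .spec.runPolicy.schedulingPolicy.priorityClass
    kubectl get podgroups.scheduling.volcano.sh -n mpi-operator mpitest-helloworld \
      -o jsonpath='{.spec.priorityClassName}'
    # -> high

    # The launcher pod's spec has no priorityClassName, so its priority defaults to 0
    kubectl get po -n mpi-operator mpitest-helloworld-launcher-59bqp \
      -o jsonpath='{.spec.priority}'
    # -> 0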
tenzen-y commented 10 months ago

The mpi-operator passes the .spec.runPolicy.schedulingPolicy.priorityClass of the MPIJob only to the PodGroup resource and does not propagate it to the pods, so this behavior is expected. If you want the pods to have that priority, you must set a priorityClassName directly in the pod templates.
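
For example, a minimal sketch of the fix, reusing the job from above (priorityClassName is the standard Kubernetes pod-spec field, so it goes into each replica's pod template; unrelated fields are omitted):

    apiVersion: kubeflow.org/v2beta1
    kind: MPIJob
    metadata:
      name: mpitest-helloworld
      namespace: mpi-operator
    spec:
      mpiReplicaSpecs:
        Launcher:
          replicas: 1
          template:
            spec:
              priorityClassName: high   # set directly on the pod template
              containers:
              - name: hellompi-launcher
                image: myharbor/common/mpi-base:testv0
        Worker:
          replicas: 2
          template:
            spec:
              priorityClassName: high   # resolved by the API server to priority 1000
              containers:
              - name: hellompi-worker
                image: myharbor/common/mpi-base:testv0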

Robin7831 commented 10 months ago

Thanks a lot! I did miss that point, and I've just managed to fix it by setting priorityClassName directly on the pod templates.
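
After recreating the job with the field set, the resolved priority should show up on the pods (a quick check, assuming the names above):

    kubectl get po -n mpi-operator \
      -o custom-columns=NAME:.metadata.name,CLASS:.spec.priorityClassName,PRIORITY:.spec.priority
    # the mpitest-helloworld pods should now report CLASS=high and PRIORITY=1000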