volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

Volcano combines Launcher and Worker Resource Limits for Fitting #2101

Closed. vincent-du2020 closed this issue 2 years ago

vincent-du2020 commented 2 years ago

What happened: We ran an MPIJob like the one below. Note that the Launcher and Worker specify different resources (the Launcher sets "requests", the Worker sets "limits"):

apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: forty-1
  namespace: mpi-test
spec:
  slotsPerWorker: 4
  cleanPodPolicy: Running
  backoffLimit: 0
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
    ...
              resources:
                requests:
                  cpu: "100m"
    Worker:
      replicas: 1
      template:
        spec:
   ...
              resources:
                limits:
                  My-Resource: 4
                  hugepages-2Mi: "1800Mi"
                  cpu: "108"
              volumeMounts:
...

The Launcher Pod always stays in the "Pending" state because the resource fit fails: "My-Resource" exists only on the Nodes tainted for Worker Pods.

These are the logs from the 'volcano-scheduler' Pod:

I0316 04:26:28.036649       1 allocate.go:241] Binding Task <mpi-test/forty-1-launcher--1-rsd4s> to node <a1>
I0316 04:26:28.036661       1 statement.go:263] After allocated Task <helm-mpi-test/forty-1-launcher--1-rsd4s> to Node <a1>: idle <cpu 110600.00, memory 804966006784.00, hugepages-1Gi 0.00, hugepages-2Mi 3510632448000.00>, used <cpu 1200.00, memory 1095872512.00>, releasing <cpu 0.00, memory 0.00>
I0316 04:26:28.036707       1 statement.go:281] Allocating operations ...
I0316 04:26:28.036717       1 proportion.go:299] Queue <default>: deserved <cpu 108100.00, memory 536870912.00, hugepages-2Mi 1887436800000.00, My-Resource 4000.00>, allocated <cpu 100.00, memory 536870912.00>, share <1>, underUsedResName [cpu habana.ai/gaudi hugepages-2Mi]
I0316 04:26:28.036736       1 allocate.go:212] There are <3> nodes for Job <helm-mpi-test/forty-1>
I0316 04:26:28.036751       1 predicate_helper.go:73] Predicates failed for task <helm-mpi-test/forty-1-worker-0> on node <a1u39m0>: task helm-mpi-test/forty-1-worker-0 on node a1 fit failed: node(s) resource fit failed
I0316 04:26:28.036761       1 statement.go:347] Discarding operations ...

Judging by the "deserved" values logged at "proportion.go:299", the Launcher and Worker requests appear to be combined like this:

resource-request-from-worker * 1000 + resource-request-from-launcher
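The arithmetic can be checked against the logged "deserved" values with a small sketch. This is only a reconstruction from the numbers in this issue, not Volcano's code: the factor of 1000 (milli-units) is inferred from the log, and `combined_milli` and `mebibytes` are hypothetical helpers.

```python
MILLI = 1000  # assumed scale factor, inferred from the logged values


def mebibytes(n):
    """Convert a Mi quantity to bytes."""
    return n * 1024 * 1024


# Quantities copied from the MPIJob spec above.
launcher = {"cpu": 0.1}  # cpu: "100m"
worker = {
    "cpu": 108,                        # cpu: "108"
    "hugepages-2Mi": mebibytes(1800),  # hugepages-2Mi: "1800Mi"
    "My-Resource": 4,
}


def combined_milli(*pods):
    """Sum every resource across the given pods, expressed in milli-units."""
    total = {}
    for pod in pods:
        for name, quantity in pod.items():
            total[name] = total.get(name, 0) + round(quantity * MILLI)
    return total


deserved = combined_milli(launcher, worker)
print(deserved)
# {'cpu': 108100, 'hugepages-2Mi': 1887436800000, 'My-Resource': 4000}
```

These three totals match the cpu, hugepages-2Mi and My-Resource entries in the "deserved" field of the Queue log line. Memory is left out because neither spec excerpt shows a memory request.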

What you expected to happen: The Launcher and Worker Pods should be fitted against their own, separate resource requests. On another cluster running an older version of Volcano (v1.1.2), this issue could not be reproduced.
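To make the expectation concrete, here is a toy model of a per-task fit check compared with a fit check on the combined request. It is a sketch, not Volcano's predicate logic: `fits` and `merge` are hypothetical helpers, the idle values for node a1 are copied from the statement.go:263 log line, and the task requests are expressed in the same milli-units the logs appear to use.

```python
# Idle resources of node a1, taken from the statement.go:263 log line.
node_a1_idle = {"cpu": 110600, "hugepages-2Mi": 3510632448000}

# Requests in the units the scheduler logs appear to use (assumption).
launcher = {"cpu": 100}
worker = {"cpu": 108000, "hugepages-2Mi": 1887436800000, "My-Resource": 4000}


def fits(request, idle):
    """A request fits if every resource it names is available in full."""
    return all(qty <= idle.get(name, 0) for name, qty in request.items())


def merge(*requests):
    """Sum several requests into a single combined request."""
    total = {}
    for request in requests:
        for name, qty in request.items():
            total[name] = total.get(name, 0) + qty
    return total


print(fits(launcher, node_a1_idle))                 # True
print(fits(worker, node_a1_idle))                   # False: a1 has no My-Resource
print(fits(merge(launcher, worker), node_a1_idle))  # False
```

In this model the Launcher fits on a1 by itself, but it cannot fit once the Worker's "My-Resource" request is counted against it, which is the behavior reported here.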

How to reproduce it (as minimally and precisely as possible): Set different resource requests in the Launcher and Worker specs. The resource request from the Worker gets applied to the Launcher Pod as well.

Anything else we need to know?: The actual hardware resource name is replaced with "My-Resource" here. When we removed the "My-Resource" request from the Worker, both the Launcher and the Worker could be spawned on the node dedicated to the Launcher. Since that node does not have "My-Resource", the actual jobs on the Worker Pod would error out.

Environment:

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] commented 2 years ago

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗