volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

Volcano Torchx job scaling issue on GKE with cluster autoscaler #3311

Open · obeyda opened this issue 9 months ago

obeyda commented 9 months ago

We are trying to use Volcano to run Torchx jobs in a GKE cluster, but none of the jobs are triggering a scale-up on our GPU nodepools.

If we manually scale up the targeted GPU nodepool, the jobs get scheduled as desired, but we can't afford to keep the nodes up all the time, so we want the Volcano jobs to be able to trigger the automatic scale-up themselves.
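For context, our understanding (an assumption on our side, based on general Kubernetes/GKE cluster-autoscaler behaviour rather than anything Volcano-specific) is that scale-up is only considered for pending pods that the scheduler has explicitly marked unschedulable, roughly like the sketch below; the message text is invented for the example.

```
# Illustrative only: the kind of pod status the cluster autoscaler reacts to.
# The PodScheduled/Unschedulable condition is standard Kubernetes; the message
# here is made up.
status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: 'False'
      reason: Unschedulable
      message: no nodes with enough nvidia.com/gpu available
```

If the pods Volcano manages never reach that state, or are not created at all while the job is still queued, the nodepool would have no reason to scale up.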

The default Queue:

```
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  generation: 11
  name: default
status:
  allocated:
    cpu: '0'
    memory: '0'
  pending: 1
  reservation: {}
  state: Open
spec:
  capability:
    cpu: '500'
    memory: 600Gi
    nvidia.com/gpu: '40'
  guarantee: {}
  reclaimable: true
  weight: 1
```
The Volcano job:

```
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: myvcjob-jl7rjr9wqc35x
  namespace: myapp-2681
status:
  conditions:
    - lastTransitionTime: '2024-01-23T19:59:03Z'
      status: Pending
    - lastTransitionTime: '2024-01-23T20:00:56Z'
      status: Running
    - lastTransitionTime: '2024-01-23T20:27:44Z'
      status: Failed
  minAvailable: 1
  runningDuration: 42m13.924661549s
  state:
    lastTransitionTime: '2024-01-23T20:27:44Z'
    phase: Failed
  version: 4
spec:
  maxRetry: 3
  minAvailable: 1
  plugins:
    env: []
    svc:
      - '--publish-not-ready-addresses'
  queue: default
  schedulerName: volcano
  tasks:
    - maxRetry: 3
      minAvailable: 1
      name: myvcjob-0
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
          labels:
            app.kubernetes.io/instance: myvcjob-jl7rjr9wqc35x
            app.kubernetes.io/managed-by: torchx.pytorch.org
            app.kubernetes.io/name: app
            beta.kubernetes.io/instance-type: g4-standard-48
            cloud.google.com/gke-gpu: 'true'
            cloud.google.com/gke-spot: 'false'
            node.kubernetes.io/instance-type: g4-standard-48
            nvidia.com/gpu: present
            provisioning-model: on-demand
            sku: L4-Quad
            torchx.pytorch.org/app-name: app
            torchx.pytorch.org/replica-id: '0'
            torchx.pytorch.org/role-index: '0'
            torchx.pytorch.org/role-name: app
            torchx.pytorch.org/version: 0.6.0
            volcano.sh/gpu-memory: '40000'
        spec:
          affinity: {}
          containers:
            - command:
                - bash
                - '-c'
                - >-
                  newrelic-admin run-program torchrun --rdzv_backend c10d
                  --rdzv_endpoint localhost:0 --rdzv_id 'myvcjob-jl7rjr9wqc35x'
                  --nnodes 1 --nproc_per_node 4 --tee 3 --role '' -m
                  myapp.app.components.app --job_id
                  f5b3d845-e111-49bf-92fd-c935e50fa0da
              env:
                - name: TORCHX_TRACKING_EXPERIMENT_NAME
                  value: default-experiment
                - name: LOGLEVEL
                  value: WARNING
                - name: TORCHX_JOB_ID
                  value: kubernetes://torchx/myvcjob-jl7rjr9wqc35x
                - name: TORCHX_RANK0_HOST
                  value: localhost
              image: >-
                myimage:mytag
              name: myvcjob-0
              ports:
                - containerPort: 29500
                  name: c10d
                  protocol: TCP
              resources:
                limits:
                  cpu: '44'
                  memory: 45G
                  nvidia.com/gpu: '4'
                requests:
                  cpu: 43900m
                  memory: 43976M
                  nvidia.com/gpu: '4'
              securityContext: {}
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
          restartPolicy: Never
          tolerations:
            - effect: NoSchedule
              key: sku
              operator: Equal
              value: L4-Quad
            - effect: NoSchedule
              key: cloud.google.com/gke-spot
              operator: Equal
              value: 'false'
            - effect: NoSchedule
              key: nvidia.com/gpu
              operator: Equal
              value: present
          volumes:
            - emptyDir:
                medium: Memory
              name: dshm
```
Scheduler config:

```
actions: "enqueue,allocate,reclaim,backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
    enablePreemptable: false
  - name: conformance
- plugins:
  - name: overcommit
  - name: drf
    enablePreemptable: false
  - name: predicates
    arguments:
      predicate.VGPUEnable: true
  - name: proportion
  - name: nodeorder
  - name: binpack
```
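One thing we are not sure about (an assumption, not something we have verified): with the `enqueue` action plus the `overcommit` plugin, the scheduler may keep the job's PodGroup in a pending phase while the queue has no free GPU capacity, in which case the pods are never created and the cluster autoscaler never sees an unschedulable pod. A minimal sketch of the variation we could try, dropping `enqueue` (and `overcommit`, which as far as we understand only affects enqueueing):

```
# Hypothetical variation of the config above, not verified on our cluster:
# without the enqueue action the pods should be created right away and can be
# reported as unschedulable, which is what the autoscaler keys on.
actions: "allocate,reclaim,backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
    enablePreemptable: false
  - name: conformance
- plugins:
  - name: drf
    enablePreemptable: false
  - name: predicates
    arguments:
      predicate.VGPUEnable: true
  - name: proportion
  - name: nodeorder
  - name: binpack
```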

This is only happening with Volcano jobs; any other pod that we create can trigger the scale-up without issues.
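To show what we mean by the comparison, here is a hypothetical plain pod (name and image are placeholders, same redaction as above) with the same GPU request and tolerations; left to the default scheduler, a pod like this does trigger the scale-up:

```
# Hypothetical comparison pod, not the exact manifest we used: same GPU request
# and tolerations as the Volcano task above, but using the default scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-scaleup-test
  namespace: myapp-2681
spec:
  containers:
    - name: test
      image: myimage:mytag
      command: ["sleep", "3600"]
      resources:
        limits:
          nvidia.com/gpu: '4'
  tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Equal
      value: present
    - effect: NoSchedule
      key: sku
      operator: Equal
      value: L4-Quad
  restartPolicy: Never
```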

Monokaix commented 9 months ago

Hi, can you paste the pod YAML output? I see that the job entered a failed state:

    - lastTransitionTime: '2024-01-23T20:27:44Z'
      status: Failed