volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

Distributed training with pytorch on multi-nodes #3795

Open almersawi opened 3 weeks ago

almersawi commented 3 weeks ago

Please describe your problem in detail

I'm trying to start a PyTorch training run using Volcano and its pytorch plugin. I have 2 nodes, each with 8 GPUs. I found that Volcano sets WORLD_SIZE = 2 and RANK = 0 (first pod) / 1 (second pod), but I couldn't find LOCAL_RANK among the environment variables, so I can't target each GPU. My question is: is it possible to use multiple GPUs in each pod, or is it just one GPU per pod? If it is possible, what am I missing in my configuration?
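
For context, the WORLD_SIZE and RANK values described above are injected per pod by the job-level pytorch plugin, which also sets MASTER_ADDR and MASTER_PORT. A minimal sketch of that part of a Volcano Job follows, assuming the standard plugin flags; the job name, minAvailable, and port are placeholders, since the job-level spec is not included in this issue:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-multi-node        # placeholder name
spec:
  minAvailable: 2
  schedulerName: volcano
  plugins:
    # --master/--worker reference the task names below; --port becomes MASTER_PORT
    pytorch: ["--master=master", "--worker=worker", "--port=23456"]
  tasks:
    # ... the master and worker tasks shown below ...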

This is the tasks section of my manifest:

tasks:
    - name: master
      minAvailable: 1
      replicas: 1
      template:
        spec:
          containers:
            - envFrom:
                - configMapRef:
                    name: configmap-8-nodes-pvc-test-master
                - secretRef:
                    name: secrets-8-nodes-pvc-test-master
              image: sample-image
              imagePullPolicy: Always
              name: master
              resources:
                limits:
                  amd.com/gpu: '8'
                  cpu: '32'
                  ephemeral-storage: 100Gi
                  memory: 120Gi
                requests:
                  amd.com/gpu: '8'
                  cpu: '1'
                  ephemeral-storage: 100Gi
                  memory: 120Gi
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
                - mountPath: /home
                  name: jobs-dir
          imagePullSecrets:
            - name: gitlab-registry-credentials
          nodeSelector:
            kubernetes.io/gpu-name: mi250
          restartPolicy: OnFailure
          tolerations:
            - effect: NoSchedule
              key: amd.com/gpu
              operator: Exists
    - name: worker
      minAvailable: 1
      replicas: 1
      template:
        spec:
          containers:
            - envFrom:
                - configMapRef:
                    name: configmap-8-nodes-pvc-test-master
                - secretRef:
                    name: secrets-8-nodes-pvc-test-master
              image: sample-image
              imagePullPolicy: Always
              name: worker
              resources:
                limits:
                  amd.com/gpu: '8'
                  cpu: '32'
                  ephemeral-storage: 100Gi
                  memory: 120Gi
                requests:
                  amd.com/gpu: '8'
                  cpu: '1'
                  ephemeral-storage: 100Gi
                  memory: 120Gi
          imagePullSecrets:
            - name: gitlab-registry-credentials
          nodeSelector:
            kubernetes.io/gpu-name: mi250
          restartPolicy: OnFailure
          tolerations:
            - effect: NoSchedule
              key: amd.com/gpu
              operator: Exists

Any other relevant information

No response

lowang-bh commented 3 weeks ago

Setting amd.com/gpu: '8' means each pod requests 8 GPUs.
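
For anyone landing here: the plugin-level WORLD_SIZE/RANK count pods, not GPU processes, so to use all 8 GPUs in a pod you typically launch one training process per GPU inside each pod and derive LOCAL_RANK yourself (or let a launcher such as torchrun do it). A minimal sketch, assuming the env vars described in the issue (plus MASTER_ADDR/MASTER_PORT from the pytorch plugin) and a ROCm build of PyTorch where torch.cuda maps to the AMD GPUs:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def train(local_rank: int, gpus_per_pod: int):
    pod_rank = int(os.environ["RANK"])        # pod index injected by Volcano (0 or 1 here)
    num_pods = int(os.environ["WORLD_SIZE"])  # number of pods injected by Volcano (2 here)
    world_size = num_pods * gpus_per_pod      # 16 processes in total
    global_rank = pod_rank * gpus_per_pod + local_rank
    torch.cuda.set_device(local_rank)         # pin this process to one of the 8 GPUs
    dist.init_process_group(
        backend="nccl",                       # backed by RCCL on ROCm builds
        init_method="env://",                 # MASTER_ADDR/MASTER_PORT come from the plugin
        world_size=world_size,
        rank=global_rank,
    )
    # ... wrap the model in DistributedDataParallel(device_ids=[local_rank]) and train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    gpus_per_pod = torch.cuda.device_count()  # 8 on these MI250 nodes
    mp.spawn(train, args=(gpus_per_pod,), nprocs=gpus_per_pod)

Equivalently, something like torchrun --nnodes=$WORLD_SIZE --node_rank=$RANK --nproc_per_node=8 --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT train.py lets torchrun compute RANK and LOCAL_RANK for each process instead.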