kubeflow / examples

A repository to host extended examples and tutorials
Apache License 2.0
1.39k stars, 751 forks

how to run pytorch mnist ddp #1040

Open wangpf09 opened 1 year ago

wangpf09 commented 1 year ago

I have Kubeflow deployed, but I run into a problem when running the official MNIST example. How should I solve it? The YAML of the PyTorchJob is as follows:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-mnist-ddp-gpu
  namespace: kubeflow-user-example-com
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - image: gcr.io/kubeflow-examples/pytorch-mnist-ddp-gpu
              name: pytorch
              resources:
                limits:
                  cpu: '1'
                  memory: 4Gi
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /mnt/kubeflow-gcfs
                  name: kubeflow-gcfs
          volumes:
            - name: kubeflow-gcfs
              persistentVolumeClaim:
                claimName: kubeflow-gcfs
                readOnly: false
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - image: gcr.io/kubeflow-examples/pytorch-mnist-ddp-gpu
              name: pytorch
              resources:
                limits:
                  cpu: '1'
                  memory: 4Gi
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /mnt/kubeflow-gcfs
                  name: kubeflow-gcfs
          volumes:
            - name: kubeflow-gcfs
              persistentVolumeClaim:
                claimName: kubeflow-gcfs
                readOnly: false
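
For reference, the total resources this manifest asks the cluster for can be tallied with a short sketch. The replica counts and limits are copied from the YAML above. The dict layout and the `total_requests` helper are purely illustrative and not part of any Kubeflow API.

```python
# Replica counts and per-container limits copied from the manifest above.
# The dict layout and function name are illustrative, not a Kubeflow API.
REPLICA_SPECS = {
    "Master": {"replicas": 1, "cpu": 1, "memory_gi": 4, "gpu": 1},
    "Worker": {"replicas": 2, "cpu": 1, "memory_gi": 4, "gpu": 1},
}


def total_requests(specs):
    """Sum each resource across all replicas of every replica type."""
    totals = {"pods": 0, "cpu": 0, "memory_gi": 0, "gpu": 0}
    for spec in specs.values():
        n = spec["replicas"]
        totals["pods"] += n
        for key in ("cpu", "memory_gi", "gpu"):
            totals[key] += n * spec[key]
    return totals


TOTALS = total_requests(REPLICA_SPECS)
print(TOTALS)
```

Under these numbers the job needs three pods with one GPU each. If the cluster has fewer than three allocatable GPUs, or if the `kubeflow-gcfs` PersistentVolumeClaim does not exist in the `kubeflow-user-example-com` namespace, some pods would likely stay in `Pending`. Both are worth checking first.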
