Open wangpf09 opened 1 year ago
I have Kubeflow deployed now, but I run into a problem when running the official MNIST example. How should I solve it? The YAML of the PyTorchJob is as follows:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-mnist-ddp-gpu
  namespace: kubeflow-user-example-com
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - image: gcr.io/kubeflow-examples/pytorch-mnist-ddp-gpu
              name: pytorch
              resources:
                limits:
                  cpu: '1'
                  memory: 4Gi
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /mnt/kubeflow-gcfs
                  name: kubeflow-gcfs
          volumes:
            - name: kubeflow-gcfs
              persistentVolumeClaim:
                claimName: kubeflow-gcfs
                readOnly: false
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - image: gcr.io/kubeflow-examples/pytorch-mnist-ddp-gpu
              name: pytorch
              resources:
                limits:
                  cpu: '1'
                  memory: 4Gi
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /mnt/kubeflow-gcfs
                  name: kubeflow-gcfs
          volumes:
            - name: kubeflow-gcfs
              persistentVolumeClaim:
                claimName: kubeflow-gcfs
                readOnly: false