kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

How can I run a PyTorch job with all my GPU resources? #304

Closed. lwj1980s closed this issue 3 years ago.

lwj1980s commented 3 years ago

My master node does not have an NVIDIA GPU; my worker node has 4 GTX 1080 Ti cards. I want to use all of my GPUs, so I wrote resources: limits: nvidia.com/gpu: 4 in my_deploy.yaml, but I got the warning "3 Insufficient nvidia.com/gpu". How can I use all my GPUs during training?

Environment: Ubuntu 18.04 (tty), Kubernetes 1.14.0, Kubeflow 1.0.2

Here is my my-deploy.yaml file:

apiVersion: "kubeflow.org/v1" kind: "PyTorchJob" metadata: name: "pytorch-dist-mnist-nccl" spec: pytorchReplicaSpecs: Master: replicas: 1 restartPolicy: OnFailure template: metadata: annotations: sidecar.istio.io/inject: "false" spec: containers:

Here is part of the log:

Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  43s (x3 over 3m39s)  default-scheduler  0/3 nodes are available: 3 Insufficient nvidia.com/gpu.
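For reference, the nvidia.com/gpu: 4 limit described above presumably sat on the single worker replica's container, roughly like the sketch below; the pasted manifest is truncated before the resources section, so the exact placement is an assumption:

# Assumed fragment of spec.pytorchReplicaSpecs in the original my_deploy.yaml:
# one worker pod asking for all 4 GPUs at once.
Worker:
  replicas: 1
  restartPolicy: OnFailure
  template:
    spec:
      containers:
        - name: pytorch
          resources:
            limits:
              nvidia.com/gpu: 4   # this single pod only fits on a node with 4 free GPUs

A pod shaped like this can only be scheduled onto a node that has 4 allocatable and currently free GPUs, which is what the FailedScheduling event above is complaining about.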

issue-label-bot[bot] commented 3 years ago

Issue-Label Bot is automatically applying the labels:

Label          Probability
kind/question  0.74

Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.

lwj1980s commented 3 years ago

I modified the YAML file as below. All 4 GPUs now work in sync, but I am not sure whether this is correct. Does this actually perform distributed training?

apiVersion: "kubeflow.org/v1" kind: "PyTorchJob" metadata: name: "pytorch-dist-mnist-nccl" spec: pytorchReplicaSpecs:

Master:
  replicas: 1
  restartPolicy: OnFailure
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      containers:
        - name: pytorch
          image: 192.168.0.156:30002/library/kubeflow-mnist-test:with-data
          imagePullPolicy: IfNotPresent
          args: ["--backend", "nccl"]
          resources:
            limits:
              nvidia.com/gpu: 1

Worker:
  replicas: 3
  restartPolicy: OnFailure
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      containers:
        - name: pytorch
          image: 192.168.0.156:30002/library/kubeflow-mnist-test:with-data
          imagePullPolicy: IfNotPresent
          args: ["--backend", "nccl"]
          resources:
            limits:
              nvidia.com/gpu: 1
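For what it's worth, with this spec the operator creates 4 pods (1 master + 3 workers), each limited to one GPU, and wires them into a single process group by injecting the usual torch.distributed connection variables into every replica's container. A rough sketch of that injected environment for one of the workers is below; the variable names come from pytorch-operator, but the concrete values are assumptions for this particular job:

# Illustrative only: roughly what pytorch-operator injects into a worker's
# container so the training script can call dist.init_process_group() via env://.
env:
  - name: MASTER_ADDR              # points at the master replica's service
    value: pytorch-dist-mnist-nccl-master-0
  - name: MASTER_PORT
    value: "23456"                 # assumed default; check the generated pod spec
  - name: WORLD_SIZE               # 1 master + 3 workers = 4 processes
    value: "4"
  - name: RANK
    value: "1"                     # 0 on the master, 1-3 on the workers

So provided the training script in the image initializes torch.distributed from these variables (the stock kubeflow mnist example does), this should be genuine data-parallel distributed training with one process per GPU.
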
Jeffwan commented 3 years ago

"encountered warning of 3 Insufficient nvidia.com/gpu,"

Do you have all 4 of your GPUs on one node? Since you were using a single worker before, a node has to have 4 free GPUs to be eligible to schedule that pod.

lwj1980s commented 3 years ago

"encountered warning of 3 Insufficient nvidia.com/gpu,"

Do you have all 4 of your GPUs on one node? Since you were using a single worker before, a node has to have 4 free GPUs to be eligible to schedule that pod.

Yes, I have all 4 GPUs on one node. I think I have figured out why it failed. Thank you very much.