Closed lwj1980s closed 3 years ago
Issue-Label Bot is automatically applying the labels:
| Label | Probability |
| --- | --- |
| kind/question | 0.74 |
I modified the YAML file as below, and the 4 GPUs work in sync, but I am not sure whether this is correct: does this achieve distributed training?
```yaml
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-nccl"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: 192.168.0.156:30002/library/kubeflow-mnist-test:with-data
              imagePullPolicy: IfNotPresent
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: 192.168.0.156:30002/library/kubeflow-mnist-test:with-data
              imagePullPolicy: IfNotPresent
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
```
"encountered warning of 3 Insufficient nvidia.com/gpu,"
Do you have all your 4 GPUs on one node? Since you use 1 worker before, node has to have 4 GPUs to be qualified to deploy the pod
"encountered warning of 3 Insufficient nvidia.com/gpu,"
Do you have all your 4 GPUs on one node? Since you use 1 worker before, node has to have 4 GPUs to be qualified to deploy the pod
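To see why a node "has to have 4 GPUs", it helps to look at the scheduler's GPU accounting. The following is a minimal, hypothetical sketch of that arithmetic (first-fit placement; the node capacities below are illustrative, not taken from this cluster, and the real kube-scheduler also checks CPU, memory, taints, etc.):

```python
def schedulable(pod_gpu_requests, node_gpu_free):
    """Greedily place each pod on the first node with enough free GPUs.

    Returns True if every pod can be placed. This is a simplification of
    kube-scheduler's resource filtering, not the actual algorithm.
    """
    free = list(node_gpu_free)
    for req in pod_gpu_requests:
        for i, avail in enumerate(free):
            if avail >= req:
                free[i] -= req
                break
        else:
            # No node has enough free GPUs: this is the situation behind
            # the "Insufficient nvidia.com/gpu" scheduling warning.
            return False
    return True

# 1 master + 3 workers requesting 1 GPU each, on a cluster whose only
# GPU node has 4 GPUs: all four pods fit on that node.
print(schedulable([1, 1, 1, 1], [0, 0, 4]))  # True

# A single pod requesting 4 GPUs also fits on that node:
print(schedulable([4], [0, 0, 4]))  # True

# But two pods each requesting 4 GPUs cannot both be placed:
print(schedulable([4, 4], [0, 0, 4]))  # False
```

The takeaway: `nvidia.com/gpu` limits are per pod, so a single pod requesting 4 GPUs can only land on a node with at least 4 free GPUs.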
Yes, I have all 4 GPUs on one node. I think I now understand why it failed. Thank you very much.
My master node does not have an NVIDIA GPU; my worker node has 4 GTX 1080 Ti cards. I want to use all my GPUs, so I wrote

```yaml
resources:
  limits:
    nvidia.com/gpu: 4
```

in my_deploy.yaml, but I encountered the warning `3 Insufficient nvidia.com/gpu`. So, how can I use all my GPUs in the training process?

Environment: Ubuntu 18.04 tty, k8s 1.14.0, Kubeflow 1.0.2
here is my-deploy.yaml file:
```yaml
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-nccl"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
```
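Based on the working configuration earlier in this thread, the way to use all four GPUs on the single GPU node is to request one GPU per replica (1 master + 3 workers) rather than four GPUs in one pod. A sketch of the relevant `resources` section, mirroring that earlier config (replica counts shown are assumptions taken from it):

```yaml
# Per replica: 1 master + 3 workers, each requesting 1 GPU,
# so the 4 pods together consume the node's 4 GPUs.
Worker:
  replicas: 3
  template:
    spec:
      containers:
        - name: pytorch
          resources:
            limits:
              nvidia.com/gpu: 1
```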
here is part of the log:
```
Events:
  Type     Reason            Age                  From               Message
  Warning  FailedScheduling  43s (x3 over 3m39s)  default-scheduler  0/3 nodes are available: 3 Insufficient nvidia.com/gpu.
```