kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

how to create a local non-distributed training #287

Closed houz42 closed 4 years ago

houz42 commented 4 years ago

As defined in the CRD, worker replicas must be >= 1 and master replicas must be == 1, so how can I create a training job that runs on a single node?

https://github.com/kubeflow/pytorch-operator/blob/eba73411bc03d70b72dcab623aa7a01c14f811d4/manifests/crd.yaml#L37

houz42 commented 4 years ago

@gaocegege you mentioned in another issue (https://github.com/kubeflow/pytorch-operator/issues/278#issuecomment-642353290) that:

you can have 1 Master to run local training jobs

But I just can not create a pytorchjob without worker.

gaocegege commented 4 years ago

Then you can try to create one worker job.

houz42 commented 4 years ago

Then you can try to create one worker job.

A pytorchjob with no master? I can't create that either; the master replica count must be 1.

gaocegege commented 4 years ago

I tried to create a job with one master and it works. Can you explain more about "But I just can not create a pytorchjob without worker"?

Is there any error during the run?

houz42 commented 4 years ago

Is there any error during the run?

I finally realized it was my fault. I did not create "a pytorchjob with only 1 master", but "a pytorchjob with 1 master and 0 workers", which was rejected during validation. Leaving the worker section out entirely, instead of setting its replicas to 0, avoids the problem.

Sorry for my mistake, and thanks for your patience.
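
For anyone landing here later, the resolution above can be sketched in Python. This is a minimal illustration, not the operator's actual validation code: `single_master_job` and `check_replicas` are hypothetical helpers, the image name is a placeholder, and the `kubeflow.org/v1` API version and the `pytorch` container name are assumptions about the installed operator. The checker only mirrors the two rules quoted in this thread (master replicas == 1, worker replicas >= 1 when a worker section exists).

```python
def single_master_job(name, image):
    """Build a PyTorchJob manifest with a Master section and no Worker section."""
    return {
        "apiVersion": "kubeflow.org/v1",  # assumed API version
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": {
                    "replicas": 1,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            # container name "pytorch" is an assumption
                            "containers": [{"name": "pytorch", "image": image}]
                        }
                    },
                }
                # No "Worker" key at all: omit it rather than set replicas to 0.
            }
        },
    }


def check_replicas(manifest):
    """Return violations of the replica rules described in this thread.

    Simplification: this sketch requires `replicas` to be set explicitly.
    """
    specs = manifest["spec"]["pytorchReplicaSpecs"]
    problems = []
    master = specs.get("Master")
    if master is None or master.get("replicas") != 1:
        problems.append("Master replicas must be exactly 1")
    worker = specs.get("Worker")
    if worker is not None and worker.get("replicas", 0) < 1:
        problems.append("Worker replicas must be >= 1 when a Worker section is present")
    return problems


if __name__ == "__main__":
    ok = single_master_job("local-train", "example.com/my-train:latest")
    print(check_replicas(ok))

    # The mistake from this thread: 1 master plus an explicit 0-replica worker.
    bad = single_master_job("local-train", "example.com/my-train:latest")
    bad["spec"]["pytorchReplicaSpecs"]["Worker"] = {"replicas": 0}
    print(check_replicas(bad))
```

The manifest dict can be dumped to YAML or JSON and applied with the usual tooling; the point is simply that the single-node case omits the worker section instead of declaring it with zero replicas.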