kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

Kubeflow Pipelines SDK: distributed multi-node training with autoscaling #312

Closed: rami3e closed this issue 3 years ago

rami3e commented 3 years ago

I see there are similar issues, but I still could not find a concrete answer on this. Is there an example showing multi-node distributed training via the SDK? Is launch.py the currently preferred approach? And how can I get the autoscaler to provision the number of nodes a distributed job requests? Thanks.

gaocegege commented 3 years ago

Do you mean submitting a distributed PyTorch job using the KFP SDK?

rami3e commented 3 years ago

Yes. I understand the distributed training procedure itself, but I see that Pipelines submits an Argo workflow, and I am not sure how to coordinate the required global variables, or how to get the autoscaler to pick up the requested number of nodes (i.e. a master plus x workers).
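(For reference, a minimal sketch of the coordination piece, assuming the pytorch-operator's usual behavior of injecting MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK into every Master/Worker pod. With those variables present, the training script can use env:// initialization and does not need the pipeline to pass rendezvous information; node scale-up is then left to the cluster autoscaler, which reacts to unschedulable Worker pods.)

```python
# Illustrative training entrypoint; assumes the operator injects the standard
# torch.distributed environment variables into each replica pod.
import torch.distributed as dist

def main():
    # MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK are read from the
    # environment, so "env://" initialization needs no extra arguments.
    dist.init_process_group(backend="gloo", init_method="env://")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    print(f"rank {rank} of {world_size} initialized")

    # ... build the model, wrap it in DistributedDataParallel, train ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```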

gaocegege commented 3 years ago

I think it is recommended to use https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/launcher/src to submit distributed training jobs from Kubeflow Pipelines. But we do not have such a launcher for PyTorchJob yet; I think we should implement one.
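(For illustration only: the launcher pattern referenced above is consumed from a pipeline roughly as in the sketch below. The component path, parameter names, and image are hypothetical, since a PyTorchJob launcher does not exist at this point; this just conveys the shape such a component would likely take.)

```python
# Sketch of launching a (hypothetical) PyTorchJob launcher component from a pipeline.
import kfp
from kfp import dsl, components

# Hypothetical component.yaml, analogous to the existing TFJob launcher.
pytorchjob_launcher_op = components.load_component_from_file(
    "pytorchjob_launcher/component.yaml")

@dsl.pipeline(
    name="distributed-pytorch",
    description="Launch a PyTorchJob from a pipeline step")
def pipeline(worker_replicas: int = 3):
    # The launcher step would create the PyTorchJob CR and wait for completion.
    pytorchjob_launcher_op(
        name="example-job",
        namespace="kubeflow",
        worker_replicas=worker_replicas,
        image="gcr.io/my-project/my-training-image:latest",  # placeholder image
    )

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(pipeline, "pipeline.yaml")
```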

/cc @johnugeorge @andreyvelich @Bobgy

Bobgy commented 3 years ago

Or you can create the Kubernetes custom resource via https://github.com/kubeflow/pipelines/blob/master/samples/core/resource_ops/resource_ops.py in a pipeline step.
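(A minimal sketch of that approach, using kfp.dsl.ResourceOp to create a PyTorchJob custom resource from a pipeline step. The manifest, image, replica count, and success/failure conditions below are illustrative placeholders, not a confirmed recipe.)

```python
import kfp
from kfp import dsl

def pytorchjob_manifest(name: str, image: str, workers: int) -> dict:
    """Build an illustrative PyTorchJob custom resource (kubeflow.org/v1)."""
    container = {
        "name": "pytorch",
        "image": image,
        "command": ["python", "/workspace/train.py"],
    }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": {
                    "replicas": 1,
                    "restartPolicy": "OnFailure",
                    "template": {"spec": {"containers": [container]}},
                },
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {"spec": {"containers": [container]}},
                },
            }
        },
    }

@dsl.pipeline(name="pytorchjob-resource-op")
def pipeline():
    # Create the PyTorchJob CR; the step waits until the resource matches
    # the success or failure condition (conditions here are placeholders).
    dsl.ResourceOp(
        name="train",
        k8s_resource=pytorchjob_manifest(
            "example-job", "gcr.io/my-project/train:latest", workers=3),
        action="create",
        success_condition="status.replicaStatuses.Master.succeeded==1",
        failure_condition="status.replicaStatuses.Master.failed>0",
    )

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(pipeline, "pipeline.yaml")
```

With this approach the pipeline step only creates the resource; the pytorch-operator handles pod creation for the Master and Worker replicas, and any pods that cannot be scheduled are what the cluster autoscaler reacts to.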