Closed. rami3e closed this issue 3 years ago.
Do you mean submitting a distributed pytorch job using kfp sdk?
Yes. I understand the distributed training procedure, but I see that Pipelines submits an Argo Workflow, and I am not sure how to coordinate the required global variables, or how to get the autoscaler to provision the requested number of nodes (i.e. a master plus x workers).
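For context on the "global variables" part: when a PyTorchJob runs under the PyTorch operator, the controller injects the standard rendezvous environment variables (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK) into each replica, and the training script reads them via the env:// init method. A minimal sketch of picking these up (the local-run defaults below are illustrative assumptions, not from this thread):

```python
# Sketch: read the coordination variables the PyTorch operator injects into
# each replica pod. Defaults allow a single-process local run for testing.
import os

def dist_env(env=os.environ):
    """Collect the rendezvous settings set by the PyTorchJob controller."""
    return {
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", "23456")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
    }

cfg = dist_env()
# In the real training script you would then call:
#   torch.distributed.init_process_group(backend="nccl", init_method="env://")
# which reads these same four variables, so no manual coordination is needed.
print(cfg)
```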
I think it is recommended to use https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/launcher/src to submit distributed training jobs in Kubeflow Pipelines. But we do not have such a launcher for PyTorchJob yet; I think we should implement one.
/cc @johnugeorge @andreyvelich @Bobgy
Alternatively, you can create the Kubernetes custom resource via https://github.com/kubeflow/pipelines/blob/master/samples/core/resource_ops/resource_ops.py in a pipeline step.
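Following that sample, a pipeline step can hand a PyTorchJob manifest to `kfp.dsl.ResourceOp`. A hedged sketch, assuming the v1 KFP SDK; the job name, image, and success condition are placeholders, not from this thread:

```python
# Sketch: build a kubeflow.org/v1 PyTorchJob manifest (1 Master + N Workers)
# and submit it from a pipeline step with kfp.dsl.ResourceOp.
def pytorchjob_manifest(name, image, workers, namespace="kubeflow"):
    """Return a PyTorchJob dict with one Master and `workers` Worker replicas."""
    def replica(n):
        return {
            "replicas": n,
            "restartPolicy": "OnFailure",
            "template": {"spec": {"containers": [{"name": "pytorch", "image": image}]}},
        }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }

# Inside a pipeline function (requires the kfp SDK; condition string is
# illustrative and depends on the operator's status fields):
#
#   import kfp.dsl as dsl
#
#   @dsl.pipeline(name="pytorch-dist")
#   def pipeline():
#       dsl.ResourceOp(
#           name="train",
#           k8s_resource=pytorchjob_manifest("demo", "my-image:tag", workers=3),
#           action="create",
#           success_condition="status.replicaStatuses.Worker.succeeded==3",
#       )
```

Note that for the autoscaler side of the question, adding explicit `resources.requests` (CPU/GPU/memory) to the worker container spec is what lets pending pods trigger a cluster scale-up.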
I see there are similar issues, but I still could not get a concrete answer to the above. Is there an example showing multi-node distributed training via the SDK? Is launch.py the currently preferred approach? And how can I get the autoscaler to work with distributed node requests? Thanks.