ThomaswellY opened 1 year ago
Are you in the wrong repo? This repo is about MPI. PyTorch is supported in https://github.com/kubeflow/training-operator
@ThomaswellY If you want to run `torchrun`, you should open an issue in the training-operator repo. If you want to run distributed PyTorch training with `mpirun`, we can answer your questions in this repo.
Which commands do you mean?
@alculquicondor @tenzen-y I was looking into how to modify a training script that originally uses torch.distributed.launch so that it can be started with mpirun under the mpi-operator.
I have been using the mmpretrain project (https://github.com/open-mmlab/mmpretrain), which provides many classification training scripts. However, they use torch.distributed.launch to start distributed training, and I wonder whether there is any way to run such distributed training on a k8s cluster under the Kubeflow operators. PS: I have sought help from training-operator and pytorch-operator, but couldn't find an obvious solution. Thanks in advance~ any hints would be helpful to me.
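One common pattern for this (a sketch, not mmpretrain-specific): when processes are started by Open MPI's `mpirun` rather than `torch.distributed.launch`/`torchrun`, the launcher no longer sets `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`, but Open MPI does export its own `OMPI_COMM_WORLD_*` variables. A small shim at the top of the training script can translate them so that `torch.distributed.init_process_group(init_method="env://")` works unchanged. The function name `env_from_open_mpi` is made up for this example; `MASTER_ADDR`/`MASTER_PORT` would still need to be set (e.g. to the launcher-visible address of rank 0) in the Job spec or hostfile setup.

```python
import os


def env_from_open_mpi():
    """Map Open MPI's environment variables to the variables that
    torch.distributed's env:// init method expects.

    Open MPI's mpirun sets OMPI_COMM_WORLD_RANK / _SIZE / _LOCAL_RANK
    for each launched process; torch.distributed's env:// rendezvous
    reads RANK, WORLD_SIZE, and (for device selection) LOCAL_RANK.
    """
    mapping = {
        "RANK": os.environ["OMPI_COMM_WORLD_RANK"],
        "WORLD_SIZE": os.environ["OMPI_COMM_WORLD_SIZE"],
        "LOCAL_RANK": os.environ["OMPI_COMM_WORLD_LOCAL_RANK"],
    }
    os.environ.update(mapping)  # make them visible to torch.distributed
    return mapping
```

After this shim runs, the script could call `torch.distributed.init_process_group("nccl", init_method="env://")` as usual, so the rest of the torch.distributed.launch-based code would not need to change.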