kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

How can I deploy distributed training on Kubernetes clusters with torch.distributed.launch #560

Open ThomaswellY opened 1 year ago

ThomaswellY commented 1 year ago

I have been using the mmpretrain project (https://github.com/open-mmlab/mmpretrain), which contains many classification scripts. However, it uses torch.distributed.launch to start distributed training. Is there any way, with the Kubeflow operators, to start this kind of distributed training on a k8s cluster? PS: I have sought help from training-operator and pytorch-operator, but couldn't find an obvious solution. Thanks in advance~ any hints would be helpful.

alculquicondor commented 1 year ago

Are you in the wrong repo? This repo is about MPI. Pytorch is supported in https://github.com/kubeflow/training-operator

tenzen-y commented 1 year ago

@ThomaswellY If you want to run torchrun, you should open an issue in the training-operator repo. If you want to run distributed PyTorch training with mpirun, we can answer your questions in this repo.

Which commands do you mean?

ThomaswellY commented 1 year ago

@alculquicondor @tenzen-y I was looking for how to modify the original script, which currently uses torch.distributed.launch to start training, so that it starts training with mpirun under the mpi-operator.
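For scripts that initialize torch.distributed from environment variables, the adaptation usually amounts to translating the variables that mpirun (Open MPI) sets into the ones the `env://` init method expects, since mpirun does not set RANK/WORLD_SIZE the way torch.distributed.launch does. Below is a minimal, hedged sketch: the `OMPI_COMM_WORLD_*` names are real Open MPI variables, but the helper function and the `mpi-launcher` master hostname are illustrative assumptions, not something mpi-operator provides under that exact name.

```python
import os

def mpi_env_to_torch_env(env, master_addr="mpi-launcher", master_port="29500"):
    """Map Open MPI's per-process env vars (set by mpirun) onto the
    variables torch.distributed's env:// init method reads.

    ``master_addr`` is an assumption: it must resolve to the rank-0 host
    (e.g. a Service or the launcher pod in your cluster)."""
    return {
        "RANK": env["OMPI_COMM_WORLD_RANK"],
        "WORLD_SIZE": env["OMPI_COMM_WORLD_SIZE"],
        "LOCAL_RANK": env["OMPI_COMM_WORLD_LOCAL_RANK"],
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": master_port,
    }

if __name__ == "__main__" and "OMPI_COMM_WORLD_RANK" in os.environ:
    os.environ.update(mpi_env_to_torch_env(os.environ))
    # With the variables in place, the training script can initialize the
    # process group the same way it would under torch.distributed.launch:
    # import torch.distributed as dist
    # dist.init_process_group(backend="nccl", init_method="env://")
```

With a shim like this at the top of the entry script, the rest of the mmpretrain training code that relies on RANK/LOCAL_RANK should work unchanged when launched via mpirun from an MPIJob.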