Closed: andreyvelich closed this issue 1 week ago.
@andreyvelich: This request has been marked as suitable for new contributors. Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.
I believe that both examples (training-operator and mpi-operator) would be worth it. But I think we should add one example each: a PyTorchJob with DeepSpeed and torchrun, and an MPIJob v2 with DeepSpeed and mpirun.
Sure, that sounds great @tenzen-y! It would be great to see benchmarks for mpirun and torchrun running DeepSpeed on Kubernetes.
It sounds great, but I guess there are no significant performance differences between the two approaches, since DeepSpeed uses the NCCL backend even when launched with mpirun.
I'm working on an equivalent example for the Flux Operator - but a quick question: will it work OK to test without a GPU? I've been trying to get just 3 nodes, each with one NVIDIA GPU, on Google Cloud, and I never get the allocation.
Ah - this looks more promising. https://github.com/kubeflow/mpi-operator/pull/567/files
@tenzen-y Does DeepSpeed only support the nccl backend? E.g., we can't run it with CPUs?
TBH, I don't have any experience with CPU-only training. But at first glance, DeepSpeed seems to support PyTorch without a GPU: https://github.com/microsoft/DeepSpeed/blob/master/.github/workflows/cpu-torch-latest.yml
> there are no significant performance differences between the two approaches, since DeepSpeed uses the NCCL backend even when launched with mpirun.
This statement is generally correct in almost all cases in an NCCL context. Still, I have a few experiences to share for those using an MPI-style setup and suffering performance issues.
Overall, mpirun and torchrun should have no performance difference.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
Related: https://github.com/kubeflow/training-operator/issues/2040
As we discussed multiple times, the Kubeflow community is looking for examples of how to use the MPI Operator with DeepSpeed.
We should add examples to the MPI Operator (https://github.com/kubeflow/mpi-operator/tree/master/examples/v2beta1) or the Training Operator (https://github.com/kubeflow/training-operator/tree/master/examples).
Some pending PRs can be found here for reference:
/good-first-issue /help /area example
/cc @alculquicondor @kubeflow/wg-training-leads @kuizhiqing