kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Add DeepSpeed Example with MPI Operator #2091

Closed andreyvelich closed 1 week ago

andreyvelich commented 5 months ago

Related: https://github.com/kubeflow/training-operator/issues/2040

As we discussed multiple times, the Kubeflow community is looking for examples of how to use the MPI Operator and [DeepSpeed](https://github.com/microsoft/DeepSpeed).

We should add an example to either the MPI Operator: https://github.com/kubeflow/mpi-operator/tree/master/examples/v2beta1 or the Training Operator: https://github.com/kubeflow/training-operator/tree/master/examples.

Some pending PRs can be found here as reference:

- https://github.com/kubeflow/mpi-operator/pull/610
- https://github.com/kubeflow/mpi-operator/pull/567

/good-first-issue
/help
/area example

/cc @alculquicondor @kubeflow/wg-training-leads @kuizhiqing

google-oss-prow[bot] commented 5 months ago

@andreyvelich: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to [this](https://github.com/kubeflow/training-operator/issues/2091) (the issue description above is quoted): Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

tenzen-y commented 5 months ago

I believe that both examples (training-operator and mpi-operator) would be worth it. But I think we should add an example for each: a PyTorchJob with DeepSpeed and torchrun, and an MPIJob v2 with DeepSpeed and mpirun.
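
For concreteness, both examples could drive essentially the same DeepSpeed training script and differ only in the launcher. A minimal sketch (the file name, toy model, and config flags here are illustrative assumptions, not an existing example):

```python
# train.py -- hypothetical minimal DeepSpeed script usable from either launcher
import argparse

import torch
import deepspeed


def main():
    parser = argparse.ArgumentParser()
    # DeepSpeed-style launchers pass --local_rank to the script.
    parser.add_argument("--local_rank", type=int, default=-1)
    # Adds the standard --deepspeed / --deepspeed_config flags.
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()

    # Toy model; a real example would use a real network and dataset.
    model = torch.nn.Linear(16, 1)

    # DeepSpeed sets up torch.distributed and wraps the model/optimizer
    # according to the JSON config passed via --deepspeed_config.
    engine, _, _, _ = deepspeed.initialize(
        args=args, model=model, model_parameters=model.parameters()
    )

    for _ in range(10):
        x = torch.randn(8, 16, device=engine.device)
        y = torch.randn(8, 1, device=engine.device)
        loss = torch.nn.functional.mse_loss(engine(x), y)
        engine.backward(loss)
        engine.step()


if __name__ == "__main__":
    main()
```

The MPIJob v2 example would then invoke this script through mpirun from the launcher pod, while the PyTorchJob example would invoke it through torchrun; the script itself should not need to change.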

andreyvelich commented 5 months ago

Sure, that sounds great @tenzen-y! It would be great to see benchmarks comparing mpirun and torchrun for running DeepSpeed on Kubernetes.

tenzen-y commented 5 months ago

> Sure, that sounds great @tenzen-y! It would be great to see benchmarks comparing mpirun and torchrun for running DeepSpeed on Kubernetes.

It sounds great, but I guess there are no significant performance differences between the two approaches, since DeepSpeed uses the NCCL backend even if we use mpirun.
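
For context, on GPU clusters DeepSpeed initializes torch.distributed with NCCL no matter which launcher starts the processes, so mpirun vs. torchrun mainly changes process bootstrapping, not the collectives. Roughly (a sketch, not a full example):

```python
import deepspeed

# Whether the process was started by mpirun or torchrun, GPU collectives
# still go through NCCL; the launcher mostly affects rank/world-size discovery.
deepspeed.init_distributed(dist_backend="nccl")
```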

vsoch commented 5 months ago

I'm working on an equivalent example for the Flux Operator, but a quick question: will it be OK to test without a GPU? I've been trying to get just 3 nodes, each with one NVIDIA GPU, on Google Cloud, and I never get the allocation.

vsoch commented 5 months ago

Ah, this looks more promising: https://github.com/kubeflow/mpi-operator/pull/567/files

andreyvelich commented 5 months ago

@tenzen-y Does DeepSpeed only support the NCCL backend? E.g., we can't run it with CPUs?

tenzen-y commented 5 months ago

> @tenzen-y Does DeepSpeed only support the NCCL backend? E.g., we can't run it with CPUs?

TBH, I don't have any experience with CPU-only runs. But at first glance, DeepSpeed seems to support PyTorch without GPUs: https://github.com/microsoft/DeepSpeed/blob/master/.github/workflows/cpu-torch-latest.yml
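
If a CPU-only smoke test is enough, one option (an assumption on my side, not something I've verified against the DeepSpeed docs) is to initialize torch.distributed with the gloo backend, since NCCL requires NVIDIA GPUs:

```python
import deepspeed

# Assumption: gloo is sufficient for a CPU-only functional test; NCCL
# needs NVIDIA GPUs. A performance-oriented CPU run may need another backend.
deepspeed.init_distributed(dist_backend="gloo")
```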

kuizhiqing commented 3 months ago

> ...there are no significant performance differences between the two approaches, since DeepSpeed uses the NCCL backend even if we use mpirun.

This statement is generally correct in almost all cases in an NCCL context. Still, I have a couple of experiences to share for those using an MPI-style setup and suffering from performance issues.

Overall, mpirun and torchrun should have no performance difference.

github-actions[bot] commented 4 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 week ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.