kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

(add) deepspeed_mpi specific container, deepspeed_config for MPI with nodetaints #549

Closed ghost closed 1 year ago

ghost commented 1 year ago

This PR adds an example integrating DeepSpeed, a distributed training library, with Kubeflow to the main mpi-operator examples. The goal is to improve the efficiency and performance of distributed training jobs by combining the capabilities of DeepSpeed and MPI. Comments in the configuration explain the use of taints and tolerations in the Kubernetes manifests so that DeepSpeed worker pods are scheduled onto nodes with specific resources, such as GPUs.
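For readers unfamiliar with the pattern, here is a minimal sketch of what the worker side of such an MPIJob could look like; the image name, taint key, and replica/GPU counts are illustrative and not taken from this PR:

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: deepspeed-mpi-example            # hypothetical name
spec:
  mpiReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          # Tolerate the taint placed on GPU nodes so the DeepSpeed
          # worker pods can be scheduled onto them.
          tolerations:
            - key: nvidia.com/gpu         # must match the taint key on the GPU nodes
              operator: Exists
              effect: NoSchedule
          containers:
            - name: deepspeed-worker
              image: example/deepspeed-mpi:latest   # hypothetical image
              resources:
                limits:
                  nvidia.com/gpu: 1
```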

google-cla[bot] commented 1 year ago

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

google-oss-prow[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

Syulin7 commented 1 year ago

DeepSpeed configures multi-node compute resources with hostfiles that are compatible with OpenMPI. A hostfile is a list of hostnames (or SSH aliases), which are machines accessible via passwordless SSH.

Do we need to support DeepSpeed's own parallel launcher (via pdsh) in mpi-operator? The difference is that the default path for the hostfile in DeepSpeed is /job/hostfile. Therefore, if the operator can generate /job/hostfile (like Horovod's discover_hosts.sh), it could support DeepSpeed's own parallel launcher.

Ref: https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node
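For context, an OpenMPI-compatible hostfile (which DeepSpeed also accepts) is just a list of hostnames with slot counts, for example (hostnames and slot counts here are illustrative):

```
deepspeed-mpi-worker-0 slots=8
deepspeed-mpi-worker-1 slots=8
```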

@dtunai @alculquicondor @tenzen-y WDYT?

tenzen-y commented 1 year ago

> DeepSpeed configures multi-node compute resources with hostfiles that are compatible with OpenMPI. A hostfile is a list of hostnames (or SSH aliases), which are machines accessible via passwordless SSH.
>
> Do we need to support DeepSpeed's own parallel launcher (via pdsh) in mpi-operator? The difference is that the default path for the hostfile in DeepSpeed is /job/hostfile. Therefore, if the operator can generate /job/hostfile (like Horovod's discover_hosts.sh), it could support DeepSpeed's own parallel launcher.

Based on the above document, I don't think we need to generate the hostfile at /job/hostfile, since users can set the hostfile path via the deepspeed command, and DeepSpeed uses the same hostfile format as OpenMPI.

IIRC, we generate discover_hosts.sh for Horovod because Horovod uses a different format than OpenMPI for host discovery.

tenzen-y commented 1 year ago

Let me know if I'm missing anything else.

Syulin7 commented 1 year ago

Your understanding is correct. Currently, DeepSpeed supports the following three forms:

  1. Like this PR, launched with mpirun: `mpirun python train.py --deepspeed_mpi`

  2. Launched with the `deepspeed` command, which reads `/job/hostfile` by default and launches via pdsh: `deepspeed train.py`

  3. Launched with the `deepspeed` command with the hostfile path set explicitly: `deepspeed --hostfile=/etc/mpi/hostfile train.py`

Therefore, if we want to support the second and third forms in mpi-operator, perhaps we should remind users in the documentation that they must set `--hostfile=/etc/mpi/hostfile`?

I would like to add a new example to do this.
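A rough sketch of what the third form's launcher could look like in an MPIJob spec, assuming the operator keeps generating the hostfile at /etc/mpi/hostfile (container name and image are hypothetical, not part of this PR):

```yaml
mpiReplicaSpecs:
  Launcher:
    replicas: 1
    template:
      spec:
        containers:
          - name: deepspeed-launcher
            image: example/deepspeed-pdsh:latest   # hypothetical image
            command:
              - deepspeed
              # Point DeepSpeed's own (pdsh-based) launcher at the hostfile
              # generated by mpi-operator instead of the default /job/hostfile.
              - --hostfile=/etc/mpi/hostfile
              - train.py
```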

tenzen-y commented 1 year ago

> Your understanding is correct. Currently, DeepSpeed supports the following three forms:
>
>   1. Like this PR, launched with mpirun: `mpirun python train.py --deepspeed_mpi`
>   2. Launched with the `deepspeed` command, which reads `/job/hostfile` by default and launches via pdsh: `deepspeed train.py`
>   3. Launched with the `deepspeed` command with the hostfile path set explicitly: `deepspeed --hostfile=/etc/mpi/hostfile train.py`
>
> Therefore, if we want to support the second and third forms in mpi-operator, perhaps we should remind users in the documentation that they must set `--hostfile=/etc/mpi/hostfile`?
>
> I would like to add a new example to do this.

Thank you for clarifying. It is probably enough to add an example for the first (this PR) and third forms.

@alculquicondor wdyt?

alculquicondor commented 1 year ago

Is there a way to specify the hostfile via environment variable? That's how we do it for mpirun.

Are there any changes required to have pdsh work? Or maybe some features can be disabled, such as the secret that contains ssh keys?

Syulin7 commented 1 year ago

> Is there a way to specify the hostfile via environment variable? That's how we do it for mpirun.

I couldn't find an environment variable to specify the hostfile (like `OMPI_MCA_orte_default_hostfile`) in the DeepSpeed documentation.

> Are there any changes required to have pdsh work? Or maybe some features can be disabled, such as the secret that contains ssh keys?

The secret that contains the SSH keys is still necessary, since pdsh also accesses workers via passwordless SSH. Based on my testing, the hostfile and passwordless SSH are sufficient for pdsh to work.

alculquicondor commented 1 year ago

I guess we can move forward with this PR and provide another example using `deepspeed --hostfile`. @dogukanutuna did you have a chance to make this work with `mpioperator/base`?

tenzen-y commented 1 year ago

@simulark Why did you close this PR?

ghost commented 1 year ago

> @simulark Why did you close this PR?

Hello @tenzen-y. It was an unintended action. If you cannot restore it right now, I can open a new PR with `mpioperator/base`.

tenzen-y commented 1 year ago

> @simulark Why did you close this PR?
>
> Hello @tenzen-y. It was an unintended action. If you cannot restore it right now, I can open a new PR with `mpioperator/base`.

Oh, I see. Thank you for letting me know!