Closed ghost closed 1 year ago
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
@tenzen-y fyi!
Containers are now able to carry the DeepSpeed (DS) library and a DS-applied model application with the patched mpioperator/base.
This PR replicates https://github.com/kubeflow/mpi-operator/pull/549 and adds an example integrating DeepSpeed, a distributed training library, with Kubeflow to the main mpi-operator examples. The objective of this example is to improve the efficiency and performance of distributed training jobs by combining the capabilities of DeepSpeed and MPI. Comments in the configuration explain the use of taints and tolerations in the Kubernetes configuration to ensure that DeepSpeed worker pods are scheduled on nodes with specific resources, such as GPUs.
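For readers unfamiliar with the taint/toleration setup mentioned above, here is a minimal sketch of how a toleration might be attached to the worker pods of an MPIJob. It is not the configuration from this PR. The taint key and value (`nvidia.com/gpu=present:NoSchedule`), the job name, the image name, and the `build_mpijob` helper are all assumptions made for illustration. The sketch builds the manifest as a plain dictionary and prints it as JSON, which `kubectl apply -f` accepts.

```python
import json

# Assumption: GPU nodes carry a taint such as
#   kubectl taint nodes <node-name> nvidia.com/gpu=present:NoSchedule
# The key, value, and effect here are illustrative, not taken from the PR.
GPU_TOLERATION = {
    "key": "nvidia.com/gpu",
    "operator": "Equal",
    "value": "present",
    "effect": "NoSchedule",
}


def build_mpijob(name, image, workers, gpus_per_worker=1):
    """Return an MPIJob manifest whose workers tolerate the GPU taint.

    Hypothetical helper: container commands are omitted for brevity.
    """
    worker_pod = {
        # Only the workers need the toleration; they are the pods
        # that must land on the tainted GPU nodes.
        "tolerations": [GPU_TOLERATION],
        "containers": [{
            "name": "deepspeed-worker",
            "image": image,
            "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
        }],
    }
    launcher_pod = {
        "containers": [{"name": "deepspeed-launcher", "image": image}],
    }
    return {
        "apiVersion": "kubeflow.org/v2beta1",
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "slotsPerWorker": gpus_per_worker,
            "mpiReplicaSpecs": {
                "Launcher": {"replicas": 1, "template": {"spec": launcher_pod}},
                "Worker": {"replicas": workers, "template": {"spec": worker_pod}},
            },
        },
    }


if __name__ == "__main__":
    # Placeholder job and image names; substitute your own.
    job = build_mpijob("deepspeed-example", "example.com/deepspeed:latest", workers=2)
    print(json.dumps(job, indent=2))
```

A toleration only permits scheduling onto tainted nodes; it does not force it. In this sketch the `nvidia.com/gpu` resource limit is what restricts the workers to nodes that actually expose GPUs.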
Following further discussion in #549, this PR will be implemented with mpioperator/base soon.