kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

[SDK] Allow customising base trainer and storage images in Train API #2261

Closed: varshaprasad96 closed this 2 months ago

varshaprasad96 commented 2 months ago

What this PR does / why we need it: Allows customising the base `storage_initializer` and trainer images through environment variables. Example use case: the Train API could be extended to use ROCm libraries in addition to CUDA.
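As a rough illustration of the idea in the description above, the following minimal sketch reads image names from environment variables and falls back to defaults when they are unset. The variable names, the default image references, and the helpers `resolve_image` and `resolve_train_images` are hypothetical and chosen for illustration; they are not taken from this PR's diff.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical environment variable names and default images (assumptions,
# not the names used by the SDK).
STORAGE_INITIALIZER_IMAGE_ENV = "STORAGE_INITIALIZER_IMAGE"
TRAINER_IMAGE_ENV = "TRAINER_TRANSFORMER_IMAGE"
DEFAULT_STORAGE_INITIALIZER_IMAGE = "example.registry/storage-initializer:latest"
DEFAULT_TRAINER_IMAGE = "example.registry/trainer:latest"


def resolve_image(
    env_var: str,
    default: str,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the image named by env_var, or default if it is unset or blank."""
    env = os.environ if environ is None else environ
    value = env.get(env_var, "").strip()
    return value or default


def resolve_train_images(
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Resolve both base images, applying any environment overrides."""
    return {
        "storage_initializer": resolve_image(
            STORAGE_INITIALIZER_IMAGE_ENV,
            DEFAULT_STORAGE_INITIALIZER_IMAGE,
            environ,
        ),
        "trainer": resolve_image(
            TRAINER_IMAGE_ENV,
            DEFAULT_TRAINER_IMAGE,
            environ,
        ),
    }
```

Under these assumptions, a user targeting ROCm could export the trainer variable with a ROCm-based image before calling the SDK and leave the storage initializer at its default. Passing `environ` explicitly keeps the helper easy to test without mutating the process environment.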

Which issue(s) this PR fixes: Fixes #2247

TODO: Docs to be updated in https://github.com/kubeflow/website.

coveralls commented 2 months ago

Pull Request Test Coverage Report for Build 10927951593

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 10927738808: | 100.0% |
| Covered Lines: | 66 |
| Relevant Lines: | 66 |

google-oss-prow[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

Needs approval from an approver in each of these files:

- ~~[sdk/python/OWNERS](https://github.com/kubeflow/training-operator/blob/master/sdk/python/OWNERS)~~ [tenzen-y]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

tenzen-y commented 2 months ago

/assign @deepanker13

deepanker13 commented 2 months ago

Thanks @varshaprasad96
/lgtm