[SDK] Allow customising base trainer and storage images in Train API

kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes

https://www.kubeflow.org/docs/components/training

Apache License 2.0

[SDK] Allow customising base trainer and storage images in Train API #2261

Status: Closed (varshaprasad96 closed this 2 months ago)

varshaprasad96 commented 2 months ago

What this PR does / why we need it: Allow customising the base storage_initializer and trainer images through environment variables. Example use case: the Train API could be extended to use ROCm libraries in addition to CUDA.
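
For illustration, the override could look roughly like the sketch below. This is not the code in this PR: the environment variable names (`STORAGE_INITIALIZER_IMAGE`, `TRAINER_TRANSFORMER_IMAGE`), the default image references, and the helper `resolve_base_images` are all assumptions made for the example.

```python
import os
from typing import Dict, Mapping

# Assumed defaults, for illustration only; the real SDK constants may differ.
DEFAULT_STORAGE_INITIALIZER_IMAGE = "docker.io/kubeflow/storage-initializer"
DEFAULT_TRAINER_IMAGE = "docker.io/kubeflow/trainer-huggingface"


def resolve_base_images(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the images to use, preferring values set in the environment.

    The variable names below are hypothetical. An unset or empty variable
    falls back to the default image.
    """
    return {
        "storage_initializer": env.get("STORAGE_INITIALIZER_IMAGE")
        or DEFAULT_STORAGE_INITIALIZER_IMAGE,
        "trainer": env.get("TRAINER_TRANSFORMER_IMAGE") or DEFAULT_TRAINER_IMAGE,
    }


if __name__ == "__main__":
    # Example: point the trainer at a (hypothetical) ROCm-based image.
    images = resolve_base_images(
        {"TRAINER_TRANSFORMER_IMAGE": "example.com/rocm-trainer:latest"}
    )
    print(images["trainer"])
    print(images["storage_initializer"])
```

Passing the environment as a mapping keeps the lookup easy to test. In normal use the function would read `os.environ`, so a user could export the variables before calling the Train API.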

Which issue(s) this PR fixes: Fixes #2247

TODO: Docs to be updated in https://github.com/kubeflow/website.

Checklist:

[ ] Docs included if any changes are user facing

coveralls commented 2 months ago

Pull Request Test Coverage Report for Build 10927951593

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall first build on sdk/fetch-base-image at 100.0%

Totals
Change from base Build 10927738808:	100.0%
Covered Lines:	66
Relevant Lines:	66

💛 - Coveralls

google-oss-prow[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[sdk/python/OWNERS](https://github.com/kubeflow/training-operator/blob/master/sdk/python/OWNERS)~~ [tenzen-y]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

tenzen-y commented 2 months ago

/assign @deepanker13

deepanker13 commented 2 months ago

Thanks @varshaprasad96
/lgtm