kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Add JAX controller #2194

Closed sandipanpanda closed 1 month ago

sandipanpanda commented 3 months ago

What this PR does / why we need it: Implement JAX controller

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fix: https://github.com/kubeflow/training-operator/issues/1619
Ref: https://github.com/kubeflow/training-operator/issues/2145

/area gsoc

sandipanpanda commented 3 months ago

/cc @andreyvelich @tenzen-y @terrytangyuan

coveralls commented 3 months ago

Pull Request Test Coverage Report for Build 10960403998

Totals Coverage Status
Change from base Build 10937611143: 0.0%
Covered Lines: 66
Relevant Lines: 66

shravan-achar commented 3 months ago

@sandipanpanda - Is this PR ready for review?

tenzen-y commented 3 months ago

> @sandipanpanda - Is this PR ready for review?

Actually, no. @sandipanpanda is still working on parts of the implementation. For the remaining work, see the weekly sync documentation.

sandipanpanda commented 3 months ago

Still remaining: adding the jaxjob webhook test, adding examples, and finishing some work on test_e2e_jaxjob.py. Could you please share your input on whether the implementation so far is heading in the right direction?

andreyvelich commented 3 months ago

@sandipanpanda If this PR is ready for review, please remove the WIP from the PR title.

tenzen-y commented 2 months ago

@sandipanpanda Additionally, could you add the Dockerfile to the build pipeline? https://github.com/kubeflow/training-operator/blob/6ddeb2b90ebe116beaa800c57c344913e78aaf38/.github/workflows/publish-example-images.yaml

tenzen-y commented 2 months ago

I think the current integration test error could be resolved by updating this file: https://github.com/kubeflow/training-operator/blob/6ddeb2b90ebe116beaa800c57c344913e78aaf38/manifests/base/webhook/patch.yaml

tenzen-y commented 2 months ago

@sandipanpanda Could you address this error? We need to pass the appropriate arguments in the e2e test.

DEBUG kubernetes.client.rest:rest.py:235 response body: FATAL Flags parsing error: flag --job_name=None: Flag --job_name must have a value other than None. flag --sub_domain=None: Flag --sub_domain must have a value other than None. flag --coordinator_port=None: Flag --coordinator_port must have a value other than None. Pass --helpshort or --helpfull to see help on flags.

DEBUG kubernetes.client.rest:rest.py:235 response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Success","details":{"name":"jaxjob-cpu-ci-test","group":"kubeflow.org","kind":"jaxjobs","uid":"f7402a7c-588a-4b9d-903d-969ba0d4c7e2"}}

https://github.com/kubeflow/training-operator/actions/runs/10753814852/job/29823692795?pr=2194#step:9:5832
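For readers following along, the failure in the log above can be reproduced in miniature. The sketch below is not the example script from this PR: it is a hypothetical entry point that uses the standard-library `argparse` as a stand-in for the flag library that produced the log message. Only the three flag names (`--job_name`, `--sub_domain`, `--coordinator_port`) come from the log; the function names, help strings, and sample values are assumptions for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three flags named in the log as required options.

    Hypothetical stand-in: the real example script is not shown in this thread.
    """
    parser = argparse.ArgumentParser(description="Hypothetical JAX example entry point")
    parser.add_argument("--job_name", required=True, help="value for --job_name")
    parser.add_argument("--sub_domain", required=True, help="value for --sub_domain")
    parser.add_argument(
        "--coordinator_port", required=True, type=int, help="value for --coordinator_port"
    )
    return parser


def parse_flags(argv):
    """Parse argv; argparse exits with status 2 if a required flag is missing."""
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    # Sample values are made up, except the job name, which appears in the log.
    args = parse_flags(
        ["--job_name=jaxjob-cpu-ci-test", "--sub_domain=example", "--coordinator_port=6666"]
    )
    print(args.job_name, args.sub_domain, args.coordinator_port)
```

If the container command in the e2e test omits any of these flags, a script written this way exits before training starts, which matches the shape of the failure reported above. Passing all three as container arguments in the test's job spec would be one way to address it.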

sandipanpanda commented 2 months ago

cc @tenzen-y PTAL

google-oss-prow[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/kubeflow/training-operator/blob/master/OWNERS)~~ [tenzen-y]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

sandipanpanda commented 1 month ago

Thank you for your unwavering guidance and support throughout this project!