kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.62k stars, 700 forks

FSDP Example for T5 Fine-Tuning and PyTorchJob #2286

Closed by andreyvelich 1 month ago

andreyvelich commented 1 month ago

Related: https://github.com/kubeflow/training-operator/issues/2040

I added a simple FSDP example for T5 fine-tuning with PyTorchJob.

I refactored the example from the PyTorch tutorial: https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html

/assign @kubeflow/wg-training-leads

andreyvelich commented 1 month ago

cc @kuizhiqing @Syulin7

coveralls commented 1 month ago

Pull Request Test Coverage Report for Build 11409185571

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Totals
Change from base Build 11330381194: 0.0%
Covered Lines: 73
Relevant Lines: 73

💛 - Coveralls

Syulin7 commented 1 month ago

Good work! /lgtm

andreyvelich commented 1 month ago

Thanks for the review! /approve

google-oss-prow[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/kubeflow/training-operator/blob/master/OWNERS)~~ [andreyvelich]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

tenzen-y commented 1 month ago

> Thanks for the review! /approve

Why isn't the approve label automatically added to this PR? IIUC, PRs created by an approver get the approve label automatically.

andreyvelich commented 1 month ago

@tenzen-y Actually, we disabled it as part of this PR: https://github.com/GoogleCloudPlatform/oss-test-infra/pull/2271. That helps us avoid accidentally merging PRs that are not ready but have received /lgtm from GitHub members. We noticed that not all contributors add the /hold label when a PR is not ready.

tenzen-y commented 1 month ago

> @tenzen-y Actually, we disabled it as part of this PR: https://github.com/GoogleCloudPlatform/oss-test-infra/pull/2271.
>
> That helps us avoid accidentally merging PRs that are not ready but have received /lgtm from GitHub members.
>
> We noticed that not all contributors add the /hold label when a PR is not ready.

That makes sense. Thank you for the clarification. That seems to be a limitation (accidental merging) of the collaboration between GitHub Actions and Prow :(