pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Migrate state dict API to DSD #1930

Open mori360 opened 1 month ago

mori360 commented 1 month ago

Migrate state dict API to DSD:

What is the purpose of this PR? Is it to

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it. Here is a docstring example and a tutorial example

pytorch-bot[bot] commented 1 month ago

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1930

Note: Links to docs will display an error until the docs builds have been completed.

:x: 1 New Failure, 2 Cancelled Jobs

As of commit 1ad30263c317e03a10c497d996e7490f603a9aad with merge base 7bfb3336446f0d874ab5d4595249839b735b7076:

NEW FAILURE - The following job has failed:

* [GPU tests / gpu_test (3.11, stable)](https://hud.pytorch.org/pr/pytorch/torchtune/1930#32728741376) ([gh](https://github.com/pytorch/torchtune/actions/runs/11747237995/job/32728741376)) `tests/torchtune/training/test_distributed.py::TestFullyShardState::test_qlora_state_dict`

CANCELLED JOBS - The following jobs were cancelled. Please retry:

* [GPU tests / gpu_test (3.10, stable)](https://hud.pytorch.org/pr/pytorch/torchtune/1930#32728740056) ([gh](https://github.com/pytorch/torchtune/actions/runs/11747237995/job/32728740056)) `tests/torchtune/training/test_distributed.py::TestFullyShardState::test_qlora_state_dict`
* [GPU tests / gpu_test (3.9, stable)](https://hud.pytorch.org/pr/pytorch/torchtune/1930#32728739344) ([gh](https://github.com/pytorch/torchtune/actions/runs/11747237995/job/32728739344)) `##[error]The operation was canceled.`

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ebsmothers commented 1 week ago

Hi @mori360 I missed this one until now. Can you share more context on this PR?