pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

make RMSNorm module compatible with FSDP #1956

Closed: anshulverma closed this pull request 4 days ago

anshulverma commented 2 weeks ago

Summary: When parameters are initialized on the meta device, FSDP automatically calls each module's `reset_parameters` function if `param_init_fn` is not specified. For more details, see the wiki.

Differential Revision: D65496443
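For illustration, a minimal sketch of the kind of change the title describes: a `reset_parameters` hook added to an RMSNorm module so FSDP can materialize its meta-device parameter. The forward pass here mirrors a typical RMSNorm implementation and is included only to make the sketch self-contained.

```python
import torch
from torch import nn


class RMSNorm(nn.Module):
    """RMSNorm with a single learnable per-dimension scale parameter."""

    def __init__(self, dim: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def reset_parameters(self) -> None:
        # FSDP calls this to materialize meta-device parameters
        # when no param_init_fn is supplied.
        nn.init.ones_(self.scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize in fp32 for numerical stability, then cast back.
        x_fp32 = x.float()
        x_normed = x_fp32 * torch.rsqrt(
            x_fp32.pow(2).mean(-1, keepdim=True) + self.eps
        )
        return (x_normed * self.scale.float()).to(x.dtype)
```

With this hook in place, wrapping a meta-initialized model would no longer require a custom `param_init_fn` on account of RMSNorm.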

pytorch-bot[bot] commented 2 weeks ago

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1956

Note: Links to docs will display an error until the docs builds have been completed.

:x: 4 New Failures, 3 Cancelled Jobs

As of commit 304da4aa27d832c1c418556288bbe88a3e00b72f with merge base 4389b4d81398da0890aa686ef38cc15c898e2036:

NEW FAILURES - The following jobs have failed:

* [GPU tests / gpu_test (3.10, stable)](https://hud.pytorch.org/pr/pytorch/torchtune/1956#32611742077) ([gh](https://github.com/pytorch/torchtune/actions/runs/11699400748/job/32611742077)) `tests/torchtune/training/test_distributed.py::TestLoRAFSDP::test_lora_fsdp_wrap`
* [GPU tests / gpu_test (3.11, stable)](https://hud.pytorch.org/pr/pytorch/torchtune/1956#32611742488) ([gh](https://github.com/pytorch/torchtune/actions/runs/11699400748/job/32611742488)) `tests/torchtune/training/test_distributed.py::TestLoRAFSDP::test_lora_fsdp_wrap`
* [Lint / lint (3.10)](https://hud.pytorch.org/pr/pytorch/torchtune/1956#32611742542) ([gh](https://github.com/pytorch/torchtune/actions/runs/11699400753/job/32611742542)) `##[error]Process completed with exit code 1.`
* [Unit Test / unit_tests (3.11)](https://hud.pytorch.org/pr/pytorch/torchtune/1956#32611742859) ([gh](https://github.com/pytorch/torchtune/actions/runs/11699400779/job/32611742859)) `tests/torchtune/training/test_distributed.py::TestLoRAFSDP::test_lora_fsdp_wrap`

CANCELLED JOBS - The following jobs were cancelled. Please retry:

* [GPU tests / gpu_test (3.9, stable)](https://hud.pytorch.org/pr/pytorch/torchtune/1956#32611741458) ([gh](https://github.com/pytorch/torchtune/actions/runs/11699400748/job/32611741458)) `tests/torchtune/training/test_distributed.py::TestLoRAFSDP::test_lora_fsdp_wrap`
* [Unit Test / unit_tests (3.10)](https://hud.pytorch.org/pr/pytorch/torchtune/1956#32611742285) ([gh](https://github.com/pytorch/torchtune/actions/runs/11699400779/job/32611742285)) `##[error]The operation was canceled.`
* [Unit Test / unit_tests (3.9)](https://hud.pytorch.org/pr/pytorch/torchtune/1956#32611741791) ([gh](https://github.com/pytorch/torchtune/actions/runs/11699400779/job/32611741791)) `##[error]The operation was canceled.`

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot commented 2 weeks ago

This pull request was exported from Phabricator. Differential Revision: D65496443

felipemello1 commented 2 weeks ago

hey @anshulverma, we used to have FSDP1, and now we use FSDP2. I don't think we have ever seen any warning complaining about `reset_parameters`. Can you help me understand the context for this PR? Did something break, or did you find an error when running it?
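For reference, a minimal single-process sketch (gloo backend on CPU, with an `nn.Linear` standing in for a real model) of the FSDP1 behavior the PR summary refers to. FSDP2's `fully_shard` takes a different path, leaving parameter materialization to the user, which is presumably why no such warning has been seen.

```python
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Single-process process group so the sketch runs standalone.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Parameters created on the meta device: no real storage is allocated yet.
with torch.device("meta"):
    model = nn.Linear(16, 16)

# FSDP1: with meta-device parameters and no param_init_fn, wrapping calls
# each module's reset_parameters() to materialize real values. nn.Linear
# defines one; this PR proposes adding one to RMSNorm for the same reason.
sharded = FSDP(model)

dist.destroy_process_group()
```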

ebsmothers commented 4 days ago

@anshulverma I am going to close this PR. I may be missing the point (and if so, feel free to reopen), but it seems to me that we should never have any problems here, because we always load the RMSNorm scales as part of the pretrained checkpoint. So we will never run into issues with garbage initialization via e.g. usage of `to_empty` without proper initialization.
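For context, a minimal sketch of the flow being described, with an `nn.Linear` and a fabricated state dict standing in for the real model and pretrained checkpoint:

```python
import torch
from torch import nn

# Build the module structure on the meta device: no storage is allocated.
with torch.device("meta"):
    model = nn.Linear(16, 16)  # stand-in for the full torchtune model

# to_empty() allocates real but uninitialized (garbage) storage...
model = model.to_empty(device="cpu")

# ...which is immediately overwritten by the pretrained checkpoint,
# so the garbage values never reach a forward pass.
checkpoint = {"weight": torch.randn(16, 16), "bias": torch.zeros(16)}  # stand-in
model.load_state_dict(checkpoint)
```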