felipemello1 opened 4 weeks ago
Note: Links to docs will display an error until the docs builds have been completed.
As of commit edfa1889e84d6555cb8d75e27de3d5fe03d76b26 with merge base eab21f07065574d883b3ec7620a55f1d92f67c8a:
* [GPU tests / gpu_test (3.11, stable)](https://hud.pytorch.org/pr/pytorch/torchtune/1938#32371649515) ([gh](https://github.com/pytorch/torchtune/actions/runs/11623888918/job/32371649515))
    `tests/recipes/test_full_finetune_single_device.py::TestFullFinetuneSingleDeviceRecipe::test_training_state_on_resume`
* [Recipe Tests / recipe_test (3.11)](https://hud.pytorch.org/pr/pytorch/torchtune/1938#32371646927) ([gh](https://github.com/pytorch/torchtune/actions/runs/11623888923/job/32371646927))
    `tests/recipes/test_full_finetune_single_device.py::TestFullFinetuneSingleDeviceRecipe::test_training_state_on_resume`
* [GPU tests / gpu_test (3.10, stable)](https://hud.pytorch.org/pr/pytorch/torchtune/1938#32371649374) ([gh](https://github.com/pytorch/torchtune/actions/runs/11623888918/job/32371649374))
* [GPU tests / gpu_test (3.9, stable)](https://hud.pytorch.org/pr/pytorch/torchtune/1938#32371649215) ([gh](https://github.com/pytorch/torchtune/actions/runs/11623888918/job/32371649215))
    `##[error]The operation was canceled.`
* [Recipe Tests / recipe_test (3.10)](https://hud.pytorch.org/pr/pytorch/torchtune/1938#32371646778) ([gh](https://github.com/pytorch/torchtune/actions/runs/11623888923/job/32371646778))
    `##[error]The operation was canceled.`
* [Recipe Tests / recipe_test (3.9)](https://hud.pytorch.org/pr/pytorch/torchtune/1938#32371646644) ([gh](https://github.com/pytorch/torchtune/actions/runs/11623888923/job/32371646644))
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Compiling the `chunked_output` will break for tied embeddings + FSDP.
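For context, "tied embeddings" means the output projection reuses the token-embedding weight instead of owning its own parameter, so under FSDP a single shared parameter is referenced from two places. A minimal illustration of the pattern (the class below is hypothetical, not torchtune's actual tied-embedding module):

```python
import torch
import torch.nn as nn

class TiedOutput(nn.Module):
    """Hypothetical tied output head: reuses the embedding weight rather
    than owning a separate [vocab_size, dim] projection matrix."""

    def __init__(self, tok_embeddings: nn.Embedding):
        super().__init__()
        self.tok_embeddings = tok_embeddings

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The same weight tensor serves both lookup (ids -> vectors)
        # and projection (vectors -> vocab logits).
        return nn.functional.linear(x, self.tok_embeddings.weight)

emb = nn.Embedding(num_embeddings=128, embedding_dim=16)
head = TiedOutput(emb)
logits = head(torch.randn(2, 4, 16))  # [batch, seq, vocab]
assert head.tok_embeddings.weight is emb.weight  # one shared parameter
```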
Context
What is the purpose of this PR?
Currently we compile only the transformer layers. However, we could also compile the embedding, the final norm, and the output layer.
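A rough sketch of what that could look like (attribute names assume torchtune's `TransformerDecoder` layout with `layers`, `tok_embeddings`, `norm`, and `output`; the helper itself is hypothetical, not the recipe's current code):

```python
import torch.nn as nn

def compile_model_plus_extras(model: nn.Module, backend: str = "inductor") -> None:
    # What we do today: compile each transformer layer individually.
    for layer in model.layers:
        layer.compile(backend=backend)
    # Proposed addition: also compile the embedding, final norm, and
    # output layer. nn.Module.compile() (PyTorch >= 2.2) wraps the
    # module's forward with torch.compile in place.
    for extra in (model.tok_embeddings, model.norm, model.output):
        extra.compile(backend=backend)
```

Compiling per-module rather than calling `torch.compile(model)` keeps any graph break local to the submodule that caused it.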
Test plan
* 3B with packing
* 8B with packing
* 11B, no packing, no activation offloading
Conclusion
Compiling the extra modules seems to help when the embeddings are tied. However, without packing there are more graph breaks, which slows down early training. We should fix the graph breaks and then potentially land this PR. Optionally, we could compile the extra layers only when the embeddings are tied, as sketched below.
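One hedged way to implement that option is to compile the extras only when the output head shares the embedding parameter; the attribute checks below are assumptions about the model layout, not a torchtune API:

```python
import torch.nn as nn

def has_tied_embeddings(model: nn.Module) -> bool:
    # Heuristic: embeddings are tied iff the output head's weight is the
    # very same tensor object as the token-embedding weight.
    out_weight = getattr(getattr(model, "output", None), "weight", None)
    return out_weight is not None and out_weight is model.tok_embeddings.weight

# Usage in a recipe's setup (both names refer to the sketches above):
# if has_tied_embeddings(model):
#     compile_model_plus_extras(model)
```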