pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

[DO NOT LAND] compile more modules #1938

Open felipemello1 opened 4 weeks ago

felipemello1 commented 4 weeks ago

Context

Currently we compile only the transformer layers. However, we could also compile the token embeddings, the final norm, and the output layer:

# The final norm is an nn.Module, so it can be compiled in place.
if hasattr(model, "norm"):
    model.norm.compile(backend=backend)

# chunked_output is a method rather than an nn.Module, so wrap it with torch.compile.
if hasattr(model, "chunked_output"):
    model.chunked_output = torch.compile(model.chunked_output, backend=backend)

# The token embedding layer is also an nn.Module and can be compiled in place.
if hasattr(model, "token_embeddings"):
    model.token_embeddings.compile(backend=backend)
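
A quick way to see where these extra compiles introduce graph breaks (an issue noted in the conclusion below) is Dynamo's explain utility. A minimal sketch, assuming `model` is a torchtune model and `batch` is a tokenized input batch (both placeholders, not values from this PR):

import torch

# Trace one forward pass and report graph breaks and their causes.
explanation = torch._dynamo.explain(model)(batch["tokens"])
print(f"graph breaks: {explanation.graph_break_count}")
for reason in explanation.break_reasons:
    print(reason)

Alternatively, running the recipe with `TORCH_LOGS="graph_breaks"` set logs each break as it occurs.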

Test plan

3b with packing

tune run full_finetune_single_device --config llama3_2/3B_full_single_device \
  optimizer_in_bwd=True enable_activation_checkpointing=True enable_activation_offloading=True \
  optimizer._component_=torch.optim.AdamW optimizer.fused=True compile=True \
  dataset.packed=True dataset.split=train[:5%] tokenizer.max_seq_len=2048 \
  metric_logger=torchtune.training.metric_logging.WandBLogger metric_logger.project=profiling \
  log_every_n_steps=1 log_peak_memory_stats=True gradient_accumulation_steps=1 \
  max_steps_per_epoch=15 epochs=1 batch_size=5 metric_logger.name=baseline \
  loss=torchtune.modules.loss.CEWithChunkedOutputLoss
[benchmark screenshots]

8b with packing

[benchmark screenshot]

11b NO packing, NO act offloading

[benchmark screenshot]

Conclusion

Compiling the extra modules seems to help when the model has tied embeddings. However, without packing there are more graph breaks, which slow down early training. We should fix the graph breaks and then potentially land this PR. Alternatively, we could compile the extra layers only if the model has tied embeddings (see the sketch below).
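
A minimal sketch of that optional path, assuming weight tying is detectable by the output projection sharing the embedding's weight tensor (torchtune's tied-embedding models may express tying differently, so treat the check as an assumption; `compile_extra_modules_if_tied` is a hypothetical helper name):

import torch

def compile_extra_modules_if_tied(model: torch.nn.Module, backend: str = "inductor") -> None:
    # Only compile the extra modules when the output layer shares the embedding weight.
    out_w = getattr(getattr(model, "output", None), "weight", None)
    emb_w = getattr(getattr(model, "token_embeddings", None), "weight", None)
    if out_w is None or out_w is not emb_w:
        return  # untied embeddings: skip, since the extra compiles add graph breaks
    if hasattr(model, "norm"):
        model.norm.compile(backend=backend)
    if hasattr(model, "chunked_output"):
        model.chunked_output = torch.compile(model.chunked_output, backend=backend)
    model.token_embeddings.compile(backend=backend)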

pytorch-bot[bot] commented 4 weeks ago

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1938

Note: Links to docs will display an error until the docs builds have been completed.

:x: 2 New Failures, 4 Cancelled Jobs

As of commit edfa1889e84d6555cb8d75e27de3d5fe03d76b26 with merge base eab21f07065574d883b3ec7620a55f1d92f67c8a:

NEW FAILURES - The following jobs have failed:

* [GPU tests / gpu_test (3.11, stable)](https://hud.pytorch.org/pr/pytorch/torchtune/1938#32371649515) ([gh](https://github.com/pytorch/torchtune/actions/runs/11623888918/job/32371649515)) `tests/recipes/test_full_finetune_single_device.py::TestFullFinetuneSingleDeviceRecipe::test_training_state_on_resume`
* [Recipe Tests / recipe_test (3.11)](https://hud.pytorch.org/pr/pytorch/torchtune/1938#32371646927) ([gh](https://github.com/pytorch/torchtune/actions/runs/11623888923/job/32371646927)) `tests/recipes/test_full_finetune_single_device.py::TestFullFinetuneSingleDeviceRecipe::test_training_state_on_resume`

CANCELLED JOBS - The following jobs were cancelled. Please retry:

* [GPU tests / gpu_test (3.10, stable)](https://hud.pytorch.org/pr/pytorch/torchtune/1938#32371649374) ([gh](https://github.com/pytorch/torchtune/actions/runs/11623888918/job/32371649374))
* [GPU tests / gpu_test (3.9, stable)](https://hud.pytorch.org/pr/pytorch/torchtune/1938#32371649215) ([gh](https://github.com/pytorch/torchtune/actions/runs/11623888918/job/32371649215)) `##[error]The operation was canceled.`
* [Recipe Tests / recipe_test (3.10)](https://hud.pytorch.org/pr/pytorch/torchtune/1938#32371646778) ([gh](https://github.com/pytorch/torchtune/actions/runs/11623888923/job/32371646778)) `##[error]The operation was canceled.`
* [Recipe Tests / recipe_test (3.9)](https://hud.pytorch.org/pr/pytorch/torchtune/1938#32371646644) ([gh](https://github.com/pytorch/torchtune/actions/runs/11623888923/job/32371646644))

This comment was automatically generated by Dr. CI and updates every 15 minutes.

felipemello1 commented 4 weeks ago

Compiling chunked_output will break for tied embeddings + FSDP.
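
A minimal guard for that failure mode, reusing the (assumed) tying check from the sketch above; `is_distributed` would be supplied by the recipe, since detecting FSDP wrapping directly differs between FSDP1 and FSDP2:

import torch

def should_compile_chunked_output(model: torch.nn.Module, is_distributed: bool) -> bool:
    # Skip compiling chunked_output when embeddings are tied and FSDP is in use.
    out_w = getattr(getattr(model, "output", None), "weight", None)
    emb_w = getattr(getattr(model, "token_embeddings", None), "weight", None)
    tied = out_w is not None and out_w is emb_w
    return not (tied and is_distributed)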