pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Add LR Scheduler to full finetune distributed #2017

Closed: parthsarthi03 closed this PR 2 days ago

parthsarthi03 commented 5 days ago

Context

What is the purpose of this PR? It adds a new feature.

Please link to any issues this PR addresses: #1308

Purpose of this PR:

This PR adds support for an optional learning rate scheduler to the FullFinetuneRecipeDistributed recipe, so users can configure a scheduler and have it applied during training.

You can enable it by adding the following to your config file:

lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 50
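
For reference, here is a minimal runnable sketch of roughly what this config resolves to inside the recipe. The total-steps computation and the step-per-optimizer-step placement are assumptions about the recipe internals, not code taken from this PR:

import torch
from torch.optim import AdamW
from torchtune.training.lr_schedulers import get_cosine_schedule_with_warmup

# Tiny stand-in model so the sketch runs on its own.
model = torch.nn.Linear(8, 8)
optimizer = AdamW(model.parameters(), lr=2e-5)

total_steps = 1000  # in the recipe this would come from epochs * steps_per_epoch (assumption)
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=50,             # matches the config above
    num_training_steps=total_steps,
)

for _ in range(total_steps):
    optimizer.step()                 # forward/backward elided
    lr_scheduler.step()              # advance the schedule once per optimizer step

Since the lr_scheduler entry is optional, configs that omit it keep the previous constant-learning-rate behavior.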

Changelog

What are the changes made in this PR?

  * Adds support for an optional lr_scheduler config entry in FullFinetuneRecipeDistributed: when present, the scheduler is instantiated during recipe setup and stepped over the course of training; when absent, the learning rate stays constant as before.
  * Works with and without optimizer-in-backward, and when resuming training from a checkpoint (see the test plan below).

Test plan

Tested on 4 GPUs with the following configurations (W&B project: https://wandb.ai/psarthi/torchtune_lr_scheduler_tests):

  1. No Learning Rate Scheduler, No Optimizer-in-Backward: https://wandb.ai/psarthi/torchtune_lr_scheduler_tests/runs/e1ddni13
  2. No Learning Rate Scheduler, With Optimizer-in-Backward: https://wandb.ai/psarthi/torchtune_lr_scheduler_tests/runs/km2jw6rs
  3. Cosine Learning Rate Scheduler with 50 Warmup Steps, With Optimizer-in-Backward: https://wandb.ai/psarthi/torchtune_lr_scheduler_tests/runs/lfacg1b8
  4. Cosine Learning Rate Scheduler with 50 Warmup Steps, Without Optimizer-in-Backward: https://wandb.ai/psarthi/torchtune_lr_scheduler_tests/runs/ymktfbam
  5. Resuming Training with Learning Rate Scheduler, Without Optimizer-in-Backward: https://wandb.ai/psarthi/torchtune_lr_scheduler_tests/runs/ckia4yzi
pytorch-bot[bot] commented 5 days ago

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2017

Note: Links to docs will display an error until the docs builds have been completed.


:white_check_mark: No Failures

As of commit cfd2eb4a3e0a9bb03fc3e71483822947eb6db5b5 with merge base 0c31907a20c6f031c9b891fe1968c7cc69742eeb: :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

felipemello1 commented 3 days ago

Thanks for the PR! I glanced over it and it looks great! I will review it more carefully tomorrow and merge it if I don't find any issues :)

gordicaleksa commented 2 days ago

Consider refactoring (extracting into a separate file) because this same setup function is used in full_finetune_single_device.py (https://github.com/pytorch/torchtune/blob/main/recipes/full_finetune_single_device.py#L496)

Eventually they'll fall out of sync.

cc: @felipemello1

(I've hit this same issue and was about to submit a PR, but noticed this one :))
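
For concreteness, a rough sketch of what such a shared helper could look like. The function name, location, and signature here are hypothetical, not what either recipe currently defines, and the optimizer-in-backward case is ignored for brevity:

from typing import Optional

from omegaconf import DictConfig
from torch.optim import Optimizer
from torch.optim.lr_scheduler import LRScheduler

from torchtune import config


def setup_lr_scheduler(
    cfg_lr_scheduler: Optional[DictConfig],
    optimizer: Optimizer,
    num_training_steps: int,
    last_epoch: int = -1,
) -> Optional[LRScheduler]:
    """Build an optional LR scheduler from its config section, or return None.

    config.instantiate constructs the component named by _component_ and
    forwards the remaining keys (e.g. num_warmup_steps) as keyword arguments.
    last_epoch lets a resumed run continue the schedule where it left off.
    """
    if cfg_lr_scheduler is None:
        return None
    return config.instantiate(
        cfg_lr_scheduler,
        optimizer,
        num_training_steps=num_training_steps,
        last_epoch=last_epoch,
    )

Both recipes could then call this one helper instead of carrying their own copies of the setup logic.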

gordicaleksa commented 2 days ago

It might also be worthwhile to add something like:

lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 10

to the configs, e.g. for Llama 3.1 (8B/70B); an explicit entry is better than leaving users unsure which scheduler is being used.

felipemello1 commented 2 days ago

@gordicaleksa, great point! We are currently having some internal discussions about what should be exposed in the recipe and what should be a utility. In general, we are ok with repeating code so it is easy for people to hack on and make their changes. But there are use cases like this one that seem pretty standard and really don't add much value by being exposed. We will work on making our recipes a bit leaner soon.