lr_schedule:
name: cosine_decay
warmup: 60
warmup_init: 1e-7
arguments: [1e-5, 1000, 1e-7]
So, the warmup value is the number of steps over which the LR will increase from warmup_init to the starting LR of the schedule being used (cosine_decay in this case). The arguments to that schedule are:
- init (float): Initial value.
- decay_steps (int): Number of steps to decay over. The decayed value is constant for steps beyond decay_steps.
- end (float, optional): Final value to decay to. Default: 0.
The configuration as you have it will have the LR start at 1e-7, increase to 1e-5 after 60 steps, and then follow the cosine annealing curve for 1000 steps, ending with an LR of 1e-7 for the rest of the training.
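To see this concretely, here is a minimal sketch (not the exact mlx_lm code; it assumes the warmup is joined to the named schedule with join_schedules, and the boundary handling inside mlx_lm may differ by a step) that builds the equivalent schedule with mlx.optimizers and evaluates it:

```python
# Sketch: reproduce warmup + cosine_decay outside of training.
import mlx.core as mx
import mlx.optimizers as optim

warmup_init, warmup_steps = 1e-7, 60          # warmup_init / warmup
init, decay_steps, end = 1e-5, 1000, 1e-7     # arguments: [1e-5, 1000, 1e-7]

warmup = optim.linear_schedule(warmup_init, init, warmup_steps)
cosine = optim.cosine_decay(init, decay_steps, end)
lr_schedule = optim.join_schedules([warmup, cosine], [warmup_steps])

for step in (0, 30, 60, 560, 1060, 2000):
    print(step, float(lr_schedule(mx.array(step))))
# Roughly: 1e-7 at step 0, ~1e-5 around step 60, decaying back toward
# 1e-7 by step ~1060, then constant at 1e-7 afterwards.
```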
If you want the rate to bottom out at 1e-5 from a starting LR of 1e-4 (used here as an example, because I didn't see you mention what the initial LR value for the cosine curve is) after 130 iterations, you will need:
lr_schedule:
name: cosine_decay
warmup: 60
warmup_init: 1e-7
arguments: [1e-4, 130, 1e-5]
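The same kind of sketch with these corrected arguments, under the same assumptions about how the warmup is joined to the schedule:

```python
# Sketch: warmup 1e-7 -> 1e-4 over 60 steps, then cosine 1e-4 -> 1e-5
# over 130 steps, then constant at 1e-5.
import mlx.core as mx
import mlx.optimizers as optim

warmup = optim.linear_schedule(1e-7, 1e-4, 60)   # warmup_init -> init
cosine = optim.cosine_decay(1e-4, 130, 1e-5)     # arguments: [1e-4, 130, 1e-5]
lr_schedule = optim.join_schedules([warmup, cosine], [60])

print(float(lr_schedule(mx.array(60))))    # ~1e-4: start of the cosine phase
print(float(lr_schedule(mx.array(190))))   # ~1e-5: bottomed out after the warmup plus 130 decay steps
print(float(lr_schedule(mx.array(500))))   # stays at ~1e-5 for the rest of training
```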
So, with the warmup, there are 3 phases:
- The warmup phase (which lasts for warmup number of steps), which linearly increases the LR from warmup_init to whatever the initial LR for the cosine phase is (the first argument to the cosine_decay schedule: init)
- The cosine annealing phase, which lasts for as many steps as the second argument (decay_steps) and bottoms out at the LR specified by the third argument (end)
- The remaining (optional) phase, which keeps the LR at the value specified by the third argument for the remainder of the training
If the total number of training iterations is the same as the length of phases 1 and 2 combined, then there will be no third phase.
Phase 2 is shaped like this (from A Newbie’s Guide to Stochastic Gradient Descent With Restarts):
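Assuming cosine_decay implements the standard cosine annealing curve, the LR during phase 2 at step $t$ (counted from the start of the decay) is:

$$\mathrm{lr}(t) = \mathrm{end} + \tfrac{1}{2}\,(\mathrm{init} - \mathrm{end})\left(1 + \cos\frac{\pi t}{\mathrm{decay\_steps}}\right), \qquad 0 \le t \le \mathrm{decay\_steps}$$

which starts at init for $t = 0$ and reaches end at $t = \mathrm{decay\_steps}$.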
Thank you very much, this has helped me a lot. Do you have any recommendations for the learning rate and eta_min values for Llama 3.1? I would use about 3% of the steps as warmup.
I have the problem that, after 130 iterations, the learning rate is reduced back to the warmup_init value of 1e-7 instead of the desired learning rate of 1e-5. Have I got something wrong in my configuration, or have I misunderstood something?
My config is based on the example yaml: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/examples/lora_config.yaml
Below is the log output.
Here is my config.
If I change the schedule arguments as follows: [1e-7, 60, 1e-5], the desired learning rate seems to be achieved. What also surprises me, however, is that the learning rate is only reached after 130 iterations, and not from step 60 as configured with warmup.
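For reference, a quick way to check what a given lr_schedule configuration actually produces is to build it outside of training and print a few values, e.g. (a sketch under the same assumptions as above about how the warmup is joined to the schedule):

```python
# Sketch: warmup 60 steps from warmup_init 1e-7 with arguments [1e-7, 60, 1e-5].
# Here init == warmup_init, so the warmup phase stays flat at 1e-7, and the
# cosine phase then moves from 1e-7 to 1e-5 over the following 60 steps,
# i.e. under these assumptions 1e-5 is only reached around overall step 120,
# not at step 60.
import mlx.core as mx
import mlx.optimizers as optim

warmup = optim.linear_schedule(1e-7, 1e-7, 60)
cosine = optim.cosine_decay(1e-7, 60, 1e-5)
lr_schedule = optim.join_schedules([warmup, cosine], [60])

for step in (0, 60, 90, 120, 200):
    print(step, float(lr_schedule(mx.array(step))))
```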