ml-explore / mlx-examples

Examples in the MLX framework

Learning rate approaches warmup_init value #985

Closed · hschaeufler closed 2 months ago

hschaeufler commented 2 months ago

I have a problem where, after about 130 iterations, the learning rate drops back to the warmup_init value of 1e-7 instead of settling at the desired learning rate of 1e-5. Have I got something wrong in my configuration, or have I misunderstood something?

My config is based on the example yaml: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/examples/lora_config.yaml

Below is the log output.

> mlx_lm.lora --config "fine_tuning/lora_config.yaml"
Loading configuration file fine_tuning/lora_config.yaml
Loading pretrained model
Fetching 11 files: 100%|████████████████████| 11/11 [00:00<00:00, 157465.34it/s]
Loading datasets
Training
Trainable parameters: 0.261% (20.972M/8030.261M)
Starting training..., iters: 5145
Iter 1: Val loss 0.709, Val took 135.128s
Iter 10: Train loss 0.972, Learning Rate 1.585e-06, It/sec 0.126, Tokens/sec 204.878, Trained Tokens 16254, Peak mem 36.354 GB
Iter 20: Train loss 0.696, Learning Rate 3.235e-06, It/sec 0.076, Tokens/sec 192.299, Trained Tokens 41623, Peak mem 46.288 GB
Iter 30: Train loss 0.781, Learning Rate 4.885e-06, It/sec 0.098, Tokens/sec 213.856, Trained Tokens 63390, Peak mem 46.288 GB
Iter 40: Train loss 0.800, Learning Rate 6.535e-06, It/sec 0.091, Tokens/sec 201.650, Trained Tokens 85601, Peak mem 46.288 GB
Iter 50: Train loss 0.732, Learning Rate 8.185e-06, It/sec 0.089, Tokens/sec 205.768, Trained Tokens 108642, Peak mem 48.962 GB
Iter 60: Train loss 0.617, Learning Rate 9.835e-06, It/sec 0.089, Tokens/sec 216.993, Trained Tokens 133092, Peak mem 48.962 GB
Iter 70: Train loss 0.558, Learning Rate 9.599e-06, It/sec 0.062, Tokens/sec 210.982, Trained Tokens 167041, Peak mem 88.648 GB
Iter 80: Train loss 0.607, Learning Rate 8.080e-06, It/sec 0.104, Tokens/sec 226.268, Trained Tokens 188765, Peak mem 88.648 GB
[WARNING] Some sequences are longer than 8096 tokens. The longest sentence 11335 will be truncated to 8096. Consider pre-splitting your data to save memory.
Iter 90: Train loss 0.653, Learning Rate 5.800e-06, It/sec 0.034, Tokens/sec 144.239, Trained Tokens 230838, Peak mem 136.723 GB
Iter 100: Train loss 0.500, Learning Rate 3.331e-06, It/sec 0.088, Tokens/sec 218.473, Trained Tokens 255571, Peak mem 136.723 GB
Iter 100: Saved adapter weights to adapters/adapters.safetensors and adapters/0000100_adapters.safetensors.
Iter 110: Train loss 0.438, Learning Rate 1.294e-06, It/sec 0.073, Tokens/sec 217.987, Trained Tokens 285236, Peak mem 136.723 GB
Iter 120: Train loss 0.487, Learning Rate 2.013e-07, It/sec 0.075, Tokens/sec 224.668, Trained Tokens 315286, Peak mem 136.723 GB
Iter 130: Train loss 0.445, Learning Rate 1.000e-07, It/sec 0.068, Tokens/sec 217.844, Trained Tokens 347202, Peak mem 136.723 GB
Iter 140: Train loss 0.493, Learning Rate 1.000e-07, It/sec 0.071, Tokens/sec 214.961, Trained Tokens 377388, Peak mem 136.723 GB

Here is my config.

# Source: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/examples/lora_config.yaml
model: "meta-llama/Meta-Llama-3.1-8B-Instruct"
train: true
# Directory with {train, valid, test}.jsonl files
data: "fine_tuning/"
# The PRNG seed
seed: 0
# Number of layers to fine-tune
lora_layers: 16
# Minibatch size.
batch_size: 2
# see: https://github.com/ml-explore/mlx/discussions/728
# Iterations to train for.
# iters = (len(train_set) / batch_size) * epochs
# 5145 = 2058 / 2 * 5
iters: 5145
# Number of validation batches, -1 uses the entire validation set.
# 25 seems to be the default
val_batches: 25
# Adam learning rate.
learning_rate: 1e-5
# Number of training steps between loss reporting.
# default: 10
steps_per_report: 10
# Number of training steps between validations.
# default: 200
steps_per_eval: 200
# Load path to resume training with the given adapter weights.
resume_adapter_file: null
# Save/load path for the trained adapter weights.
adapter_path: "adapters"
# Save the model every N iterations.
# default: 100
save_every: 100
# Evaluate on the test set after training
test: false
# Number of test set batches, -1 uses the entire test set.
test_batches: 100
# Maximum sequence length.
max_seq_length: 8096
# Use gradient checkpointing to reduce memory use.
grad_checkpoint: true
# Use DoRA instead of LoRA.
use_dora: false
# LoRA parameters can only be specified in a config file
lora_parameters:
  keys: [
    "self_attn.q_proj",
    "self_attn.v_proj",
    "self_attn.k_proj",
    "self_attn.o_proj",
    "mlp.gate_proj",
    "mlp.down_proj",
    "mlp.up_proj"
  ]
  scale: 0.5
  rank: 16
  # alpha is calculated as rank * scale
  # alpha: 8
  dropout: 0.05
lr_schedule:
  name: cosine_decay
  # steps * 0.03 (3% warmup rate)
  # Biderman et al. (2024)
  warmup: 60 # 0 for no warmup
  warmup_init: 1e-7
  arguments: [1e-5, 1000, 1e-7] # passed to scheduler
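
For reference, the resulting schedule can be inspected outside of training with a quick sketch like the one below (I'm assuming the trainer builds it via build_schedule from mlx_lm.tuner.utils; the exact import path may differ between versions):

from mlx_lm.tuner.utils import build_schedule

# The same lr_schedule block as in the YAML above.
lr_config = {
    "name": "cosine_decay",
    "warmup": 60,
    "warmup_init": 1e-7,
    "arguments": [1e-5, 1000, 1e-7],
}

schedule = build_schedule(lr_config)

# Print the LR at some of the iterations from the log above.
for step in (10, 60, 130, 530, 1060):
    print(step, schedule(step))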

If I change the schedule arguments to [1e-7, 60, 1e-5], the desired learning rate seems to be achieved. What also surprises me, however, is that the learning rate is only reached after 130 iterations, and not from iteration 60 onwards as configured with the warmup.

Iter 10: Train loss 0.972, Learning Rate 1.000e-07, It/sec 0.141, Tokens/sec 228.584, Trained Tokens 16254, Peak mem 36.354 GB
Iter 20: Train loss 0.697, Learning Rate 1.000e-07, It/sec 0.086, Tokens/sec 218.702, Trained Tokens 41623, Peak mem 46.288 GB
Iter 30: Train loss 0.790, Learning Rate 1.000e-07, It/sec 0.104, Tokens/sec 226.158, Trained Tokens 63390, Peak mem 46.288 GB
Iter 40: Train loss 0.847, Learning Rate 1.000e-07, It/sec 0.099, Tokens/sec 219.556, Trained Tokens 85601, Peak mem 46.288 GB
Iter 50: Train loss 0.854, Learning Rate 1.000e-07, It/sec 0.096, Tokens/sec 221.054, Trained Tokens 108642, Peak mem 48.962 GB
Iter 60: Train loss 0.782, Learning Rate 1.000e-07, It/sec 0.091, Tokens/sec 223.587, Trained Tokens 133092, Peak mem 48.962 GB
Iter 70: Train loss 0.750, Learning Rate 5.279e-07, It/sec 0.063, Tokens/sec 214.208, Trained Tokens 167041, Peak mem 88.648 GB
Iter 80: Train loss 0.847, Learning Rate 2.140e-06, It/sec 0.104, Tokens/sec 226.501, Trained Tokens 188765, Peak mem 88.648 GB
[WARNING] Some sequences are longer than 8096 tokens. The longest sentence 11335 will be truncated to 8096. Consider pre-splitting your data to save memory.
Iter 90: Train loss 0.866, Learning Rate 4.533e-06, It/sec 0.032, Tokens/sec 133.859, Trained Tokens 230838, Peak mem 136.723 GB
Iter 100: Train loss 0.761, Learning Rate 7.063e-06, It/sec 0.087, Tokens/sec 214.779, Trained Tokens 255571, Peak mem 136.723 GB
Iter 100: Saved adapter weights to adapters/adapters.safetensors and adapters/0000100_adapters.safetensors.
Iter 110: Train loss 0.642, Learning Rate 9.055e-06, It/sec 0.073, Tokens/sec 215.079, Trained Tokens 285236, Peak mem 136.723 GB
Iter 120: Train loss 0.597, Learning Rate 9.973e-06, It/sec 0.074, Tokens/sec 221.571, Trained Tokens 315286, Peak mem 136.723 GB
Iter 130: Train loss 0.533, Learning Rate 1.000e-05, It/sec 0.067, Tokens/sec 214.864, Trained Tokens 347202, Peak mem 136.723 GB
Iter 140: Train loss 0.559, Learning Rate 1.000e-05, It/sec 0.070, Tokens/sec 211.840, Trained Tokens 377388, Peak mem 136.723 GB
Iter 150: Train loss 0.521, Learning Rate 1.000e-05, It/sec 0.080, Tokens/sec 220.647, Trained Tokens 404935, Peak mem 136.723 GB
Iter 160: Train loss 0.565, Learning Rate 1.000e-05, It/sec 0.094, Tokens/sec 216.510, Trained Tokens 427956, Peak mem 136.723 GB
Iter 170: Train loss 0.559, Learning Rate 1.000e-05, It/sec 0.100, Tokens/sec 217.131, Trained Tokens 449566, Peak mem 136.723 GB
Iter 180: Train loss 0.444, Learning Rate 1.000e-05, It/sec 0.057, Tokens/sec 209.930, Trained Tokens 486432, Peak mem 136.723 GB
[WARNING] Some sequences are longer than 8096 tokens. The longest sentence 9646 will be truncated to 8096. Consider pre-splitting your data to save memory.
Iter 190: Train loss 0.463, Learning Rate 1.000e-05, It/sec 0.034, Tokens/sec 137.806, Trained Tokens 526878, Peak mem 136.723 GB
chimezie commented 2 months ago
lr_schedule:
  name: cosine_decay
  warmup: 60 
  warmup_init: 1e-7
  arguments: [1e-5, 1000, 1e-7] 

So, the warmup value is the number of steps over which the LR will increase from warmup_init to the starting LR of the schedule being used (cosine_decay in this case). The arguments to that schedule are:

  • init (float): Initial value.
  • decay_steps (int): Number of steps to decay over. The decayed value is constant for steps beyond decay_steps.
  • end (float, optional): Final value to decay to. Default: 0.

The configuration as you have it will have the LR start at 1e-7, increase to 1e-5 after 60 steps, and then follow the Cosine annealing curve for 1000 steps, ending with an LR of 1e-7 through the rest of the training.

If you want the rate to bottom out at 1e-5 after 130 iterations, starting from a cosine LR of 1e-4 (used here as an example, because I didn't see you mention what the initial LR value for the Cosine curve should be), you will need:

lr_schedule:
  name: cosine_decay
  warmup: 60 
  warmup_init: 1e-7
  arguments: [1e-4, 130, 1e-5] 

So, with the warmup, there are 3 phases:

  1. The warmup phase (which lasts for warmup number of steps), which linearly increases the LR from warmup_init to whatever the initial LR for the Cosine phase is (the first argument to the cosine_decay schedule: init)
  2. The Cosine annealing phase, which lasts for as many steps as the second argument (decay_steps) and bottoms out at the LR specified by the third argument (end)
  3. The remaining (optional) phase, which keeps the LR at the value specified by the third argument for the remainder of the training

If the total number of training iterations is the same as the length of phases 1 and 2 combined, then there will be no third phase.

Phase 2 is shaped like this (from A Newbie’s Guide to Stochastic Gradient Descent With Restarts):

[Image: cosine annealing curve, decaying from the initial LR to the end LR over the decay steps]
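
If it helps to see the mechanics, here is a minimal sketch of how the three phases compose (assuming mlx_lm stitches these together from mlx.optimizers' linear_schedule, cosine_decay, and join_schedules; the end argument of cosine_decay needs a reasonably recent MLX):

import mlx.optimizers as optim

# Phase 1: linear warmup from warmup_init (1e-7) to the cosine init (1e-5)
# over the first 60 steps.
warmup = optim.linear_schedule(1e-7, 1e-5, 60)

# Phases 2 and 3: cosine decay from init=1e-5 over decay_steps=1000 down to
# end=1e-7; the value stays constant at 1e-7 beyond decay_steps.
cosine = optim.cosine_decay(1e-5, 1000, 1e-7)

# Hand off from the warmup to the cosine schedule at step 60.
lr_schedule = optim.join_schedules([warmup, cosine], [60])

for step in (0, 30, 60, 560, 1060, 5000):
    print(step, lr_schedule(step))

So with your original arguments, 1e-5 is only the peak the LR reaches at the end of the warmup; the cosine decay then takes it down to 1e-7, which is where it stays.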

hschaeufler commented 2 months ago

Thank you very much, this has helped me a lot. Do you have any recommended values for the learning rate and eta_min for Llama 3.1? I would use about 3% of the steps as warmup.