openlm-research / open_llama

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Apache License 2.0

open llama 13b training hyperparameter question #74

Closed Haijunlv closed 1 year ago

Haijunlv commented 1 year ago

Thanks for your nice work!
I would also like to reproduce the LLaMA-13B pretraining, but I encountered some loss spikes at the very beginning (during the warmup steps). I tried several methods to reduce the spikes, and finally found that lowering the max LR to 1.2e-4 smooths the loss curve. So what is the max LR in your OpenLLaMA-13B v1? Did you use the same training hyperparameters as the original LLaMA paper? And did you encounter any loss spikes at the very beginning of training?

[Image: training loss curve showing spikes during the warmup steps]

I used the hyperparameters from the paper (a config sketch follows the list):

  1. 2000 warmup steps
  2. cosine learning rate schedule, max lr = 3e-4
  3. AdamW: beta1=0.9, beta2=0.95, weight_decay=0.1, clip_grad=1.0
  4. sequence length = 2048
  5. GLOBAL_BATCH_SIZE = 2048, i.e. batch size in tokens = 2048*2048 = 4M tokens
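
For reference, here is a minimal PyTorch sketch of an optimizer and schedule matching the numbers above. The total step count, the 10% minimum-LR ratio, and the stand-in model are my own assumptions, not taken from the actual run:

```python
import math
import torch

# Hyperparameters listed above (7B-style settings from the LLaMA paper).
MAX_LR = 3e-4
WARMUP_STEPS = 2000
TOTAL_STEPS = 250_000   # placeholder: total optimizer steps, not stated in the thread
MIN_LR_RATIO = 0.1      # assumed: cosine decays to 10% of the max LR

model = torch.nn.Linear(8, 8)  # stand-in for the actual transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=MAX_LR,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

def lr_lambda(step: int) -> float:
    """Linear warmup for WARMUP_STEPS, then cosine decay to MIN_LR_RATIO * MAX_LR."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Each training step: clip gradients before optimizer.step(), then scheduler.step(), e.g.
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```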

And other possibly relevant hyperparameters (see the sketch after this list):

  1. init_method_std = 0.008, like PaLM-540B: sqrt(1/(NHIDDEN*3))
  2. no weight decay on embedding and RMSNorm parameters
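
A hedged sketch of how those two points could be implemented. The name-based filters (`embed`, `norm`) are illustrative assumptions, since the exact module names depend on the codebase:

```python
import math
import torch
from torch import nn

NHIDDEN = 5120  # LLaMA-13B hidden size
INIT_STD = math.sqrt(1.0 / (NHIDDEN * 3))  # ~0.008, PaLM-style init mentioned above

def init_weights(module: nn.Module) -> None:
    """Initialize linear/embedding weights with a small normal std."""
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=INIT_STD)
        if isinstance(module, nn.Linear) and module.bias is not None:
            nn.init.zeros_(module.bias)

def param_groups(model: nn.Module, weight_decay: float = 0.1):
    """Exclude embedding and (RMS)norm parameters from weight decay."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Name/shape-based filter is an assumption; adjust to the real module names.
        if "embed" in name or "norm" in name or param.ndim == 1:
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```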

The model architecture and tokenizer were already validated in a previous continual-pretraining experiment. The dataset is RedPajama.

young-geng commented 1 year ago

For the 13B model, we use half the learning rate (1.5e-4) and half the batch size (2M tokens) of the 7B model.
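
In other words, relative to the 7B settings listed above, the implied 13B values would be something like the sketch below; the split into sequences per step assumes the sequence length stays at 2048:

```python
SEQ_LEN = 2048
GLOBAL_BATCH_SIZE_13B = 1024                             # half of 2048 sequences per step (assumed)
TOKENS_PER_STEP_13B = GLOBAL_BATCH_SIZE_13B * SEQ_LEN    # ~2M tokens
MAX_LR_13B = 1.5e-4                                      # half of the 7B max LR (3e-4)
```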

Haijunlv commented 1 year ago

Thanks for your reply!

keyu-tian commented 10 months ago

Hi @Haijunlv, I wonder if you solved the problem after halving the LR and batch size, or have you found any other solutions?