For the 13B model, we use half the learning rate (1.5e-4) and half the batch size (2M tokens) of the 7B model.
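For anyone mapping these token counts to per-step batch sizes, here is a quick sanity check. It assumes a 2048-token context window (the LLaMA v1 default, which I believe OpenLLaMA follows) and takes "2M" literally as 2e6 tokens, so the result is approximate:

```python
# Rough conversion from tokens-per-batch to sequences-per-batch.
# Assumes a 2048-token context; "2M" is read as 2e6 tokens here,
# so treat the output as approximate (it is exactly 1024 if "2M" means 2**21).
context_len = 2048
tokens_per_batch = 2_000_000            # 13B batch size quoted above
seqs_per_batch = tokens_per_batch / context_len
print(seqs_per_batch)                   # ~976.6 sequences per global batch
```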
Thanks for your reply!
Hi @Haijunlv, I wonder if you solved the problem after halving the LR and batch size? Or have you found any other solutions?
Thanks for your nice work!
I would also like to reproduce LLaMA-13B pretraining, but I encountered some loss spikes at the very beginning (during the warmup steps). I tried several methods to reduce the spikes and finally found that lowering the max LR to 1.2e-4 smooths the loss curve. So what is the max LR in your OpenLLaMA-13B v1? Did you use the same training hyperparameters as the original LLaMA paper? And did you encounter any loss spikes at the very beginning stage?
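For context, this is a minimal sketch of the schedule I am assuming: linear warmup to max_lr, then cosine decay to 10% of max_lr, as described in the LLaMA paper. The warmup_steps and total_steps values below are placeholders for illustration, not OpenLLaMA's actual config:

```python
import math

def lr_at_step(step, max_lr=1.2e-4, warmup_steps=2000, total_steps=250_000, min_ratio=0.1):
    """Linear warmup to max_lr, then cosine decay to min_ratio * max_lr.

    max_lr=1.2e-4 is the value that removed my early loss spikes;
    warmup_steps/total_steps are illustrative placeholders.
    """
    if step < warmup_steps:
        # During warmup the LR grows linearly, so lowering max_lr also lowers
        # the LR at every warmup step, which is what seems to tame the spikes.
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return max_lr * (min_ratio + (1.0 - min_ratio) * cosine)

# LR at warmup step 500 with max_lr=3e-4 (paper value) vs 1.2e-4 (what worked for me)
print(lr_at_step(500, max_lr=3e-4))    # ~7.5e-5
print(lr_at_step(500, max_lr=1.2e-4))  # ~3.0e-5
```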
I used the hyperparameters from the paper. The other possible hyperparameters, the model architecture, and the tokenizer can be confirmed from a previous continual-pretraining experiment. The dataset is RedPajama.
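For completeness, these are the 13B hyperparameters as I read them from the LLaMA paper; please double-check against the paper, and note the dict layout below is just for discussion, not an actual training config file:

```python
# 13B hyperparameters as reported in the LLaMA paper (to the best of my reading).
llama_13b_paper_config = {
    "dim": 5120,
    "n_layers": 40,
    "n_heads": 40,
    "context_len": 2048,
    "max_lr": 3.0e-4,                   # cosine decay to 10% of max
    "warmup_steps": 2000,
    "batch_size_tokens": 4_000_000,     # 4M tokens per batch
    "total_tokens": 1_000_000_000_000,  # 1.0T training tokens
    "optimizer": "AdamW",
    "betas": (0.9, 0.95),
    "weight_decay": 0.1,
    "grad_clip": 1.0,
}
```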