For the 13B model, we use half the learning rate (1.5e-4) and half the batch size (2M tokens) of the 7B model.
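For anyone mapping these token counts to per-step batch sizes, here is a quick sanity check. It assumes a 2048-token context window (the LLaMA v1 default, which I believe OpenLLaMA follows) and takes "2M" literally as 2e6 tokens, so the result is approximate:

```python
# Rough conversion from tokens-per-batch to sequences-per-batch.
# Assumes a 2048-token context; "2M" is read as 2e6 tokens here,
# so treat the output as approximate (it is exactly 1024 if "2M" means 2**21).
context_len = 2048
tokens_per_batch = 2_000_000            # 13B batch size quoted above
seqs_per_batch = tokens_per_batch / context_len
print(seqs_per_batch)                   # ~976.6 sequences per global batch
```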
Thanks for your reply!
Hi @Haijunlv, I wonder if you solved the problem after halving the LR and batch size? Or have you found any other solutions?
Thanks for your nice work!
I would also like to reproduce LLaMA-13B pretraining, but I encountered some loss spikes at the very beginning (during the warmup steps). I tried several methods to reduce the spikes and finally found that lowering the max LR to 1.2e-4 smooths the loss curve. So what is the max LR in your OpenLLaMA-13B v1? Did you use the same training hyperparameters as the original LLaMA paper? And did you encounter any loss spikes at the very beginning stage?
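For context, this is a minimal sketch of the schedule I am assuming: linear warmup to max_lr, then cosine decay to 10% of max_lr, as described in the LLaMA paper. The warmup_steps and total_steps values below are placeholders for illustration, not OpenLLaMA's actual config:

```python
import math

def lr_at_step(step, max_lr=1.2e-4, warmup_steps=2000, total_steps=250_000, min_ratio=0.1):
    """Linear warmup to max_lr, then cosine decay to min_ratio * max_lr.

    max_lr=1.2e-4 is the value that removed my early loss spikes;
    warmup_steps/total_steps are illustrative placeholders.
    """
    if step < warmup_steps:
        # During warmup the LR grows linearly, so lowering max_lr also lowers
        # the LR at every warmup step, which is what seems to tame the spikes.
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return max_lr * (min_ratio + (1.0 - min_ratio) * cosine)

# LR at warmup step 500 with max_lr=3e-4 (paper value) vs 1.2e-4 (what worked for me)
print(lr_at_step(500, max_lr=3e-4))    # ~7.5e-5
print(lr_at_step(500, max_lr=1.2e-4))  # ~3.0e-5
```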
I used the hyperparameters from the paper. The other possible hyperparameters, the model architecture, and the tokenizer can be confirmed from a previous continual-pretraining experiment. The dataset is RedPajama.
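For completeness, these are the 13B hyperparameters as I read them from the LLaMA paper; please double-check against the paper, and note the dict layout below is just for discussion, not an actual training config file:

```python
# 13B hyperparameters as reported in the LLaMA paper (to the best of my reading).
llama_13b_paper_config = {
    "dim": 5120,
    "n_layers": 40,
    "n_heads": 40,
    "context_len": 2048,
    "max_lr": 3.0e-4,                   # cosine decay to 10% of max
    "warmup_steps": 2000,
    "batch_size_tokens": 4_000_000,     # 4M tokens per batch
    "total_tokens": 1_000_000_000_000,  # 1.0T training tokens
    "optimizer": "AdamW",
    "betas": (0.9, 0.95),
    "weight_decay": 0.1,
    "grad_clip": 1.0,
}
```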