Open Nandan91 opened 7 months ago
Thanks for your reply.
The training configuration you referred to seems to be set up for 600K training steps. As mentioned in the paper, you ran only 50K iterations to train on 2B tokens (and got an eval PPL of ~3). Did you change anything else, such as the learning rate, weight decay, etc.?
I trained for 50K iterations; however, my val loss stayed around 3 (PPL > 30).
No, I did not change anything such as the learning rate or weight decay. I recall that my numbers are around those reported in the original nanoGPT repo (https://github.com/karpathy/nanoGPT?tab=readme-ov-file#baselines).
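For anyone reading along, this is what "did not change anything" would look like in nanoGPT's config-override style; the values below are the stock nanoGPT GPT-2 (124M) optimizer defaults as I understand them, not a confirmed dump of the run behind the paper, so please double-check against the repo:

```python
# Sketch of the unchanged nanoGPT GPT-2 (124M) optimizer settings
# (assumed from the public nanoGPT defaults; treat as assumptions).
learning_rate = 6e-4      # max LR, cosine-decayed
weight_decay = 1e-1
beta1, beta2 = 0.9, 0.95
grad_clip = 1.0
warmup_iters = 2000
min_lr = 6e-5             # learning_rate / 10
```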
Hi! Interesting work on the role of explicit bias!
I was wondering which training settings got you an eval PPL of ~3.04. The paper mentions that 50K iterations are needed to train the GPT-2 model on 2B tokens. What were the per-device batch_size and block_size for that run? Did you train from scratch or fine-tune the pre-trained model (trained on 300B tokens)?
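For context, this is the rough token arithmetic I have in mind; the batch_size, block_size, and gradient-accumulation values below are placeholders for illustration, not numbers taken from the paper:

```python
# Tokens processed per optimizer step = batch_size * block_size * grad_accum_steps (* n_gpus).
# Placeholder values chosen so that 50K iterations works out to ~2B tokens.
batch_size = 40                    # per-device micro-batch (assumed)
block_size = 1024                  # sequence length (assumed)
gradient_accumulation_steps = 1    # assumed single GPU, no accumulation
max_iters = 50_000

tokens_per_iter = batch_size * block_size * gradient_accumulation_steps
total_tokens = tokens_per_iter * max_iters
print(f"{tokens_per_iter=:,} -> {total_tokens=:,}")  # 40,960 -> 2,048,000,000
```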
Thanks!