Open Nandan91 opened 7 months ago
Thanks for your reply.
The training configuration you referred to seems to be set up for 600K training steps. As mentioned in the paper, you ran only 50K iterations to train on 2B tokens (and got an eval PPL of ~3). Did you change anything else, such as the learning rate, weight decay, etc.?
I trained for 50K iterations; however, my val loss stayed around 3 (PPL > 30).
No, I did not change anything such as the learning rate or weight decay. I recall that my numbers are around those reported in the original nanoGPT repo (https://github.com/karpathy/nanoGPT?tab=readme-ov-file#baselines).
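For anyone reading along, this is what "did not change anything" would look like in nanoGPT's config-override style; the values below are the stock nanoGPT GPT-2 (124M) optimizer defaults as I understand them, not a confirmed dump of the run behind the paper, so please double-check against the repo:

```python
# Sketch of the unchanged nanoGPT GPT-2 (124M) optimizer settings
# (assumed from the public nanoGPT defaults; treat as assumptions).
learning_rate = 6e-4      # max LR, cosine-decayed
weight_decay = 1e-1
beta1, beta2 = 0.9, 0.95
grad_clip = 1.0
warmup_iters = 2000
min_lr = 6e-5             # learning_rate / 10
```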
Hi! Interesting work on the role of explicit bias!
I was wondering which training settings got you an eval PPL of ~3.04. The paper mentions that 50K iterations are needed to train the GPT-2 model on 2B tokens. What were the per-device batch_size and block_size for that run? Did you train from scratch or fine-tune the pre-trained model (trained on 300B tokens)?
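For context, this is the rough token arithmetic I have in mind; the batch_size, block_size, and gradient-accumulation values below are placeholders for illustration, not numbers taken from the paper:

```python
# Tokens processed per optimizer step = batch_size * block_size * grad_accum_steps (* n_gpus).
# Placeholder values chosen so that 50K iterations works out to ~2B tokens.
batch_size = 40                    # per-device micro-batch (assumed)
block_size = 1024                  # sequence length (assumed)
gradient_accumulation_steps = 1    # assumed single GPU, no accumulation
max_iters = 50_000

tokens_per_iter = batch_size * block_size * gradient_accumulation_steps
total_tokens = tokens_per_iter * max_iters
print(f"{tokens_per_iter=:,} -> {total_tokens=:,}")  # 40,960 -> 2,048,000,000
```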
Thanks!