karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Pretraining loss explosion #554

Open mattgorb opened 2 months ago

mattgorb commented 2 months ago

I have been trying to get this repo working for several months, but my loss keeps exploding between 30k and 100k iterations.

I have tried many things:

- Turning flash attention off (based on this issue: https://github.com/karpathy/nanoGPT/issues/524); a sketch of that edit is below
- Using fp16 (based on this: https://github.com/karpathy/nanoGPT/issues/468)
- Using the GPT-4 tokenizer (based on https://github.com/karpathy/nanoGPT/issues/468)
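For reference, as far as I can tell there is no train.py config flag for flash attention; the usual route is a one-line edit in model.py's CausalSelfAttention so the manual attention path is taken. A minimal sketch (line placement may differ in your checkout):

```python
# model.py, inside CausalSelfAttention.__init__ (sketch; verify against your checkout).
# Forcing this to False makes forward() take the manual softmax-attention branch,
# and keeps the causal-mask "bias" buffer registration that branch relies on.
self.flash = False  # was: hasattr(torch.nn.functional, 'scaled_dot_product_attention')
```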

At first the loss was climbing back up to about 8-10; now it just goes to NaN with fp16.
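One way to at least pin down the step where it blows up is a check like the following in the training loop (a debugging sketch, not part of the repo; `loss` and `iter_num` are the names train.py uses in my checkout, so adjust as needed):

```python
# Hypothetical addition to train.py's training loop: stop at the first
# non-finite loss so the offending iteration/batch can be inspected.
if not torch.isfinite(loss):
    print(f"non-finite loss {loss.item()} at iter {iter_num}")
    break
```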

I have also tinkered with other settings such as gradient clipping, learning rate, etc. I keep my configuration at roughly 500k tokens per batch.
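For concreteness, an override file in the style of config/train_gpt2.py that lands at roughly that token budget might look like this (a sketch with example values, not a recommended fix; the keys are ones train.py's configurator accepts):

```python
# config/train_gpt2_debug.py (hypothetical override file; example values only)
# tokens per iteration = gradient_accumulation_steps * batch_size * block_size
#                      = 40 * 12 * 1024 ≈ 491k
batch_size = 12
block_size = 1024
gradient_accumulation_steps = 40
learning_rate = 6e-4
grad_clip = 1.0          # gradient clipping threshold; 0.0 disables clipping
dtype = 'float16'        # 'bfloat16' is typically more stable on GPUs that support it
```

It would be run the usual way, e.g. `python train.py config/train_gpt2_debug.py`.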

I am lost on what to try next. Did anyone else fix this issue?

I have gotten GPT-2 Small down to about 3.0 loss.

HarrisonUnifyAI commented 3 days ago

My loss also exploded on an initial attempt with an 8x A100 instance. I disabled flash attention and that seems to have fixed it, though training is much slower.