mattgorb opened 2 months ago
I have been trying to get this repo working for several months, but my loss keeps exploding between 30k and 100k iterations.
I have tried many things:
- Turning flash attention off (based on this issue: https://github.com/karpathy/nanoGPT/issues/524)
- Using fp16 (based on this: https://github.com/karpathy/nanoGPT/issues/468)
- Using the GPT-4 tokenizer (based on https://github.com/karpathy/nanoGPT/issues/468)
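For context, the fp16 and tokenizer changes were small edits, roughly like the sketch below (the `dtype` variable is the standard train.py config knob, and the encoding swap goes in `data/openwebtext/prepare.py`; treat this as an approximation of what I changed, not exact diffs):

```python
# config override for train.py -- force fp16 instead of bf16 (sketch)
dtype = 'float16'  # train.py then runs the optimizer step through torch.cuda.amp.GradScaler

# data/openwebtext/prepare.py -- swap the tokenizer (sketch)
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer, instead of "gpt2"
# caveats: vocab_size in the model config has to grow to match (~100k tokens),
# the uint16 arrays prepare.py writes are too small for these token ids, and the
# dataset has to be re-tokenized from scratch
```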
At first the loss would climb back up to about 8-10; now with fp16 it just goes to NaN.
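To at least catch where it blows up, I've been dropping a small guard into the training loop, something like the sketch below (`iter_num` is the loop counter in train.py; with fp16 the gradient check only makes sense after `scaler.unscale_(optimizer)`, since scaled gradients can legitimately overflow and GradScaler just skips that step):

```python
import torch

def check_finite(loss: torch.Tensor, model: torch.nn.Module, iter_num: int) -> None:
    """Raise on the first non-finite loss or gradient so the blow-up iteration gets logged."""
    if not torch.isfinite(loss):
        raise RuntimeError(f"loss is {loss.item()} at iter {iter_num}")
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            raise RuntimeError(f"non-finite grad in {name} at iter {iter_num}")
```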
I have also tinkered with other settings such as gradient clipping, learning rate, etc. I keep my configuration at roughly a 500k batch size.
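By ~500k batch size I mean tokens per optimizer step; with the stock GPT-2 config the arithmetic works out roughly like this (sketch using the default numbers from config/train_gpt2.py, which may differ from my exact overrides):

```python
# Effective tokens per optimizer step in nanoGPT (sketch, default train_gpt2.py numbers)
batch_size = 12                      # sequences per GPU per micro-step
block_size = 1024                    # context length in tokens
gradient_accumulation_steps = 5 * 8  # micro-steps accumulated (total across 8 GPUs)

tokens_per_step = batch_size * block_size * gradient_accumulation_steps
print(tokens_per_step)  # 491520, i.e. ~0.5M tokens
```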
I am lost on what to try next. Did anyone else fix this issue?
I have gotten GPT-2 Small down to about 3.0 loss.
My loss also exploded on an initial attempt on an 8×A100 instance. I disabled flash attention and that seems to have fixed it, though training is much slower.
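For reference, I turned it off by hard-coding the flag in model.py rather than through a config option, roughly like this (sketch from memory of `CausalSelfAttention.__init__`; there is no config flag for it as far as I can tell):

```python
# model.py, CausalSelfAttention.__init__ (sketch)
# original: self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
self.flash = False  # force the slow, manual attention path regardless of PyTorch version
```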