karpathy / llm.c

LLM training in simple, raw C/CUDA
MIT License

inf loss at big batch #263

Open karpathy opened 4 months ago

karpathy commented 4 months ago

just creating a todo. large batch sizes now work after fixing the size_t bug:

./train_gpt2cu -b 36 -v 200 -s 200 -i data/TinyStories

works, but a batch size of 48, which should also fit in memory, does not:

./train_gpt2cu -b 48 -v 200 -s 200 -i data/TinyStories

the val loss is -nan and the train loss stays at inf.

todo: track down why this happens and how to prevent it

ngc92 commented 3 months ago

@karpathy just wanted to check, we've fixed this, right?