karpathy / llm.c

LLM training in simple, raw C/CUDA
MIT License

Getting "Floating point exception (core dumped)" Error #687

Open alvins82 opened 1 month ago

alvins82 commented 1 month ago

Playing around with re-creating GPT-2 from the thread. When I run training I get the error in the title. Screenshots below.

[Screenshots attached: 2024-07-15 at 8:18:51 pm, 8:20:14 pm, and 8:25:58 pm]
diegoasua commented 1 month ago

No idea what is going on, but maybe try compiling without cuDNN: make train_gpt2cu USE_CUDNN=0. Probably not the cause, but just to check. Also run the tests to see if they pass:

# fp32 test (cudnn not supported)
make test_gpt2cu PRECISION=FP32 && ./test_gpt2cu
# mixed precision cudnn test
make test_gpt2cu USE_CUDNN=1 && ./test_gpt2cu
alvins82 commented 1 month ago

> No idea what is going on, but maybe try compiling without cuDNN: make train_gpt2cu USE_CUDNN=0. Probably not the cause, but just to check. Also run the tests to see if they pass:
>
> # fp32 test (cudnn not supported)
> make test_gpt2cu PRECISION=FP32 && ./test_gpt2cu
> # mixed precision cudnn test
> make test_gpt2cu USE_CUDNN=1 && ./test_gpt2cu

Both of the tests pass. I also put a screenshot of my torch versions above.

alvins82 commented 1 month ago

> No idea what is going on, but maybe try compiling without cuDNN: make train_gpt2cu USE_CUDNN=0. Probably not the cause, but just to check. Also run the tests to see if they pass:
>
> # fp32 test (cudnn not supported)
> make test_gpt2cu PRECISION=FP32 && ./test_gpt2cu
> # mixed precision cudnn test
> make test_gpt2cu USE_CUDNN=1 && ./test_gpt2cu

[Screenshot attached: 2024-07-15 at 8:29:06 pm]
gordicaleksa commented 1 month ago

Eyeballing your command line, I'd say your batch size is too small and is causing an exception in the HellaSwag eval. This is a known issue, and we have a patch merged into master that forces you to use a batch size >= 4.
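For context, this matches the symptom: on Linux, integer division by zero raises SIGFPE, which the shell reports as "Floating point exception (core dumped)". HellaSwag scores each example by laying its 4 candidate endings across 4 batch rows, so with a batch size below 4 a "how many examples fit per step" count can come out zero and a later divide traps. A minimal sketch of that failure mode (not llm.c's actual code; the variable names and the 10042 example count are illustrative):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    // hypothetical stand-in for the batch-size flag; run e.g. ./a.out 1
    int B = (argc > 1) ? atoi(argv[1]) : 1;
    // each HellaSwag example occupies 4 batch rows, one per candidate ending
    int examples_per_step = B / 4;            // 0 when B < 4
    // dividing by that count traps: integer division by zero raises SIGFPE,
    // printed by the shell as "Floating point exception (core dumped)"
    int steps = 10042 / examples_per_step;
    printf("steps = %d\n", steps);
    return 0;
}

If that is what is happening here, re-running training with a batch size of at least 4 on the command line should sidestep it, and pulling the latest master should give you the explicit check instead of the crash.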