karpathy / llm.c

LLM training in simple, raw C/CUDA
MIT License

Stricter FP32 tests #614

Closed gordicaleksa closed 1 week ago

gordicaleksa commented 1 week ago

Regarding the grad tensors: back when Andrej hardcoded the thresholds, we had a bug in PyTorch that led to a bigger discrepancy between our PyTorch and C code. Now that that's fixed, we can be really strict and use 1e-6f here.

rosslwheeler commented 1 week ago

@gordicaleksa - this appears to be causing a failure when running:

make testgpt2_cu USE_CUDNN=1 && ./testgpt2_cu

It may or may not reproduce in your environment, but I am able to see it here.

@karpathy - I think this is what was causing the issue. I ran my tests one commit back from this in your repo and they pass consistently; if I check out this commit, the failures start. I'm not 100 percent sure, but it makes sense: this is the test that's failing, and there haven't been many other recent changes to this file.

Can you confirm since you were seeing the failure consistently too? Thank you.

gordicaleksa commented 1 week ago

It's certainly this PR, and it's a shame our CI didn't catch it! See https://github.com/karpathy/llm.c/pull/615 for a fix.