gordicaleksa closed this 1 week ago
@gordicaleksa - this appears to be causing a failure when running:
make testgpt2_cu USE_CUDNN=1 && ./testgpt2_cu
You may or may not see it in your environment, but I can reproduce it here.
@karpathy - I think this is what was causing the issue. I ran my tests one commit back from this in your repo and they pass consistently. If I check out this commit, the failures start. I'm not 100 percent sure, but it makes sense, since this is the test that's failing and there haven't been many other recent changes to this file.
Can you confirm, since you were seeing the failure consistently too? Thank you.
It's certainly this PR - sad our CI didn't catch this! See https://github.com/karpathy/llm.c/pull/615 for a fix.
Regarding grad tensors: back when Andrej hardcoded the thresholds, we had a bug in PyTorch that led to a bigger discrepancy between our PyTorch and C code. Now that that's fixed, we can be really strict and use 1e-6f here.
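For anyone following along, here is a rough sketch of what a strict absolute-tolerance check on a grad tensor could look like. This is not the actual check in the test file: `tensors_match` and its parameters are hypothetical names made up for illustration, and only the `1e-6f` threshold comes from the comment above.

```c
#include <math.h>
#include <stdio.h>

// Hypothetical helper (not the repo's actual function): returns 1 if every
// element of `actual` is within `tol` of `expected` in absolute difference,
// and 0 otherwise. A NaN in either input counts as a mismatch.
int tensors_match(const float *actual, const float *expected, int n,
                  float tol, const char *label) {
    float max_diff = 0.0f;
    int ok = 1;
    for (int i = 0; i < n; i++) {
        float diff = fabsf(actual[i] - expected[i]);
        if (diff > max_diff) { max_diff = diff; }
        // Written as !(diff <= tol) so that a NaN diff fails the check.
        if (!(diff <= tol)) { ok = 0; }
    }
    printf("%s: %s (max abs diff %e, tol %e)\n",
           label, ok ? "OK" : "NOT OK", max_diff, tol);
    return ok;
}
```

Calling it as `tensors_match(grads_c, grads_pt, n, 1e-6f, "wte")` (array names assumed) would flag any element that drifts by more than 1e-6 from the reference. An absolute threshold is the simplest choice; a relative one may suit tensors with large magnitudes better.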