karpathy / nn-zero-to-hero

Neural Networks: Zero to Hero
MIT License

Numerical instability in Google Colab - Part 4 of Makemore #13

Open sachag678 opened 1 year ago

sachag678 commented 1 year ago

I ran into an interesting issue in makemore part 4 (becoming a backprop ninja) where `dhpreact` was not exactly matching `hpreact.grad`.

However, this only happens in the Colab notebook; when I put the same code into a local Jupyter notebook, it works fine.

Not sure why this would be the case, but it's an odd curiosity.
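
For context, the lecture compares each hand-derived gradient against autograd with a small `cmp` helper, roughly like this (a sketch of the notebook's helper, reporting exact equality, `allclose` equality, and the max absolute difference):

```python
import torch

def cmp(s, dt, t):
    # exact: bit-for-bit equality; approximate: equality within float tolerance
    ex = torch.all(dt == t.grad).item()
    app = torch.allclose(dt, t.grad)
    maxdiff = (dt - t.grad).abs().max().item()
    print(f'{s:15s} | exact: {str(ex):5s} | approximate: {str(app):5s} | maxdiff: {maxdiff}')
```

The issue here is that `exact` comes out `False` in Colab but `True` locally, even though `approximate` is `True` in both.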

karpathy commented 1 year ago

oh oh

sachag678 commented 1 year ago

I'm guessing it has something to do with the Python versions?

JonathanSum commented 1 year ago

Yes. I have the issue in Colab, but I don't have it in the local VS Code Jupyter notebook. The local Jupyter notebook runs Python 3.7.13; the tested Colab notebook runs Python 3.7.14 (default, Sep 8 2022, 00:06:44) [GCC 7.5.0].

If the diff is this small, maybe it is fine to accept it within some tolerance? Tested Colab notebook: https://colab.research.google.com/drive/1HmZ8bgtAfvyMaZyu3Sr1Bgxsj35jitTs?usp=sharing

Maybe the issue is the PyTorch version?
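
To check whether the two environments actually match (a minimal sketch; nothing below is specific to either notebook), one can print the interpreter and library versions in both places and diff the output:

```python
import sys
import torch

# Run this in both Colab and the local notebook and compare the output.
print(sys.version)        # Python version and build info
print(torch.__version__)  # PyTorch version
```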

JonathanSum commented 1 year ago

I used `t.grad.sum()` and `dt.sum()` to compare the sums between Colab and the local notebook (attachments: colab.txt, local.txt).

I posted it on the PyTorch forum and got no answer: https://discuss.pytorch.org/t/numerical-instability-in-google-colab/163610. I am planning to post it on the Colab GitHub issues.
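
For reference, a self-contained sketch of that kind of logging (the tensors below are toy stand-ins for a hand-derived gradient `dt` and its autograd counterpart `t.grad`; the file name is arbitrary):

```python
import torch

# Toy stand-ins: t.grad via autograd, dt derived by hand for loss = (t * 2).sum()
t = torch.randn(32, 64, requires_grad=True)
(t * 2).sum().backward()
dt = torch.full_like(t, 2.0)  # d/dt of (t * 2).sum() is 2 everywhere

# Log the sums so the file can be diffed across Colab and a local run.
with open('grad_sums.txt', 'w') as f:
    f.write(f'dt.sum()={dt.sum().item():.12f}\n')
    f.write(f't.grad.sum()={t.grad.sum().item():.12f}\n')
```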

mriganktiwari commented 1 year ago

> Yes. I have the issue in Colab, but I don't have it in the local VS Code Jupyter notebook. [...] Maybe the issue is the PyTorch version?

I am getting exactly the same maxdiff for `hpreact`, and my notebook is running on a local machine with Python 3.9.13 and `torch.__version__` of '1.12.1'.

evgenyfadeev commented 1 year ago

I've got a strange observation (using the Colab version):

`dlogit_maxes = -dnorm_logits.sum(dim=1, keepdim=True)` gives me exact equality, while `dlogit_maxes = -dnorm_logits.sum(dim=1)` gives approximate equality with a maxdiff of ~1e-8.

In this example, if the shapes of the gradients are not equal but the comparison is made after broadcasting (I guess), there is a residual difference; otherwise the values match exactly. It might come down to the accuracy limits of floating-point operations: the values are float32, and 1e-8 is close to the precision limit of float32 operations.
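
A minimal sketch of that broadcasting effect (the shapes assume the lecture's batch of 32, and the tiny values mimic `dlogit_maxes`, which is ~1e-9 everywhere; these are stand-ins, not the actual notebook tensors):

```python
import torch

# Stand-in for logit_maxes.grad, shape (32, 1), near-zero values
t_grad = torch.randn(32, 1) * 1e-9

dt_keep = t_grad.clone()      # sum(dim=1, keepdim=True) keeps shape (32, 1)
dt_flat = t_grad.squeeze(1)   # sum(dim=1) drops the dim -> shape (32,)

# (dt_keep - t_grad) is elementwise; (dt_flat - t_grad) broadcasts (32,)
# against (32, 1) into a (32, 32) matrix, comparing every row with every other.
print((dt_keep - t_grad).abs().max().item())  # exactly 0.0
print((dt_flat - t_grad).abs().max().item())  # small but nonzero, like the ~1e-8 maxdiff
```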

I've made a PR that extends the `cmp` function to also report a comparison of shapes, which could probably be useful: https://github.com/karpathy/nn-zero-to-hero/pull/36

Another thing that may matter is the order of the arithmetic operations. Apparently addition and multiplication of floats are not associative: https://pytorch.org/docs/stable/notes/numerical_accuracy.html
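
A quick way to see the non-associativity (plain Python floats, i.e. float64, but the same holds for float32):

```python
a, b, c = 0.1, 0.2, 0.3

# Mathematically (a + b) + c == a + (b + c), but rounding after each
# operation makes the two evaluation orders disagree in the last bits.
print((a + b) + c)                 # 0.6000000000000001
print(a + (b + c))                 # 0.6
print((a + b) + c == a + (b + c))  # False
```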

Also, the docs say that results may be inconsistent across devices and even across builds of the software.

vdyma commented 2 months ago

I had the same mismatch between gradients when running locally, because I was storing tensors and doing the computations on the GPU. Once I switched to the CPU, the remaining differences in the later computations came from the ordering of operations. I managed to get exact gradients by running on the CPU and reordering my computations to match the lecture.
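
To illustrate the ordering effect (a minimal sketch with arbitrary tensors; whether the last bits actually differ depends on the backend and reduction strategy):

```python
import torch

torch.manual_seed(0)
x = torch.randn(100000, dtype=torch.float32)

s_fwd = x.sum()                               # one accumulation order
s_perm = x[torch.randperm(x.numel())].sum()   # same values, different order

# The two sums often differ in the last bits; neither result is "wrong",
# which is why a reordered but algebraically identical backward pass
# can miss bit-exact equality with autograd.
print(s_fwd.item(), s_perm.item(), (s_fwd - s_perm).item())
```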