
Multiplying a very large CUDA tensor with another tensor yields unexpected result #24016

Open dschaehi opened 5 years ago

dschaehi commented 5 years ago

🐛 Bug

Multiplying a very large CUDA tensor with another tensor yields unexpected result.

To Reproduce

Steps to reproduce the behavior:

  1. Generate the following random matrices

    A = torch.randn((11111111, 20), device=torch.device("cuda"))
    B = torch.randn((20, 2), device=torch.device("cuda"))
  2. Then (A @ B)[8807984:] should equal A[8807984:] @ B, but it does not!

Minimal example:

import torch

A = torch.randn((11111111, 20), device=torch.device("cuda"))
B = torch.randn((20, 2), device=torch.device("cuda"))
print((A @ B)[8807984:].equal(A[8807984:] @ B))

returns False

Expected behavior

A = torch.randn((11111111, 20), device=torch.device("cuda"))
B = torch.randn((20, 2), device=torch.device("cuda"))
print((A @ B)[8807984:].equal(A[8807984:] @ B))

Should return True

Environment

PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 18.04.2 LTS
GCC version: (Homebrew gcc 5.5.0_4) 5.5.0
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration: GPU 0: GeForce GTX 1080 Ti
Nvidia driver version: 390.116
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.17.0
[pip] pytorch-lightning==0.3.6.9
[pip] torch==1.1.0
[pip] torchkge==0.10.3
[pip] torchvision==0.3.0
[conda] blas 2.10 mkl conda-forge
[conda] libblas 3.8.0 10_mkl conda-forge
[conda] libcblas 3.8.0 10_mkl conda-forge
[conda] liblapack 3.8.0 10_mkl conda-forge
[conda] liblapacke 3.8.0 10_mkl conda-forge
[conda] mkl 2019.4 243
[conda] pytorch 1.1.0 py3.7_cuda9.0.176_cudnn7.5.1_0 pytorch
[conda] pytorch-lightning 0.3.6.9 pypi_0 pypi
[conda] torchkge 0.10.3 dev_0
[conda] torchvision 0.3.0 py37_cu9.0.176_1 pytorch

cc @ezyang @gchanan @zou3519

rmcavoy commented 5 years ago

It also fails with PyTorch on CPU and with NumPy, so it isn't a bug in PyTorch. I imagine it is just floating point roundoff error, since many numbers are multiplied together and then summed.

Both

A = torch.randn((11111111, 20))
B = torch.randn((20, 2))
print((A @ B)[8807984:].equal(A[8807984:] @ B))

and

import numpy as np

A = np.random.randn(11111111, 20)
B = np.random.randn(20, 2)
print(np.array_equal((A @ B)[8807984:], A[8807984:] @ B))

return False.
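
As an aside (not something anyone in the thread ran): exact equality via .equal / array_equal is brittle for float results, and a tolerance-based comparison is the usual check. The tolerances below are illustrative only.

import numpy as np
import torch

A = torch.randn((11111111, 20))
B = torch.randn((20, 2))
# allclose tolerates small rounding differences instead of demanding bitwise equality
print(torch.allclose((A @ B)[8807984:], A[8807984:] @ B, rtol=1e-4, atol=1e-6))

An = np.random.randn(11111111, 20)
Bn = np.random.randn(20, 2)
print(np.allclose((An @ Bn)[8807984:], An[8807984:] @ Bn))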

dschaehi commented 5 years ago

I get True in both cases.

Chillee commented 5 years ago

@dschaehi I get True for the first one and False for the second one, which lends more credence to the idea that it's a floating point roundoff error.

For numpy, if I print out the difference, I get something on the order of 1e-16.

>>> A = np.random.randn(11111111, 20)
>>> B = np.random.randn(20, 2)
>>> np.array_equal((A @ B)[8807984:], A[8807984:] @ B)
False
>>> np.max((A @ B)[8807984:] - A[8807984:] @ B)
9.992007221626409e-16
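
For scale (an aside, not from the thread): float64 machine epsilon is about 2.2e-16, so a maximum discrepancy around 1e-16 is roughly one rounding step on values of this magnitude.

import numpy as np
print(np.finfo(np.float64).eps)  # 2.220446049250313e-16
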
dschaehi commented 5 years ago

Hmm, can you check what you get if you run the following?

A = torch.randn((11111111, 20), device=torch.device("cuda"))
B = torch.randn((20, 2), device=torch.device("cuda"))
print(((A @ B)[8807984:] - A[8807984:] @ B).abs().max())

The expected return value is 0.0, but I get 27.1736, which is really large.

-- I found this issue while running a neural network model on a validation set with a large batch size: the model gave a completely different validation result than it did with a smaller batch size. Since this could cause serious problems for deep learning practitioners, shouldn't it be mentioned somewhere in the PyTorch documentation?
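
One way to tell ordinary rounding error from a genuinely wrong result (a sketch, not something from the thread; shapes copied from above, chunk size arbitrary) is to compare both the full matmul and a chunked, smaller-batch version against a float64 reference:

import torch

A = torch.randn((11111111, 20), device=torch.device("cuda"))
B = torch.randn((20, 2), device=torch.device("cuda"))

ref = A.double() @ B.double()                                    # float64 reference
full = (A @ B).double()                                          # one large float32 matmul
chunked = torch.cat([a @ B for a in A.split(1000000)]).double()  # simulates smaller batches

print((full - ref).abs().max())     # a few 1e-6 suggests rounding; tens suggest a real bug
print((chunked - ref).abs().max())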

rmcavoy commented 5 years ago

I got tensor(7.6294e-06, device='cuda:0'), which is on the order of single-precision machine epsilon.
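
For reference (an aside), float32 machine epsilon is roughly 1.2e-7, so a difference of ~7.6e-6 on outputs whose magnitude is a few units is only a handful of ulps:

import torch
print(torch.finfo(torch.float32).eps)  # 1.1920929e-07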

dschaehi commented 5 years ago

@rmcavoy Isn't it still weird that A's row size causes a floating point problem (and not A's column size), considering that (A @ B)[i, j] = A[i, :] @ B[:, j]? Perhaps one needs to understand what happens under the hood of the optimized matrix multiplication algorithms...
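
One plausible reason for small discrepancies of this kind (an illustration, not a claim about the actual cuBLAS kernels): the number of rows influences how the kernel tiles and parallelizes the work, which can change the order in which the 20 products per output element are accumulated, and float32 addition is not associative. A small sketch of that order sensitivity:

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(20).astype(np.float32)
b = rng.standard_normal(20).astype(np.float32)
prods = a * b

forward = np.float32(0.0)
for p in prods:             # accumulate left to right
    forward += p
reverse = np.float32(0.0)
for p in prods[::-1]:       # accumulate right to left
    reverse += p

print(forward, reverse, forward - reverse)  # usually differs in the last bits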

dschaehi commented 5 years ago

After upgrading my NVIDIA CUDA toolkit to version 10.1 and PyTorch to 1.2.0, re-running the snippet above gives tensor(4.7684e-06, device='cuda:0'). So the error drops back to the level of ordinary rounding after upgrading the CUDA toolkit and PyTorch.

ezyang commented 5 years ago

This is probably the same thing as #22078

gchanan commented 4 years ago

I don't think this is high priority; the linked issue shows that newer versions of PyTorch will warn if you are running under the conditions that trigger the bug. If you want to argue that we should warn and use a slower path, I could buy that.