pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

torch.norm produces incorrect results #20551

Status: Open. Opened by ecvgit 5 years ago

ecvgit commented 5 years ago

🐛 Bug

torch.norm gives incorrect results on CPU in the latest nightly build as well as in 1.1.0 stable.

To Reproduce


>>> import torch
>>> a=torch.rand(2000,2000,64)
>>> b=torch.norm(a)
>>> c=torch.norm(a.cuda())
>>> b,c
(tensor(5792.6187), tensor(9237.8311, device='cuda:0'))

Expected behavior

Both b and c should have the same values.
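
A quick way to confirm which of the two values is accurate (and a possible workaround until the CPU kernel is fixed) is to redo the CPU reduction in float64; it comes out close to the CUDA result. A minimal sketch, not part of the original report:

import torch

a = torch.rand(2000, 2000, 64)
print(torch.norm(a))           # float32 CPU reduction, ~5792 on affected builds
print(torch.norm(a.double()))  # float64 CPU reduction, ~9237, close to the CUDA value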

Environment

PyTorch version: 1.1.0.dev20190514
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Red Hat Enterprise Linux Server release 7.4 (Maipo)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
CMake version: version 2.8.12.2

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla K40m
GPU 1: Tesla K40m

Nvidia driver version: 387.26
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.1
[pip3] numpy==1.14.3
[pip3] torch==0.4.0
[pip3] torchtext==0.2.3
[pip3] torchvision==0.2.1
[conda] blas 1.0 mkl
[conda] mkl 2018.0.2 1
[conda] mkl_fft 1.0.1 py36h3010b51_0
[conda] mkl_random 1.0.1 py36h629b387_0
[conda] pytorch-nightly 1.1.0.dev20190514 py3.6_cuda9.0.176_cudnn7.5.1_0 pytorch
[conda] torchtext 0.2.3

Additional context

soumith commented 5 years ago

I think this is a float precision issue.

In float64, it seems to be working fine:

In [1]: import torch

In [2]: a=torch.rand(2000,2000,64)
In [3]: a[0][0][0]
Out[3]: tensor(0.4834)

In [4]: a=torch.rand(2000,2000,64, dtype=torch.float64)

In [5]: b=torch.norm(a)

In [6]: c=torch.norm(a.cuda())

In [7]: b, c
Out[7]:
(tensor(9237.5918, dtype=torch.float64),
 tensor(9237.5918, device='cuda:0', dtype=torch.float64))
soumith commented 5 years ago

cc: @umanwizard to double-check. I think you made norm TensorIterator-compatible, if I remember correctly (sorry if it wasn't you).

umanwizard commented 5 years ago

No, it was done by @jjsjann123 in #15414

ecvgit commented 5 years ago

The code above gives correct results in PyTorch 0.4, but not in PyTorch 1.1.0. If it is a float precision issue, I'm not sure why it works correctly in PyTorch 0.4.

colesbury commented 5 years ago

> If it is a float precision issue, I'm not sure why it works correctly in PyTorch 0.4.

PyTorch 0.4 did the accumulation using double: https://github.com/pytorch/pytorch/blob/v0.4.1/aten/src/TH/generic/THTensorMath.cpp#L4307

Now it's using float accumulation: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp#L57

CUDA uses float accumulation, but is saved because the necessary parallelism forces a form of pairwise summation. We should probably do the same thing for CPU.
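
The plateau is easy to see in isolation. The reported CPU value of 5792.6 corresponds to a sum of squares of about 3.4e7, far below the true value of roughly 8.5e7: once the running float32 sum is that large, a typical squared element (below 1.0) falls under half a ULP of the accumulator and is rounded away entirely. A minimal illustration of the rounding effect, not the kernel code:

import torch

acc = torch.tensor(2.0 ** 25)     # a float32 running sum of squares this large...
addend = torch.tensor(0.33)       # ...can no longer absorb a typical squared element
print(acc + addend == acc)        # tensor(True): the addend is rounded away in float32
print(acc.double() + addend.double() == acc.double())   # tensor(False) with double accumulation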

jjsjann123 commented 5 years ago

As Sam mentioned, numerical behavior differs quite a bit between CPU and GPU because of the level of parallelism. A sacrifice like this (using double instead of float) is necessary on the CPU, but it is not needed in the CUDA code and would hurt performance there. We could probably plumb some logic through so that each device uses a different accumulation type, but the code shared between the CPU and GPU kernels is not significant; it might be easier to just keep the CPU and GPU kernels separate.

colesbury commented 5 years ago

@jjsjann123 I was suggesting something different: that we use pairwise summation on the CPU in reduction kernels -- not that we switch to double accumulation.
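
For concreteness, here is what pairwise (divide-and-conquer) summation looks like, sketched at the Python level rather than in the ATen reduction kernel where an actual fix would live; the pairwise_sum helper and its leaf size are illustrative choices, not PyTorch API. Because at every level the two operands being added are of comparable magnitude, the worst-case rounding error grows roughly with log n instead of n.

import torch

def pairwise_sum(x, leaf=65536):
    # Reduce small blocks directly; split larger inputs in half and
    # combine the two partial sums.
    if x.numel() <= leaf:
        return x.sum()
    mid = x.numel() // 2
    return pairwise_sum(x[:mid], leaf) + pairwise_sum(x[mid:], leaf)

a = torch.rand(2000 * 2000 * 64)
print(pairwise_sum(a * a).sqrt())   # stays close to the float64 reference (~9237)
print(torch.norm(a))                # the flat float32 reduction, ~5792 on affected builds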

mruberry commented 4 years ago

This is still an issue.

foreverlms commented 6 months ago

This still exists. Could anyone take a look, or change the ATen operator in the torch CPU backend?