pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Inconsistent nn.KLDivLoss behavior: 0s in target OK on cpu, but gives nan on mps #98269

Open sidsrinivasan opened 1 year ago

sidsrinivasan commented 1 year ago

🐛 Describe the bug

nn.KLDivLoss takes log-probabilities as input and a probability distribution as target, and this target sometimes contains 0s. These are handled correctly when device='cpu', but with device='mps' the loss is nan. The current workaround is to add a small eps to the target.


import torch
from torch import nn
from torch.nn.functional import log_softmax

torch.manual_seed(1)

# input: log-probabilities
x = torch.rand(10, 8, device='mps')
x = x / x.sum(dim=1, keepdim=True)
x = log_softmax(x, dim=-1)

# target: probabilities (each row sums to 1)
y = torch.rand(10, 8, device='mps')
y = y / y.sum(dim=1, keepdim=True)

criterion = nn.KLDivLoss(reduction="sum")
print(criterion(x, y), criterion(x.to('cpu'), y.to('cpu')))

# mask out random entries of y so that the target contains 0s
mask = torch.rand(10, 8, device='mps') < 0.5
y = y * mask
print(criterion(x, y), criterion(x.to('cpu'), y.to('cpu')))

Outputs:

tensor(1.6974, device='mps:0') tensor(1.6974)
tensor(nan, device='mps:0') tensor(1.0370)
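
For reference, here is a minimal pure-Python sketch of the sum-reduced loss, assuming the pointwise term y * (log y - x) and the convention 0 * log 0 = 0, which is consistent with the finite CPU result above. The helper names are hypothetical. The naive variant only shows one way a nan could arise in floating-point arithmetic; whether the MPS kernel actually computes it this way is an assumption, not something confirmed in this report.

```python
import math


def kl_term_naive(log_q, p):
    """p * (log p - log_q) with log(0) modelled as -inf (illustration only)."""
    log_p = math.log(p) if p > 0 else float("-inf")
    return p * (log_p - log_q)  # 0.0 * -inf evaluates to nan


def kl_term(log_q, p):
    """Same term, but a zero target contributes exactly 0."""
    return p * (math.log(p) - log_q) if p > 0 else 0.0


def kl_div_sum(log_probs, target):
    """Sum-reduced KL divergence over paired log-probabilities and targets."""
    return sum(kl_term(lq, p) for lq, p in zip(log_probs, target))


log_probs = [math.log(0.5), math.log(0.5)]
target = [1.0, 0.0]
print(kl_term_naive(log_probs[1], target[1]))  # nan from the zero target
print(kl_div_sum(log_probs, target))           # finite
```

With the zero-safe term, masked-out target entries simply drop out of the sum, which is the behavior the CPU path shows.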

Versions


PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.2.1 (x86_64)
GCC version: Could not collect
Clang version: 14.0.0 (clang-1400.0.29.202)
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.10 (main, Mar 21 2023, 13:41:39) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-10.16-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Pro

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==2.0.0
[pip3] torchaudio==2.0.0
[pip3] torchvision==0.15.0
[conda] blas                      1.0                         mkl  
[conda] ffmpeg                    4.3                  h0a44026_0    pytorch
[conda] mkl                       2021.4.0           hecd8cb5_637  
[conda] mkl-service               2.4.0           py310hca72f7f_0  
[conda] mkl_fft                   1.3.1           py310hf879493_0  
[conda] mkl_random                1.2.2           py310hc081a56_0  
[conda] numpy                     1.23.5          py310h9638375_0  
[conda] numpy-base                1.23.5          py310ha98c3c9_0  
[conda] pytorch                   2.0.0                  py3.10_0    pytorch
[conda] torchaudio                2.0.0                 py310_cpu    pytorch
[conda] torchvision               0.15.0                py310_cpu    pytorch

cc @kulinseth @albanD @malfet @DenisVieriu97 @razarmehr @abhudev

pat749 commented 1 year ago

Yes, this seems to be a known issue with KLDivLoss on the MPS device when the target contains 0s. A possible workaround, as you already mentioned, is to add a small epsilon value to the target to avoid taking the log of 0.
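
The epsilon workaround could look roughly like the sketch below. The helper name `kl_div_sum_with_eps` and the default `eps=1e-8` are assumptions for illustration, not part of PyTorch. The script falls back to the CPU when MPS is unavailable and prints the unmodified CPU result for comparison.

```python
import torch
from torch import nn


def kl_div_sum_with_eps(log_probs, target, eps=1e-8):
    """Hypothetical helper: add a small eps to the target before KLDivLoss."""
    criterion = nn.KLDivLoss(reduction="sum")
    return criterion(log_probs, target + eps)


# Use MPS when present, otherwise run the same code on the CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"

torch.manual_seed(1)
log_probs = torch.log_softmax(torch.rand(10, 8), dim=-1).to(device)

target = torch.rand(10, 8)
target = target / target.sum(dim=1, keepdim=True)
target = (target * (torch.rand(10, 8) < 0.5)).to(device)  # introduce 0s

loss = kl_div_sum_with_eps(log_probs, target)
reference = nn.KLDivLoss(reduction="sum")(log_probs.cpu(), target.cpu())
print(loss.item(), reference.item())
```

Adding eps perturbs the loss slightly, so the two printed values should agree only approximately.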

qqaatw commented 1 year ago

I can't reproduce this on an M1 Mac with macOS 12.6. It is probably a bug that only affects Intel Macs.

sidsrinivasan commented 1 year ago

This was on an M1 Mac, but with macOS 13.2.1, so maybe the macOS version is the cause?

qqaatw commented 1 year ago

Hmm, your environment information shows Python platform: macOS-10.16-x86_64-i386-64bit.

It seems that you're running Python using Rosetta 2. Maybe the macOS version is related as well, but can you try running Python natively without Rosetta 2 and see if the problem persists?
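
One way to check this from inside the interpreter is sketched below. It assumes that `platform.machine()` reports `x86_64` for a translated process and that the macOS `sysctl.proc_translated` key reads 1 under Rosetta 2. The helper name `running_under_rosetta` is hypothetical.

```python
import platform
import subprocess


def running_under_rosetta():
    """Return True or False on macOS, or None if it cannot be determined."""
    if platform.system() != "Darwin":
        return None
    try:
        out = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True,
            text=True,
            check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # The key may be missing, for example on Intel hardware.
        return None
    return out == "1"


print("machine:", platform.machine())
print("platform:", platform.platform())
print("rosetta:", running_under_rosetta())
```

A native Apple Silicon interpreter would be expected to report `arm64` as the machine, as in the `macOS-13.4-arm64-arm-64bit` platform string later in this thread.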

htdai commented 1 year ago

I hope this information is helpful: I can exactly reproduce this on my Mac M1 with macOS 13.4, running Python natively.

Versions

PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.4 (arm64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: Could not collect
Libc version: N/A

Python version: 3.11.3 (main, Apr 19 2023, 18:49:55) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-13.4-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Max

Versions of relevant libraries:
[pip3] numpy==1.25.0
[pip3] torch==2.0.1
[pip3] torchaudio==2.0.2
[pip3] torchdata==0.6.1
[pip3] torchtext==0.15.2
[pip3] torchvision==0.15.2
[conda] numpy                     1.25.0          py311he598dae_0  
[conda] numpy-base                1.25.0          py311hfbfe69c_0  
[conda] pytorch                   2.0.1                  py3.11_0    pytorch
[conda] torchaudio                2.0.2                 py311_cpu    pytorch
[conda] torchdata                 0.6.1                     py311    pytorch
[conda] torchtext                 0.15.2                    py311    pytorch
[conda] torchvision               0.15.2                py311_cpu    pytorch

sidsrinivasan commented 10 months ago

OK, it looks like the issue does not reproduce on macOS 14.2.1, whether Python runs natively or under Rosetta 2 -- there is no bug on that version.

It looks like this issue only occurs on macOS 13 (both when running Python natively and using Rosetta 2)?