Open sidsrinivasan opened 1 year ago
Yes, this seems to be a known issue with KLDivLoss on the MPS device when the target contains 0s. A possible workaround, which you already mentioned, is to add a small epsilon value to the target to avoid taking the log of 0.
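For reference, the epsilon workaround could look roughly like the sketch below. The helper name `kl_div_with_eps`, the value `eps=1e-8`, and the example tensors are illustrative assumptions, not taken from the report. Clamping means the target no longer sums exactly to 1, which should be negligible for a small enough `eps`.

```python
import torch
import torch.nn.functional as F


def kl_div_with_eps(log_probs: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KL divergence with target entries clamped away from exact zeros.

    Hypothetical helper: `eps` is an illustrative default, not a value from the thread.
    """
    # Raise every target entry to at least `eps` so no entry is exactly 0
    # when the loss takes the log of the target.
    safe_target = target.clamp_min(eps)
    return F.kl_div(log_probs, safe_target, reduction="batchmean")


# Use MPS when it is available, otherwise fall back to the CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"

log_probs = torch.log_softmax(torch.randn(4, 5, device=device), dim=-1)
target = torch.tensor([[0.0, 0.5, 0.5, 0.0, 0.0]] * 4, device=device)  # contains exact zeros

loss = kl_div_with_eps(log_probs, target)
print(loss)
```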
Can't reproduce this on Mac M1 with macOS 12.6. Probably a bug coming from Intel Mac only.
This was on Mac M1 but with macOS 13.2.1, so maybe it's the macOS version?
Hmm, your environment information shows Python platform: macOS-10.16-x86_64-i386-64bit.
It seems that you're running Python using Rosetta 2. Maybe the macOS version is related as well, but can you try running Python natively without Rosetta 2 and see if the problem persists?
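A quick way to check which architecture the interpreter itself reports is sketched below, using only the standard `platform` module. The helper name `interpreter_arch` is made up for illustration. On Apple silicon hardware, `arm64` suggests a native interpreter, while `x86_64` suggests one running under Rosetta 2.

```python
import platform


def interpreter_arch() -> dict:
    """Report the architecture and platform string seen by the running Python."""
    return {
        "machine": platform.machine(),    # e.g. "arm64" or "x86_64"
        "platform": platform.platform(),  # same kind of string as "Python platform" in the environment report
    }


info = interpreter_arch()
print(info)
```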
I hope this information is helpful: I can exactly reproduce this on my Mac M1 with macOS 13.4, running Python natively.
Versions
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 13.4 (arm64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: Could not collect
Libc version: N/A
Python version: 3.11.3 (main, Apr 19 2023, 18:49:55) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-13.4-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Apple M1 Max
Versions of relevant libraries:
[pip3] numpy==1.25.0
[pip3] torch==2.0.1
[pip3] torchaudio==2.0.2
[pip3] torchdata==0.6.1
[pip3] torchtext==0.15.2
[pip3] torchvision==0.15.2
[conda] numpy 1.25.0 py311he598dae_0
[conda] numpy-base 1.25.0 py311hfbfe69c_0
[conda] pytorch 2.0.1 py3.11_0 pytorch
[conda] torchaudio 2.0.2 py311_cpu pytorch
[conda] torchdata 0.6.1 py311 pytorch
[conda] torchtext 0.15.2 py311 pytorch
[conda] torchvision 0.15.2 py311_cpu pytorch
OK, it looks like the issue does not replicate on macOS 14.2.1, whether running Python natively or under Rosetta 2 -- there is no bug.
It looks like this issue only occurs on macOS 13 (both when running Python natively and using Rosetta 2)?
🐛 Describe the bug
KLDivLoss takes the log of the target probability distribution, and this target sometimes contains 0s. This is handled correctly when device='cpu', but when device='mps' we get NaNs. The current workaround is to add a small eps to the target.
Outputs:
tensor(1.6974, device='mps:0') tensor(1.6974)
tensor(nan, device='mps:0') tensor(1.0370)
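A minimal sketch of the kind of CPU-versus-MPS comparison described above; it is not the original reproduction script. The helper `kl_on`, the tensor shapes, and the seed are assumptions made for illustration, and the MPS branch only runs where that backend is available.

```python
import torch
import torch.nn.functional as F


def kl_on(device: str, log_probs: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Evaluate kl_div on `device` and return the scalar on the CPU."""
    loss = F.kl_div(log_probs.to(device), target.to(device), reduction="batchmean")
    return loss.cpu()


torch.manual_seed(0)
log_probs = torch.log_softmax(torch.randn(3, 4), dim=-1)
target_dense = torch.softmax(torch.randn(3, 4), dim=-1)   # no exact zeros
target_sparse = torch.tensor([[0.0, 1.0, 0.0, 0.0]] * 3)  # contains exact zeros

results = {}
for name, target in [("dense", target_dense), ("sparse", target_sparse)]:
    results[name] = {"cpu": kl_on("cpu", log_probs, target)}
    # Only compare against MPS when the backend is actually present.
    if torch.backends.mps.is_available():
        results[name]["mps"] = kl_on("mps", log_probs, target)
    print(name, results[name])
```

On an affected setup, the report suggests the `sparse` MPS value would be NaN while the CPU value stays finite.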
Versions