The MPS Backend sometimes samples outside of distribution domain with `multinomial`

🐛 Describe the bug

Using the MPS backend, `torch.multinomial` can return indices outside the domain of the distribution. The snippet below reproduces this:

import torch
import torch.distributions

device = torch.device("mps")
# 10 dimensional distribution, expected max output is 9
violating_dist = torch.tensor([4.3330236804e-04, 1.6706718498e-07, 5.6105983504e-07, 2.5240040486e-05,
        5.4649823142e-05, 5.5108112283e-03, 9.9348586798e-01, 4.5977579077e-08,
        4.8896443332e-04, 3.4132514770e-07], device=device)

sample = torch.multinomial(violating_dist, 100000000, True)
# >> 11, outside domain!
print(torch.max(sample))

# This distribution is the one above with default printing precision
almost_similar_non_violating_dist = torch.tensor([4.3330e-04, 1.6707e-07, 5.6106e-07, 2.5240e-05, 5.4650e-05, 5.5108e-03,
                               9.9349e-01, 4.5978e-08, 4.8896e-04, 3.4133e-07], device=device)
sample = torch.multinomial(almost_similar_non_violating_dist, 100000000, True)
# >> 9
print(torch.max(sample))

# Violating distribution but on cpu
sample = torch.multinomial(violating_dist.cpu(), 100000000, True)
# >> 9
print(torch.max(sample))

So for some reason, on MPS this particular probability tensor sometimes yields a sample of 11, even though there are only 10 categories to sample from (so the maximum index should be 9). It doesn't happen when the same tensor is defined with lower precision (the values shown at default printing precision), nor does it happen with the CPU backend.
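To narrow this down, it may help to know how often out-of-range indices appear and which values they take, not just the maximum. Below is a minimal sketch, assuming the samples have first been copied into a plain Python sequence (for example via `sample.cpu().tolist()`, ideally with a smaller sample count than the repro uses). `out_of_domain_counts` is a hypothetical helper written for this illustration, not part of PyTorch, and the sample values are made up rather than taken from a real run.

```python
from collections import Counter


def out_of_domain_counts(samples, num_categories):
    """Count sampled indices outside the valid range [0, num_categories)."""
    counts = Counter(int(s) for s in samples)
    return {
        idx: n
        for idx, n in sorted(counts.items())
        if idx < 0 or idx >= num_categories
    }


# Made-up values for illustration only; not output from any real run.
fake_samples = [6, 6, 5, 6, 11, 0, 6, 11, 8]
print(out_of_domain_counts(fake_samples, num_categories=10))
```

An empty dictionary would mean every index is valid. Any other result lists each offending index with its frequency, which could help tell a single stray value apart from a broader range of invalid indices.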

Versions

Collecting environment information...
PyTorch version: 2.4.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.7 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: Could not collect
Libc version: N/A

Python version: 3.9.6 (default, Feb 3 2024, 15:58:27) [Clang 15.0.0 (clang-1500.3.9.4)] (64-bit runtime)
Python platform: macOS-14.7-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU: Apple M3 Max

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==2.0.2
[pip3] storchastic==0.3.7
[pip3] torch==2.4.1
[pip3] torchvision==0.19.1
[conda] No relevant packages

cc @kulinseth @albanD @malfet @DenisVieriu97 @jhavukainen

pytorch / pytorch

The MPS Backend sometimes samples outside of distribution domain with `multinomial` #136623
