pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

`torch.multinomial` generates incorrect distribution #132395

Open ptrblck opened 1 month ago

ptrblck commented 1 month ago

🐛 Describe the bug

KFrank and jeffc narrowed down an issue where torch.multinomial generates duplicated permutations after only a small number of iterations. Details can be found here: https://discuss.pytorch.org/t/bug-in-torch-multinomial-generated-distribution-is-modestly-incorrect-edit-this-is-a-regression-and-appears-to-be-due-to-an-analogous-bug-in-tensor-exponential/207149/5

Cross-posting:

That's really interesting that it is exponential which seems to be the root cause, and multinomial surfaces the bug just because it has a dependency on exponential. It definitely broadens the scope of code that might be impacted.

I am not sure, but I think the bug may originate with the C++ implementation of exponential_ which may be in this file - pytorch/aten/src/ATen/native/cpu/DistributionKernels.cpp at 4c2bcf92cbecd36b7881904bceb8dc50c9b9741d · pytorch/pytorch · GitHub

The function exponential_kernel() seems to get a seed from the PyTorch random number generator, then use it as the seed to a different random number generator.
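To illustrate why that pattern can be a problem, here is a minimal pure-Python sketch, not the actual PyTorch or MKL code. The names `child_uniforms` and `sample_with_reseed` are hypothetical, and the constants are the ones I believe Intel documents for MCG31m1, so treat them as assumptions. The point is that once a seed is handed to a small-state generator, the output is fully determined by that seed, however good the parent generator is.

```python
import random

# Assumed constants for a 31-bit multiplicative congruential generator.
M = 2**31 - 1      # modulus
A = 1132489760     # multiplier

def child_uniforms(seed, n):
    """Return n uniforms in (0, 1) from the small-state child generator."""
    x = seed % M or 1          # state 0 is a fixed point, so map it to 1
    out = []
    for _ in range(n):
        x = (A * x) % M
        out.append(x / M)
    return out

def sample_with_reseed(parent, n):
    """Hypothetical stand-in: draw a seed from the parent, seed a fresh child."""
    seed = parent.getrandbits(31)
    return seed, child_uniforms(seed, n)

parent = random.Random(0)
seed, values = sample_with_reseed(parent, 4)

# The result depends only on the 31-bit seed, not on the rest of the parent state.
same_again = child_uniforms(seed, 4)
```

Under these assumptions there are at most `M - 1` distinct output sequences of any length, because the child has only that many usable states.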

I can add more details once I'm back at my workstation.

Versions

torch==2.4.0

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @fritzo @neerajprad @alicanb @nikitaved @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @frank-wei

jeffcoop commented 1 month ago

I updated the discussion thread with details about why the use of VSL_BRNG_MCG31 in vslNewStream dooms exponential to returning just a subset of a predetermined ordered list of (2^31-1) possible values. This means any sequence returned from exponential_ is just a block of N sequential elements from the larger series of roughly 2 billion. You'll always see the same exact series of numbers generated in the same exact order.
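The "block of N sequential elements" behaviour can be sketched in plain Python. This is an illustration under assumptions, not the MKL implementation: the multiplier and modulus are the values I believe are documented for MCG31m1, the helper names `mcg_states` and `exponentials` are made up, and the inverse-CDF transform is only one possible way to turn uniforms into exponentials.

```python
import math

# Assumed MCG31m1-style constants.
M = 2**31 - 1
A = 1132489760

def mcg_states(seed, n):
    """Return the next n integer states of the generator after seeding."""
    x = seed % M or 1          # avoid the degenerate all-zero state
    states = []
    for _ in range(n):
        x = (A * x) % M
        states.append(x)
    return states

def exponentials(seed, n, lambd=1.0):
    """Map each state to an exponential variate via the inverse CDF."""
    return [-math.log(s / M) / lambd for s in mcg_states(seed, n)]

first = mcg_states(42, 6)
# Seeding with a value that already lies on the cycle just shifts the window.
second = mcg_states(first[0], 5)
overlap = second == first[1:]
```

Two calls whose seeds land close together on the cycle therefore return overlapping runs of values in the same order, which is one way duplicated samples could show up.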

bernoulli or any other method using VSL_BRNG_MCG31 likely suffers from a similar problem.

Here's the discussion thread again for reference: https://discuss.pytorch.org/t/bug-in-torch-multinomial-generated-distribution-is-modestly-incorrect-edit-this-is-a-regression-and-appears-to-be-due-to-an-analogous-bug-in-tensor-exponential/207149/5

malfet commented 1 month ago

Please note that this only affects the x86_64 platform. On aarch64 the default RNG is used, which seems to be free from that issue.