
Crash with SIGFPE due to unhandled cases in distributions.MultivariateNormal #8508

Closed praveen-palanisamy closed 6 years ago

praveen-palanisamy commented 6 years ago

Issue description

With the scalar (0-dimensional Tensor) support introduced in PyTorch 0.4, torch.distributions.MultivariateNormal crashes if loc (the mean of the distribution) is a scalar, even though such an input is currently accepted. The scalar neither triggers a ValueError in torch.distributions.MultivariateNormal.__init__ nor is caught by the real_vector constraint on the loc argument.

A minimal script that reproduces the otherwise uninformative SIGFPE crash is below.

Code example

#!/usr/bin/env python
"""
Script to test/reproduce crashes with SIGFPE due to unhandled cases (scalar loc) in distributions.MultivariateNormal
"""
import torch

def test_univariate_scalar_input(loc=0.5, variance=0.1):
    mu = torch.tensor(loc)
    sigma = torch.tensor(variance)
    distribution = torch.distributions.MultivariateNormal(mu, torch.eye(1) * sigma)
    sample = distribution.sample()
    print(sample)

def test_univariate_scalar_input_with_args_validation(loc=0.5, variance=0.1):
    mu = torch.tensor(loc)
    sigma = torch.tensor(variance)
    distribution = torch.distributions.MultivariateNormal(mu, torch.eye(1) * sigma, validate_args=True)
    sample = distribution.sample()
    print(sample)

def test_univariate_input(loc=[0.5], variance=0.1):
    mu = torch.tensor(loc)
    sigma = torch.tensor(variance)
    distribution = torch.distributions.MultivariateNormal(mu, torch.eye(1) * sigma)
    sample = distribution.sample()
    print(sample)

def test_univariate_input_with_args_validation(loc=[0.5], variance=0.1):
    mu = torch.tensor(loc)
    sigma = torch.tensor(variance)
    distribution = torch.distributions.MultivariateNormal(mu, torch.eye(1) * sigma, validate_args=True)
    sample = distribution.sample()
    print(sample)

if __name__ == "__main__":
    test_univariate_scalar_input(loc=0.5, variance=0.1)  # Crashes with Floating point exception (SIGFPE)
    # test_univariate_scalar_input_with_args_validation(loc=0.5, variance=0.1)  # Crashes with Floating point exception (SIGFPE)
    # test_univariate_input(loc=[0.5], variance=0.1)  # Runs without errors; haven't verified that the samples come from the correct normal distribution
    # test_univariate_input_with_args_validation(loc=[0.5], variance=0.1)  # Runs without errors; haven't verified that the samples come from the correct normal distribution

I will be happy to submit a PR if you think this needs a fix.

System Info

vishwakftw commented 6 years ago

The error occurs due to the _batch_mv call in the rsample method. When loc is a scalar, eps is tensor([]), which seems to be what causes the issue.

https://github.com/pytorch/pytorch/blob/302408e6c225bdd0fe9c6af9108c95d10dfb6ce4/torch/distributions/multivariate_normal.py#L171-L174
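
To see where the empty eps comes from, here is a minimal sketch of the shape arithmetic (paraphrasing the logic around the linked lines, not quoting it):

import torch

loc = torch.tensor(0.5)            # scalar, 0-dimensional
batch_shape = loc.shape[:-1]       # torch.Size([])
event_shape = loc.shape[-1:]       # also torch.Size([]) for a 0-dim loc
shape = batch_shape + event_shape  # empty, so the extended sample shape is ()
eps = loc.new(*shape).normal_()    # unpacks to loc.new(): a 0-element tensor
print(eps)                         # tensor([])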

cc: @fritzo @apaszke

praveen-palanisamy commented 6 years ago

Below is the gdb log with the backtrace:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGFPE, Arithmetic exception.
0x00007fffd1072cee in at::native::reshape(at::Tensor const&, at::ArrayRef<long>) ()
   from /home/praveen/anaconda3/envs/DRL/lib/python3.5/site-packages/torch/lib/libATen.so
(gdb) bt
#0  0x00007fffd1072cee in at::native::reshape(at::Tensor const&, at::ArrayRef<long>) ()
   from /home/praveen/anaconda3/envs/DRL/lib/python3.5/site-packages/torch/lib/libATen.so
#1  0x00007fffd12f5996 in at::Type::reshape(at::Tensor const&, at::ArrayRef<long>) const ()
   from /home/praveen/anaconda3/envs/DRL/lib/python3.5/site-packages/torch/lib/libATen.so
#2  0x00007fffe791eef6 in torch::autograd::VariableType::reshape(at::Tensor const&, at::ArrayRef<long>) const ()
   from /home/praveen/anaconda3/envs/DRL/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so
#3  0x00007fffe7b68b5b in torch::autograd::THPVariable_reshape ()
   from /home/praveen/anaconda3/envs/DRL/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so
#4  0x00005555556a0718 in PyCFunction_Call ()
#5  0x00005555556f648c in PyEval_EvalFrameEx ()
#6  0x00005555556f6b40 in PyEval_EvalFrameEx ()
#7  0x00005555556fb2d0 in PyEval_EvalFrameEx ()
#8  0x00005555556fb2d0 in PyEval_EvalFrameEx ()
#9  0x00005555556fb2d0 in PyEval_EvalFrameEx ()
#10 0x0000555555700c3d in PyEval_EvalCodeEx ()
#11 0x0000555555701b6c in PyEval_EvalCode ()
#12 0x000055555575ed54 in run_mod ()
#13 0x00005555557603c1 in PyRun_FileExFlags ()
#14 0x00005555557605de in PyRun_SimpleFileExFlags ()
#15 0x0000555555760c8d in Py_Main ()
#16 0x000055555562c031 in main ()
fritzo commented 6 years ago

I believe a scalar loc should not be allowed for MultivariateNormal; we should instead add a check:

def __init__(...):
    if loc.dim() < 1:
        raise ValueError("loc must be at least one-dimensional")

Alternatively, we could broadcast scalar loc up to a 1-dimensional tensor in __init__().
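
The broadcasting alternative, done by hand at the call site for illustration (a sketch, not the eventual fix):

import torch

loc = torch.tensor(0.5)
if loc.dim() == 0:
    loc = loc.unsqueeze(0)  # broadcast the scalar up to a 1-d tensor
m = torch.distributions.MultivariateNormal(loc, torch.eye(1))
print(m.sample())           # samples without crashing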

praveen-palanisamy commented 6 years ago

Though one shouldn't try to use MultivariateNormal for a univariate, scalar loc, I would :+1: the alternative, as it adds no overhead while being user-friendly. I think this is better than raising an exception, which would most likely prompt users to add a fake dimension to the scalar and retry, and that already works with MultivariateNormal on master. What do you think?

ssnl commented 6 years ago

Also, pure Python code shouldn't crash with SIGFPE. We should fix this.

vishwakftw commented 6 years ago

Is it rather this line that is causing the trouble? I think it should be self.loc.new(shape).normal_().

https://github.com/pytorch/pytorch/blob/302408e6c225bdd0fe9c6af9108c95d10dfb6ce4/torch/distributions/multivariate_normal.py#L173
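
The difference is in how an empty shape is interpreted (a small demonstration, assuming the legacy Tensor.new semantics where a torch.Size argument is treated as a shape):

import torch

t = torch.tensor(0.5)
shape = torch.Size([])
print(t.new(*shape).shape)  # unpacks to t.new(): torch.Size([0]), 0 elements
print(t.new(shape).shape)   # keeps the Size: torch.Size([]), a 0-dim tensor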

praveen-palanisamy commented 6 years ago

@vishwakftw: Did you test your suggestion? On its own it causes a RuntimeError (dimension specified as -1 but tensor has no dimensions) at this line: https://github.com/pytorch/pytorch/blob/302408e6c225bdd0fe9c6af9108c95d10dfb6ce4/torch/distributions/multivariate_normal.py#L32

I think the SIGFPE arises when MultivariateNormal's self.scale_tril is expanded and reshaped to have .dim() == 3 (so as to conform with torch.bmm's interface) while eps is a scalar. Specifically, in this line: https://github.com/pytorch/pytorch/blob/302408e6c225bdd0fe9c6af9108c95d10dfb6ce4/torch/distributions/multivariate_normal.py#L36

The root cause, in my opinion, is that the self._extended_shape(...) method in distribution.py fails to upcast the returned shape to (1,) when both batch_shape and event_shape are empty, which causes eps to be a 0-dimensional tensor and leads to the SIGFPE.
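
For reference, a sketch of that method's relevant logic (paraphrased, not quoted):

def _extended_shape(self, sample_shape=torch.Size()):
    # sample_shape + batch_shape + event_shape; all three can be empty,
    # in which case the returned shape is () and eps ends up 0-dimensional
    return sample_shape + self._batch_shape + self._event_shape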

In PR #8543, I upcast loc in MultivariateNormal.__init__(...) if it is a scalar. To me, that looked more transparent and easier to follow than doing it inside the _extended_shape(...) method by upcasting event_shape, even though loc's shape is the root cause. Let me know if someone thinks it is better to do it in _extended_shape(...) for some reason (such as taking care of scalars in other distributions as well).

vishwakftw commented 6 years ago

Two changes are needed in the code (sketched below):

  1. Change bvec.size(-1) to bmat.size(-1) in https://github.com/pytorch/pytorch/blob/302408e6c225bdd0fe9c6af9108c95d10dfb6ce4/torch/distributions/multivariate_normal.py#L32

  2. Change *shape to shape in https://github.com/pytorch/pytorch/blob/302408e6c225bdd0fe9c6af9108c95d10dfb6ce4/torch/distributions/multivariate_normal.py#L173
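
For reference, a sketch of the two edits (paraphrasing the code at the linked lines, not an exact diff):

# 1. In _batch_mv: read n from the matrix, whose last dimension always
#    exists, rather than from bvec, which is 0-dim for a scalar loc.
n = bmat.size(-1)                    # was: n = bvec.size(-1)

# 2. In rsample: pass the torch.Size itself so an empty shape builds a
#    0-dim tensor instead of unpacking into new() (a 0-element tensor).
eps = self.loc.new(shape).normal_()  # was: eps = self.loc.new(*shape).normal_()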

With these changes, the sample function works fine, and the distributions test suite passes as well:

>>> import torch
>>> m = torch.distributions.MultivariateNormal(torch.tensor(0.1), torch.tensor(0.5) * torch.eye(1))
>>> m.sample()
tensor([-0.2011])