Closed praveen-palanisamy closed 6 years ago
The error occurs due to the _batch_mv in the rsample method. When loc is a scalar, the eps is tensor([]), which seems to be causing an issue.
cc: @fritzo @apaszke
Below is the gdb log with the backtrace:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Program received signal SIGFPE, Arithmetic exception.
0x00007fffd1072cee in at::native::reshape(at::Tensor const&, at::ArrayRef<long>) ()
from /home/praveen/anaconda3/envs/DRL/lib/python3.5/site-packages/torch/lib/libATen.so
(gdb) bt
#0 0x00007fffd1072cee in at::native::reshape(at::Tensor const&, at::ArrayRef<long>) ()
from /home/praveen/anaconda3/envs/DRL/lib/python3.5/site-packages/torch/lib/libATen.so
#1 0x00007fffd12f5996 in at::Type::reshape(at::Tensor const&, at::ArrayRef<long>) const ()
from /home/praveen/anaconda3/envs/DRL/lib/python3.5/site-packages/torch/lib/libATen.so
#2 0x00007fffe791eef6 in torch::autograd::VariableType::reshape(at::Tensor const&, at::ArrayRef<long>) const ()
from /home/praveen/anaconda3/envs/DRL/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so
#3 0x00007fffe7b68b5b in torch::autograd::THPVariable_reshape ()
from /home/praveen/anaconda3/envs/DRL/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so
#4 0x00005555556a0718 in PyCFunction_Call ()
#5 0x00005555556f648c in PyEval_EvalFrameEx ()
#6 0x00005555556f6b40 in PyEval_EvalFrameEx ()
#7 0x00005555556fb2d0 in PyEval_EvalFrameEx ()
#8 0x00005555556fb2d0 in PyEval_EvalFrameEx ()
#9 0x00005555556fb2d0 in PyEval_EvalFrameEx ()
#10 0x0000555555700c3d in PyEval_EvalCodeEx ()
#11 0x0000555555701b6c in PyEval_EvalCode ()
#12 0x000055555575ed54 in run_mod ()
#13 0x00005555557603c1 in PyRun_FileExFlags ()
#14 0x00005555557605de in PyRun_SimpleFileExFlags ()
#15 0x0000555555760c8d in Py_Main ()
#16 0x000055555562c031 in main ()
I believe loc as scalar should not be allowed for MultivariateNormal; we should instead add a check:
def __init__(...):
    if loc.dim() < 1:
        raise ValueError
Alternatively, we could broadcast scalar loc up to a 1-dimensional tensor in __init__().
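The two options above (reject a scalar loc, or upcast it to 1-D) can be sketched roughly as follows. This is only an illustration, not the code from any PR: prepare_loc and its upcast flag are hypothetical names, and the error message is made up for the example.

```python
import torch


def prepare_loc(loc, upcast=True):
    """Hypothetical helper: make sure `loc` has at least one dimension.

    upcast=True  -> turn a 0-dimensional loc into a tensor of shape (1,)
    upcast=False -> reject a 0-dimensional loc with a ValueError
    """
    loc = torch.as_tensor(loc)
    if loc.dim() < 1:
        if not upcast:
            # Option 1: refuse scalars outright (message is illustrative).
            raise ValueError("loc must be at least one-dimensional")
        # Option 2: promote the scalar to a 1-element vector.
        loc = loc.reshape(1)
    return loc


scalar_loc = torch.tensor(0.1)
vector_loc = prepare_loc(scalar_loc)  # 0-dim input becomes shape (1,)
```

Either branch could sit at the top of __init__; the upcast variant keeps scalar inputs working, while the check makes the restriction explicit.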
Though one shouldn't try to use MultivariateNormal for a univariate, scalar loc, I would +1 the alternative, as it adds no overhead while being more user-friendly. I think this is better than raising an exception, which would most likely prompt users to add a fake dimension to the scalar and try again, which works with MultivariateNormal on master anyway.
What do you think?
Also, pure Python code shouldn't crash with SIGFPE. We should fix this.
Is it rather this line causing the trouble? I think it should be self.loc.new(shape).normal_().
@vishwakftw: Did you test it with your suggestion? It will cause a RuntimeError, "dimension specified as -1 but tensor has no dimensions", at this line:
https://github.com/pytorch/pytorch/blob/302408e6c225bdd0fe9c6af9108c95d10dfb6ce4/torch/distributions/multivariate_normal.py#L32
I think the SIGFPE arises when MultivariateNormal's self.scale_tril is expanded and reshaped to have .dim() == 3 (so as to conform with torch.bmm's interface) when eps is a scalar. Specifically, in this line: https://github.com/pytorch/pytorch/blob/302408e6c225bdd0fe9c6af9108c95d10dfb6ce4/torch/distributions/multivariate_normal.py#L36
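One plausible way a reshape can end in SIGFPE is sketched below in pure Python. It assumes (this is not confirmed anywhere in the thread) that a -1 entry is resolved by dividing the element count by the product of the known sizes, so that an event size of 0 makes the divisor 0. The function infer_shape is a hypothetical model, not the ATen implementation; in Python the same step raises ZeroDivisionError rather than crashing the process.

```python
from math import prod


def infer_shape(numel, shape):
    """Hypothetical model of resolving a single -1 entry in a reshape.

    numel: total number of elements in the tensor being reshaped
    shape: requested shape, with at most one -1 placeholder
    """
    known = prod(d for d in shape if d != -1)  # product of the known sizes
    if -1 not in shape:
        if known != numel:
            raise ValueError("shape does not match the number of elements")
        return tuple(shape)
    # Assumed faulting step: integer division by the product of known sizes.
    inferred = numel // known
    return tuple(inferred if d == -1 else d for d in shape)


# A well-formed case: 6 elements viewed as (-1, 2, 1).
ok = infer_shape(6, (-1, 2, 1))

# An empty eps gives n == 0, so the request looks like (-1, 0, 0).
try:
    infer_shape(0, (-1, 0, 0))
except ZeroDivisionError as exc:
    failure = exc  # the C++ analogue would be an integer divide fault
```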
The root cause, in my opinion, is that the self._extended_shape(...) method in distribution.py fails to upcast the returned shape to (1,) when the batch_shape and the event_shape are both empty, which causes eps to be a 0-dimensional tensor, leading to the SIGFPE.
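To make the shape argument concrete, here is a small model using plain tuples in place of torch.Size. It assumes the extended shape is simply the concatenation of sample, batch and event shapes; both function names are hypothetical and only illustrate what an upcast inside _extended_shape(...) would look like.

```python
def extended_shape(sample_shape=(), batch_shape=(), event_shape=()):
    """Model of the extended shape: plain concatenation of the three parts."""
    return tuple(sample_shape) + tuple(batch_shape) + tuple(event_shape)


def extended_shape_upcast(sample_shape=(), batch_shape=(), event_shape=()):
    """Same as above, but never returns an empty shape."""
    shape = extended_shape(sample_shape, batch_shape, event_shape)
    # With a scalar loc, batch_shape and event_shape are both empty, so the
    # result would be (); fall back to a single-element shape instead.
    return shape if shape else (1,)


scalar_case = extended_shape()          # empty shape for a scalar loc
fixed_case = extended_shape_upcast()    # upcast to a 1-element shape
```

Non-empty shapes pass through unchanged, so the fallback only affects the scalar case.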
In PR #8543, I upcast loc in MultivariateNormal.__init__(...) if loc is a scalar. To me, that looked more transparent and easier to follow than doing it inside the _extended_shape(...) method by upcasting the event_shape, given that the shape of loc was the root cause. Let me know if someone thinks it is better to do it in _extended_shape(...) for some reason (like taking care of scalars in other distributions as well).
Two changes are needed in the code:
1. bvec.size(-1) to bmat.size(-1) in https://github.com/pytorch/pytorch/blob/302408e6c225bdd0fe9c6af9108c95d10dfb6ce4/torch/distributions/multivariate_normal.py#L32
2. *shape to shape in https://github.com/pytorch/pytorch/blob/302408e6c225bdd0fe9c6af9108c95d10dfb6ce4/torch/distributions/multivariate_normal.py#L173
The sample function works fine. The distributions test suite passes as well.
>>> import torch
>>> m = torch.distributions.MultivariateNormal(torch.tensor(0.1), torch.tensor(0.5) * torch.eye(1))
>>> m.sample()
tensor([-0.2011])
Issue description
With the scalar support in Tensor from PyTorch 0.4, torch.distributions.MultivariateNormal crashes if loc (mean of the distribution) is a scalar (0-dimensional Tensor), although such an input is currently valid. It neither raises a ValueError in torch.distributions.MultivariateNormal.__init__ nor is caught by the real_vector constraint on the loc argument. A minimal test code is below to reproduce the uninformative SIGFPE crash.
Code example
I will be happy to submit a PR if you think this needs a fix.
System Info