pytorch / extension-cpp

C++ extensions in PyTorch

Consistency problem with check and modules regarding bias #10

Closed ClementPinard closed 6 years ago

ClementPinard commented 6 years ago

The problem only occurs on pytorch master, because its backprop engine is less lenient about gradient shapes. When running benchmark.py cpp (or cuda):

Traceback (most recent call last):
  File "benchmark.py", line 43, in <module>
    (new_h.sum() + new_C.sum()).backward()
  File "/home/cpinard/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/cpinard/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function LLTMFunctionBackward returned an invalid gradient at index 2 - expected shape [384] but got [1, 384]

This is because the module's bias parameter has size 3 * state_size while the backward outputs a tensor of size 1 x 3 * state_size. The problem also exists on torch 0.4.0, but there the backprop engine doesn't complain since the number of elements is the same.
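For illustration, here is a minimal sketch of the two shapes the engine compares (state_size = 128 matches the 384 in the error above; the batch size and variable names are made up):

import torch

state_size = 128
batch = 16  # made-up batch size

# the module registers bias as a 1-D parameter of size 3 * state_size ...
bias = torch.nn.Parameter(torch.zeros(3 * state_size))

# ... but the hand-written backward reduces the gate gradients with keepdim=True,
# so the gradient returned for bias stays 2-D
d_gates = torch.randn(batch, 3 * state_size)
d_bias = d_gates.sum(dim=0, keepdim=True)

print(bias.shape, d_bias.shape)  # torch.Size([384]) torch.Size([1, 384]) -> master rejects this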

So the solution could be to remove the keepdim=True in the d_bias computation, e.g. here (it's the same for the python baseline, cpp and cuda).
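Concretely, the change would be a one-liner of this shape (sketched with made-up variable names; the real line is the d_bias reduction in each backward):

# current: keeps the summed-out dimension, gradient has shape [1, 3 * state_size]
d_bias = d_gates.sum(dim=0, keepdim=True)
# proposed: drop keepdim, gradient has shape [3 * state_size] like the parameter
d_bias = d_gates.sum(dim=0)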

But then you get the opposite error message when running check.py and grad_check.py:

Traceback (most recent call last):
  File "check.py", line 107, in <module>
    check_backward(variables, options.cuda, options.verbose)
  File "check.py", line 53, in check_backward
    (baseline_values[0] + baseline_values[1]).sum().backward()
  File "/home/cpinard/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/cpinard/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function LLTMFunctionBackward returned an invalid gradient at index 2 - expected shape [1, 15] but got [15]

This is because the bias given to the function is now of size 1 x 15!

The solution is pretty simple, but we need to decide which bias shape to standardize on:
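Whichever way it goes, the module parameter, the tensors created by check.py / grad_check.py and the gradient returned by the backward just have to agree. Roughly (shapes only, names illustrative):

# Option A: 1-D bias everywhere
bias = torch.nn.Parameter(torch.zeros(3 * state_size))      # module / check scripts use [3 * state_size]
d_bias = d_gates.sum(dim=0)                                  # backward returns [3 * state_size]

# Option B: 2-D bias everywhere
bias = torch.nn.Parameter(torch.zeros(1, 3 * state_size))    # module / check scripts use [1, 3 * state_size]
d_bias = d_gates.sum(dim=0, keepdim=True)                     # backward returns [1, 3 * state_size]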

goldsborough commented 6 years ago

We recently added code to verify the gradient shape (https://github.com/pytorch/pytorch/pull/8168), so it's expected that this would break. I'll fix it.

goldsborough commented 6 years ago

Fixed on master

ClementPinard commented 6 years ago

Thanks for fixing it. However, you forgot to change it in python/lltm_baseline.py. That module is actually never called, whether from benchmark.py or [grad_]check.py, so it doesn't trigger an error, but if you call it you will also have the problem.
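For what it's worth, a quick way to trigger it (assuming the baseline module is importable as python.lltm_baseline.LLTM; names and sizes here are from memory, not from the repo):

import torch
from python.lltm_baseline import LLTM  # assumed import path

batch_size, input_features, state_size = 16, 32, 128
X = torch.randn(batch_size, input_features)
h = torch.randn(batch_size, state_size)
C = torch.randn(batch_size, state_size)

rnn = LLTM(input_features, state_size)
new_h, new_C = rnn(X, (h, C))
(new_h.sum() + new_C.sum()).backward()  # raises the same "invalid gradient" error on master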