ucbrise / actnn

ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training

QConv1d: no valid convolution algorithms available in CuDNN #6

Closed · xesdiny closed this issue 3 years ago

xesdiny commented 3 years ago

In https://github.com/ucbrise/actnn/blob/main/tests/test_conv_layer.py, lines 52-56:

I get this error when trying to run test_conv_layer.py; here is the stack trace:

~/code/actnn/tests$ CUDA_VISIBLE_DEVICES=1 python test_conv_layer.py
Conv1d(100, 4, kernel_size=(3,), stride=(2,), groups=2)
QConv1d(100, 4, kernel_size=(3,), stride=(2,), groups=2)
torch.Size([4, 50, 3])
torch.Size([10, 100, 2000]) tensor([2, 0, 3, 0, 0, 2, 3, 0, 2, 1], device='cuda:0')
Traceback (most recent call last):
  File "test_conv_layer.py", line 60, in <module>
    test(layer, qlayer, x, y)
  File "test_conv_layer.py", line 33, in test
    grads.append(get_grad(qlayer))
  File "test_conv_layer.py", line 27, in get_grad
    loss.backward()
  File "/data/users/root/anaconda3/envs/jukebox/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/data/users/root/anaconda3/envs/jukebox/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
  File "/data/users/root/anaconda3/envs/jukebox/lib/python3.7/site-packages/torch/autograd/function.py", line 89, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore
  File "/data/users/root/code/actnn/actnn/actnn/ops.py", line 244, in backward
    return convnd.run_backward(1, ctx, grad_output, [0, 2], _single)
  File "/data/users/root/code/actnn/actnn/actnn/ops.py", line 225, in run_backward
    [ctx.needs_input_grad[0], ctx.needs_input_grad[1]])
RuntimeError: no valid convolution algorithms available in CuDNN
xesdiny commented 3 years ago

Is anybody there?

merrymercy commented 3 years ago

Conv1d has bugs; we forgot to delete the code.

Please do not use conv1d for now. You will have to manually rewrite your model with Conv2d. Since conv1d can be treated as a special case of conv2d, the rewrite is not hard; PyTorch also does this internally for nn.Conv1d.
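For example, here is a minimal sketch of that rewrite (the Conv1dViaConv2d name is made up for illustration, it is not part of actnn): unsqueeze the (N, C, L) input to (N, C, 1, L), run a Conv2d whose kernel, stride, and padding act only along the last dimension, then squeeze the dummy dimension back out. To use ActNN's compressed layer, one would presumably swap nn.Conv2d for actnn.QConv2d inside the wrapper, assuming it takes the same constructor arguments as nn.Conv2d.

import torch
import torch.nn as nn


class Conv1dViaConv2d(nn.Module):
    """Implements a 1-D convolution with nn.Conv2d by adding a dummy
    height dimension of size 1 (hypothetical helper, not from the repo)."""

    def __init__(self, in_channels, out_channels, kernel_size, stride=1,
                 padding=0, dilation=1, groups=1, bias=True):
        super().__init__()
        # Kernel, stride, padding, and dilation act only along the width
        # (the original length) dimension; the height dimension stays 1.
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=(1, kernel_size),
                              stride=(1, stride),
                              padding=(0, padding),
                              dilation=(1, dilation),
                              groups=groups, bias=bias)

    def forward(self, x):
        # (N, C, L) -> (N, C, 1, L) -> conv2d -> (N, C_out, 1, L_out) -> (N, C_out, L_out)
        return self.conv(x.unsqueeze(2)).squeeze(2)


# Quick check against nn.Conv1d using the same weights and the shapes
# from the test script above.
conv1d = nn.Conv1d(100, 4, kernel_size=3, stride=2, groups=2)
wrapped = Conv1dViaConv2d(100, 4, kernel_size=3, stride=2, groups=2)
with torch.no_grad():
    # Conv1d weight (O, I/g, k) -> Conv2d weight (O, I/g, 1, k)
    wrapped.conv.weight.copy_(conv1d.weight.unsqueeze(2))
    wrapped.conv.bias.copy_(conv1d.bias)

x = torch.randn(10, 100, 2000)
print(torch.allclose(conv1d(x), wrapped(x), atol=1e-5))  # should print True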

xesdiny commented 3 years ago

Thx a lot~