haste_pytorch: Gradient for kernel/recurrent_kernel becomes zero when trained on gpu

tyterry commented 3 years ago

Hi, I have been trying out haste_pytorch (the training speed of haste is phenomenal!), but I found that the gradients for kernel/recurrent_kernel become zero when the model is trained on a GPU. Below is a simple code snippet I used to test this:

import torch
import haste_pytorch as haste

lstm_layer = haste.LSTM(input_size=128, hidden_size=256, batch_first=True)
output = torch.nn.Linear(256 * 5, 1)  # 5 time steps x 256 hidden units, flattened

# Move both modules to the GPU; replacing cuda() with cpu() gives non-zero gradients.
lstm_layer.cuda()
output.cuda()

x = torch.rand([1, 5, 128]).cuda()  # (batch, time, features)
target = torch.zeros(1).cuda()
loss_func = torch.nn.MSELoss()
optim = torch.optim.Adam(list(lstm_layer.parameters()) + list(output.parameters()))

for i in range(5):
    y, _ = lstm_layer(x)
    y = y.contiguous().view(1, -1)
    y = output(y).squeeze()

    loss = loss_func(y, target)
    loss.backward()
    optim.step()
    # Inspect the gradients before they are cleared for the next iteration.
    for n, p in lstm_layer.named_parameters():
        print(n, p.grad)
    optim.zero_grad()

Printed output:

kernel tensor([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], device='cuda:0')
recurrent_kernel tensor([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], device='cuda:0')
bias tensor([-1.8202e-10, 3.7714e-09, 2.8942e-09, ..., 1.0455e-08, 2.6969e-09, 1.6647e-08], device='cuda:0')

The gradients for kernel/recurrent_kernel become non-zero once the "cuda()" calls are replaced by "cpu()".
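For anyone trying to narrow this down, the CPU/GPU comparison can be condensed into one number per parameter instead of a full tensor dump. The following is only a rough sketch: the helper names (grad_summary, zero_grad_params, grads_on) are made up for illustration, and torch.nn.LSTM is used as a stand-in so the snippet runs without haste installed. Swapping in a lambda that builds haste.LSTM should exercise the layer from the report.

```python
import torch


def grad_summary(module):
    """Map each parameter name to the largest absolute value in its gradient."""
    return {
        name: p.grad.abs().max().item()
        for name, p in module.named_parameters()
        if p.grad is not None
    }


def zero_grad_params(module):
    """Names of parameters whose gradient is exactly zero everywhere."""
    return [name for name, g in grad_summary(module).items() if g == 0.0]


def grads_on(device, make_layer, x):
    """Build a fresh layer on `device`, run one backward pass, summarize grads."""
    torch.manual_seed(0)  # same initial weights on every device
    layer = make_layer().to(device)
    out = layer(x.to(device))
    out = out[0] if isinstance(out, tuple) else out  # LSTM layers return (y, state)
    out.sum().backward()
    return grad_summary(layer)


# Stand-in layer (assumption): replace with haste.LSTM to test the reported case.
make_layer = lambda: torch.nn.LSTM(input_size=8, hidden_size=4, batch_first=True)
x = torch.rand(1, 5, 8)

devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
for device in devices:
    print(device, grads_on(device, make_layer, x))
```

A parameter whose value is exactly 0.0 on "cuda" but not on "cpu" would match the behaviour described above.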

I would be most grateful for any insight you can provide.

Many thanks for your help.

sharvil commented 3 years ago

Hmm, that's unusual. I ran your code on 1080Ti and 2080Ti GPUs and I get non-zero gradients. Can you share which GPU, CUDA version, and PyTorch version you're using?
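One possible way to collect the details requested above in a single paste is sketched below. The function name describe_env is hypothetical; the PyTorch attributes it reads (torch.__version__, torch.version.cuda, torch.cuda.is_available, torch.cuda.get_device_name, torch.cuda.get_device_capability) are standard. The CUDA version reported is the one PyTorch was built against, which may differ from the toolkit installed on the system.

```python
import importlib
import platform


def describe_env():
    """Collect Python, PyTorch, CUDA, and GPU details for a bug report."""
    info = {"python": platform.python_version()}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this environment
        return info

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


for key, value in describe_env().items():
    print(f"{key}: {value}")
```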

sharvil commented 3 years ago

I ran the code above in a Colab notebook and wasn't able to reproduce there either (K80 GPU, PyTorch 1.8, CUDA 10.1).

tyterry commented 3 years ago

Thanks for your prompt reply! I am currently using a GTX 1060 with CUDA 10.1 and PyTorch 1.8. In that case there may be something wrong with my setup. I will try reinstalling the libraries and CUDA to see if that helps, and will update you with the result once I have finished.

sharvil commented 3 years ago

Any update on this issue, @tyterry?

lmnt-com / haste

haste_pytorch: Gradient for kernel/recurrent_kernel becomes zero when trained on gpu #31