salesforce / awd-lstm-lm

LSTM and QRNN Language Model Toolkit for PyTorch

GPU memory and cap #29

Closed cerisara closed 6 years ago

cerisara commented 6 years ago

Hi, training crashed with an out-of-memory error on a Titan X (12GB) with the char-LSTM on enwik8.

The trick about reducing the "cap" on sequence length links to a 404 URL: could you please let me know where I can do that?

Thanks a lot for the great code!

cerisara commented 6 years ago

OK, I think I got it: it's on line 186, isn't it?
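
For reference, the sequence-length sampling in main.py's train() loop looks roughly like this (paraphrased from the repo; exact wording and line numbers may differ across versions), and the commented-out min() is the cap in question:

```python
# main.py, train(): sample a variable sequence length around args.bptt
bptt = args.bptt if np.random.random() < 0.95 else args.bptt / 2.
seq_len = max(5, int(np.random.normal(bptt, 5)))
# Uncommenting this cap avoids the rare very long samples that can OOM:
# seq_len = min(seq_len, args.bptt + 10)
```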

cerisara commented 6 years ago

Nope, uncommenting this line does not help:

| epoch   1 |   600/ 3515 batches | lr 0.00100 | ms/batch 1721.19 | loss  1.57 | ppl     4.82 | bpc    2.270
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "main.py", line 241, in <module>
    train()
  File "main.py", line 198, in train
    output, hidden, rnn_hs, dropped_rnn_hs = model(data, hidden, return_h=True)
  File "/home/xtof/envs/pytorchnew/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xtof/git/awd-lstm-lm/model.py", line 82, in forward
    raw_output, new_h = rnn(raw_output, hidden[l])
  File "/home/xtof/envs/pytorchnew/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xtof/git/awd-lstm-lm/weight_drop.py", line 47, in forward
    return self.module.forward(*args)
  File "/home/xtof/envs/pytorchnew/lib/python3.5/site-packages/torch/nn/modules/rnn.py", line 204, in forward
    output, hidden = func(input, self.all_weights, hx)
  File "/home/xtof/envs/pytorchnew/lib/python3.5/site-packages/torch/nn/_functions/rnn.py", line 385, in forward
    return func(input, *fargs, **fkwargs)
  File "/home/xtof/envs/pytorchnew/lib/python3.5/site-packages/torch/autograd/function.py", line 328, in _do_forward
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/home/xtof/envs/pytorchnew/lib/python3.5/site-packages/torch/autograd/function.py", line 350, in forward
    result = self.forward_extended(*nested_tensors)
  File "/home/xtof/envs/pytorchnew/lib/python3.5/site-packages/torch/nn/_functions/rnn.py", line 294, in forward_extended
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/home/xtof/envs/pytorchnew/lib/python3.5/site-packages/torch/backends/cudnn/rnn.py", line 281, in forward
    fn.reserve = torch.cuda.ByteTensor(reserve_size.value)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58

cerisara commented 6 years ago

OK, I managed to make it fit within a 12GB GPU by reducing the bptt down to 100. I don't know what the impact on BPC will be; I'll check in... 50 hours ;-)
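
For anyone hitting the same error: --bptt is an ordinary main.py flag, so no code edit is needed. Something like the following (illustrative only; keep whatever other hyperparameters you were already using):

```
python main.py --data data/enwik8 --bptt 100
```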

cerisara commented 6 years ago

Looks like the impact of reducing bptt to 100 is not huge, as I get BPC=1.17 on the dev set after 50 epochs. So it's a viable option when you get out-of-memory errors!
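
As a side note on reading these logs: the reported loss is the average cross-entropy in nats per character, and ppl and bpc are just base changes of the same number. A quick sanity check in Python:

```python
import math

loss = 1.57                # nats/char, from the training log above
print(math.exp(loss))      # perplexity, ~4.8 (log shows 4.82 before rounding)
print(loss / math.log(2))  # bits per character, ~2.27 (log shows 2.270)
```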