Open wlhgtc opened 6 years ago
@wlhgtc Thank you for your report. I am not sure about the cause, but `cuda out of memory` means exactly that your GPU ran out of memory. Is your GPU memory sufficient? This problem can happen if you sum up Variables directly, for example when accumulating the loss across iterations. If so, `.detach()` or `.data` may help.
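A minimal sketch of the accumulation issue mentioned above, assuming PyTorch 0.4 or later (where `.item()` is available). The tiny `nn.Linear` model, the random data, and the name `running_loss` are hypothetical and only stand in for the real training loop. Adding the loss tensor itself to a running total keeps a reference to each iteration's autograd history, while adding a Python float does not.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real model, loss, and optimizer.
model = nn.Linear(4, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

running_loss = 0.0  # plain Python float, holds no autograd graph
for step in range(3):
    x = torch.randn(8, 4)
    y = torch.randn(8, 1)
    loss = criterion(model(x), y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # `running_loss += loss` would turn running_loss into a tensor that
    # references the history of every step; .item() extracts just the number.
    running_loss += loss.item()

# .detach() returns a tensor that shares data but is cut off from the graph.
detached = loss.detach()
```

On PyTorch 0.3, `loss.data[0]` played the role that `loss.item()` plays here.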
@jojonki Thanks for your reply. I spent the whole day debugging my code. I tested my model layer by layer (with the backward step and optimizer step commented out). The out-of-memory error occurs when I compute the matrix S (S = W[H, U, H∘U]) with a batch size of 60 (as in the paper). But I see that your config uses 20. The model runs fine with a batch size of 30.
Finally, I noticed something with a batch size of 30: memory usage went up to 9 GB at first, then dropped to 7.2 GB and stayed steady. I don't know how you handle the data. I use the torchtext package, which automatically pads the contexts in each batch to the length of the longest context in that batch. I think some contexts (in some batches) are so long that the memory runs out!
So I wonder whether you pad the contexts in each batch the same way I do. And why did you set the batch size to 20?
By the way, I use a GTX 1080 Ti with PyTorch 0.3!
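To get a feel for why the S computation is sensitive to batch size and context length, here is a rough back-of-the-envelope estimate. It assumes the implementation materializes the concatenated tensor [H; U; H∘U] with shape `[batch, context_len, query_len, 6 * hidden]` in float32 before applying W; whether a given implementation does this is an assumption. The sizes used (context length 500, query length 30, hidden size 100) are illustrative, not values taken from this thread's config.

```python
def similarity_input_bytes(batch, context_len, query_len, hidden, bytes_per_float=4):
    """Bytes for one dense [batch, context_len, query_len, 6 * hidden] float tensor."""
    return batch * context_len * query_len * 6 * hidden * bytes_per_float


GIB = 1024 ** 3

# Illustrative sizes only: the longest context in the batch sets context_len
# for every example, because the batch is padded to that length.
for batch in (20, 30, 60):
    size = similarity_input_bytes(batch, context_len=500, query_len=30, hidden=100)
    print(f"batch={batch}: {size / GIB:.2f} GiB for the concatenated tensor alone")
```

The estimate grows linearly in both batch size and padded context length, and autograd keeps such intermediates around for the backward pass, so a single batch with one very long context can cause a spike even when most batches fit.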
Is the memory increasing in your case? Mine runs out of memory in the middle of training.
[20180625-174613] Epoch 0 74.2%, loss_p1: 3.338, loss_p2: 2.325
p1 acc: 9.000% (6077/65000), p2 acc: 10.000% (6521/65000)
75%|████████████████████████████████████████████▋ | 3266/4379 [07:33<02:34, 7.21it/s]
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "main.py", line 237, in <module>
    train(model, train_data, optimizer, ema, start_epoch=args.start_epoch)
  File "main.py", line 153, in train
    (loss_p1+loss_p2).backward()
  File "/home/vimos/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/vimos/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu:58
@Vimos There are some sentences with length > 500. You'd better truncate them to a fixed length (I use 300).
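The truncation suggested above can be sketched in plain Python as follows. The function name `truncate_and_pad`, the `<pad>` token, and the toy batch are hypothetical; this only illustrates the idea and is not how torchtext or this repository handles batching.

```python
def truncate_and_pad(batch, max_len=300, pad_token="<pad>"):
    """Clip each token list to max_len, then pad to the longest remaining one."""
    clipped = [tokens[:max_len] for tokens in batch]
    width = max(len(tokens) for tokens in clipped)
    return [tokens + [pad_token] * (width - len(tokens)) for tokens in clipped]


# Toy batch: one context longer than the limit, two shorter ones.
batch = [["w"] * 520, ["w"] * 120, ["w"] * 80]
padded = truncate_and_pad(batch, max_len=300)
print([len(tokens) for tokens in padded])
```

With the cap in place, no batch is ever padded beyond 300 tokens, which bounds the worst-case memory per batch. One caveat: clipping a context can cut off the answer span, so examples whose answer lies past the limit would need to be dropped or handled separately.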
@wlhgtc Thanks for the advice.
If I keep the default value for length, I have to reduce the batch size to 10. This still requires 7709MiB of memory.
If the memory keeps steady, it's fine. It's a large model.
Thanks for your code, it helps me a lot. I tried to write my own version, but I ran into a problem. When I rewrite the loss function as follows: `class Custom_Loss(nn.Module): def __init__(self): super(Custom_Loss, self).__init__()`
I get the error `cuda out of memory`. I checked my code and could not find the reason. Can you help me?