jojonki / BiDAF

Bidirectional Attention Flow for Machine Comprehension, Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, Hannaneh Hajishirzi https://arxiv.org/abs/1611.01603
Apache License 2.0

cuda out of memory #10

Open wlhgtc opened 6 years ago

wlhgtc commented 6 years ago

Thanks for your code, it helps me a lot. I tried to write my own version, but I ran into a problem. I rewrote the loss function as follows:

```python
import torch
import torch.nn as nn
from torch.autograd import Variable


class Custom_Loss(nn.Module):
    def __init__(self):
        super(Custom_Loss, self).__init__()

    def loss_function(self, data, labels):
        loss = Variable(torch.zeros(1))
        for d, l in zip(data, labels):
            loss -= torch.log(d[l]).cpu()
        loss /= data.size(0)
        return loss

    def forward(self, p1, p2, S, E):
        """
        N is the batch size and T is the context length.

        :param p1: (N, T) tensor of probabilities of each word being the answer start
        :param p2: (N, T) tensor of probabilities of each word being the answer end
        :param S: tensor of gold start positions, one per query
        :param E: tensor of gold end positions, one per query
        :return: loss of the BiDAF model
        """
        l1 = self.loss_function(p1, S)
        l2 = self.loss_function(p2, E)
        loss = l1 + l2
        return loss
```

I get the error "cuda out of memory". I checked my code but could not find the reason. Can you help me?
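As a side note, the per-example Python loop above can also be written as a vectorized gather that computes the same negative log-likelihood. This is only an illustrative sketch, not this repository's code:

```python
import torch
from torch.autograd import Variable

# Vectorized sketch of the same per-example negative log-likelihood
# (illustration only): gather the probability assigned to the gold
# index for each example instead of looping in Python.
def span_nll(p, gold):
    """p: (N, T) probabilities, gold: (N,) LongTensor of gold positions."""
    picked = p.gather(1, gold.unsqueeze(1)).squeeze(1)  # (N,)
    return -torch.log(picked).mean()

# Toy usage with random probabilities.
p1 = Variable(torch.rand(4, 10))
p1 = p1 / p1.sum(1, keepdim=True)  # normalise each row to sum to 1
S = Variable(torch.LongTensor([1, 3, 0, 7]))
loss = span_nll(p1, S)
```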

jojonki commented 6 years ago

@wlhgtc Thank you for your report. I am not sure about this, but "cuda out of memory" means exactly that: your GPU ran out of memory. Is your GPU memory sufficient? If you sum up Variables directly (which keeps their computation graphs alive), this problem can happen. If so, .detach() or .data may be helpful.
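For illustration, here is a minimal sketch with a toy model (not this repository's training loop) showing the difference between summing Variables and summing detached values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

# Toy model and loop, for illustration only.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

running_loss = 0.0
for step in range(100):
    x = Variable(torch.randn(32, 10))
    y = Variable(torch.randn(32, 1))
    loss = F.mse_loss(model(x), y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # running_loss += loss         # keeps every step's graph alive -> memory keeps growing
    running_loss += loss.data[0]   # PyTorch 0.3: a plain float; use loss.item() on 0.4+
```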

wlhgtc commented 6 years ago

@jojonki Thanks for your reply. I debugged my code the whole day, testing the model layer by layer (with the backward and optimizer steps commented out). The "out of memory" error occurs when I compute the similarity matrix S = w^T[H; U; H∘U] with batch size 60 (as in the paper), but I noticed your config uses 20, and the model runs fine at batch size 30. At batch size 30 I also observed that memory first climbed to 9 GB, then dropped to about 7.2 GB and stayed steady.

I don't know how you handle the data. I use the torchtext package, which pads the contexts in each batch to the maximum context length in that batch. I think some contexts are so long that memory runs out in those batches. So I wonder whether you pad contexts per batch the same way, and why you set the batch size to 20?
By the way, I use a GTX 1080 Ti with PyTorch 0.3!
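To see why batch size and padded context length dominate memory here, a rough back-of-the-envelope sketch (hypothetical sequence lengths, d = 100 as in the paper): the naive computation of S = w^T[H; U; H∘U] materialises an (N, T, J, 6d) tensor before reducing it to (N, T, J).

```python
# Back-of-the-envelope sketch (hypothetical sequence lengths) of the
# intermediate tensor built for the similarity matrix.
def similarity_memory_mb(N, T, J, d, bytes_per_float=4):
    elems = N * T * J * 6 * d   # tiled H, tiled U and H*U are each (N, T, J, 2d)
    return elems * bytes_per_float / 1024 ** 2

d = 100         # LSTM hidden size, so each encoding is 2d = 200 wide
T, J = 400, 30  # rough padded context / query lengths
for N in (20, 30, 60):
    print(N, round(similarity_memory_mb(N, T, J, d)), "MB for this tensor alone (before backward)")
```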

Vimos commented 6 years ago

Is the memory increasing in your case? Mine runs out of memory in the middle of training.

```
[20180625-174613] Epoch 0 74.2%, loss_p1: 3.338, loss_p2: 2.325
p1 acc: 9.000% (6077/65000), p2 acc: 10.000% (6521/65000)
 75%|████████████████████████████████████████████▋               | 3266/4379 [07:33<02:34,  7.21it/s]THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "main.py", line 237, in <module>
    train(model, train_data, optimizer, ema, start_epoch=args.start_epoch)
  File "main.py", line 153, in train
    (loss_p1+loss_p2).backward()
  File "/home/vimos/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/vimos/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu:58
```

wlhgtc commented 6 years ago

@Vimos There are some contexts with length > 500. You'd better truncate them to a fixed length (300 works for me).
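For example, with the legacy torchtext Field API discussed above (parameter names from the torchtext releases of that era; they may differ in newer versions), a fixed context length can be enforced with fix_length, which pads short contexts and truncates long ones:

```python
# Sketch of capping the padded context length with the legacy torchtext Field API.
from torchtext import data

CONTEXT = data.Field(lower=True, batch_first=True, fix_length=300)
QUESTION = data.Field(lower=True, batch_first=True)
```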

Vimos commented 6 years ago

@wlhgtc Thanks for the advice. If I keep the default length, I have to drop to a smaller batch size of 10, and that still requires 7709 MiB of memory.

wlhgtc commented 6 years ago

If the memory stays steady, it's fine. It's a large model.