Thanks for your input and work. I found that while R-Net requires a large amount of GPU memory, if you implement it correctly it is possible to fit the whole model on a single GPU with a batch size of 50 or larger. Here are some tips for implementing it:
Thank you for the comments!
Hello, @minsangkim142 and @matthew-z. Sorry to comment under this closed issue. I am also working on a PyTorch implementation of R-NET (https://github.com/akaitsuki-ii/r-net) and am currently suffering from severe overfitting during training. The loss on the dev set never goes below about 6.0 after several epochs and dev EM/F1 stays around 10+/20+, while the training loss goes down to 0.6 and EM/F1 reaches 80+/90+ on the training set. Did you run into such problems when training your models? Also, @matthew-z, my PyTorch implementation does not use that much GPU memory: a batch size of 32 is feasible on a single GTX 1070 (8GB). I just remove examples whose context has more than 300 tokens, as mentioned in some papers.
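For reference, the length filter described above is roughly the following; this is only a minimal sketch assuming each example is a dict with a tokenized `context_tokens` field (the field name is illustrative, not the repo's actual structure):

```python
# Hypothetical sketch: drop training examples whose tokenized context
# exceeds 300 tokens, as described above.
MAX_CONTEXT_LEN = 300

def filter_long_examples(examples, max_len=MAX_CONTEXT_LEN):
    """Keep only examples whose context has at most max_len tokens."""
    return [ex for ex in examples if len(ex["context_tokens"]) <= max_len]
```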
It might help to look at what exactly the model learns on the training set; I've always found it useful to investigate what it overfits to.
Hmm... I think that is a little hard to figure out. I tried mixing the official training and dev sets and re-splitting the questions into new training/dev sets (so many passages are shared between the two splits, while the questions/answers are different), and EM/F1 is quite good when training with this dataset. So I guess the model struggles to do well on unseen passages, i.e., it overfits to the training passages.
@akaitsuki-ii Yes, I believe 8GB of memory is enough after removing the long paragraphs; it covers most batches of size 32.
@akaitsuki-ii
I just had a glance at your code.
I guess you didn't freeze the pre-trained word embeddings, as nn.Parameter requires grad by default. Also, it seems that you didn't use multiple layers or biRNNs in the embedding, pair encoding, self-attention, and output layers, so the model capacity may be dominated by the embedding parameters instead of the RNN parameters. This could cause overfitting, as the dataset is relatively small.
```python
if pretrained_word_embeddings is not None:
    word_embeddings = nn.Parameter(torch.from_numpy(pretrained_word_embeddings))
```
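One way to freeze the pre-trained embeddings would be something like the sketch below; this is not from the repo, just an illustration (the second option assumes PyTorch >= 0.4):

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for the real pre-trained matrix (vocab_size x embed_dim).
pretrained_word_embeddings = np.random.randn(10000, 300).astype("float32")

# Option 1: keep the nn.Parameter but turn off gradients, so the optimizer
# never updates the pre-trained vectors.
word_embeddings = nn.Parameter(
    torch.from_numpy(pretrained_word_embeddings), requires_grad=False)

# Option 2 (PyTorch >= 0.4): build a frozen nn.Embedding directly.
embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained_word_embeddings), freeze=True)
```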
Second, if you pad the input, `h` may become the hidden state of a pad element, while what you want is the hidden state of the last valid element. Therefore, this part is not reliable:
```python
_, h = self._charRNN(cx, h0)
```
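One option for the padding issue is to pack the sequences before the RNN, so that the returned hidden state is taken at each sequence's last valid step. A minimal sketch with made-up dimensions (not the repo's actual code):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

batch, max_len, char_dim, hidden_dim = 4, 10, 8, 16
cx = torch.randn(batch, max_len, char_dim)   # zero-padded char embeddings
lengths = [10, 7, 5, 3]                      # true lengths, sorted descending (required by older PyTorch)
char_rnn = nn.GRU(char_dim, hidden_dim, batch_first=True)

packed = pack_padded_sequence(cx, lengths, batch_first=True)
_, h = char_rnn(packed)
# h (1, batch, hidden_dim) now holds the hidden state at each sequence's
# last valid token rather than at a padded position.
```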
I just raised a new issue #12 where we can discuss how to solve the regularization problem. So we can keep this issue closed :)
@matthew-z Thanks for your advice! It is a bug that I forgot to add a fixed_embed flag here (it exists on another branch); I will fix it now and start another experiment. As for the charRNN padding problem, I will re-check whether zero padding makes the result unreliable. @minsangkim142 Sorry for replying here again. If there is any update on the overfitting problem, I will post to https://github.com/minsangkim142/R-net/issues/12 .
Just want to share some ideas.
Currently, my PyTorch version only works with a very small batch size (about 20 examples/batch on an AWS P2 instance, whose GPU is slower than a 1080 but has 12GB VRAM).
Training speed is about 3 examples per second, so one epoch (80,000 examples) takes about 10 hours to train. I observed that increasing the batch size improves training speed considerably, but then I may hit an OOM error when the batch contains a long passage.
Some ideas to improve batch size and speed: