localminimum / R-net

A Tensorflow Implementation of R-net: Machine reading comprehension with self matching networks
MIT License

I am working on a PyTorch version #3

Closed matthew-z closed 7 years ago

matthew-z commented 7 years ago

Just want to share some ideas.

Currently, my PyTorch version only works with a very small batch size (about 20 examples/batch on an AWS P2 instance, whose GPU is slower than a GTX 1080 but has 12GB of VRAM).

Training speed is about 3 examples per second, so one epoch (80,000 examples) takes about 10 hours. I observed that increasing the batch size improves training speed significantly, but then I may hit an OOM error whenever the batch contains a long passage.

Some ideas to improve batch size and speed:

  1. Dynamic batch size: reduce the batch size when the batch contains a long passage.
  2. Multi-GPU model: each GPU only needs to compute part of the model. I think this is easy to implement in Tensorflow.
  3. A more dynamic model: I tried my best to avoid padding, so the RNN does not unroll over useless pad elements.
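Idea 1 can be sketched as a token-budget batcher: cap the total padded tokens per batch instead of the number of examples, so a batch containing a long passage automatically shrinks. The function name and budget below are illustrative, not from either repo.

```python
# Sketch of idea 1: cap batch_size * longest_passage rather than
# batch_size itself. `make_batches` and `max_tokens` are hypothetical.

def make_batches(examples, max_tokens=3000):
    """Group examples so that len(batch) * longest_example <= max_tokens."""
    # Sort by length so the incoming example is always the longest,
    # which also keeps padding waste inside each batch small.
    examples = sorted(examples, key=len)
    batches, batch = [], []
    for ex in examples:
        # Padded cost of the batch if this (longest) example is added.
        if batch and (len(batch) + 1) * len(ex) > max_tokens:
            batches.append(batch)
            batch = []
        batch.append(ex)
    if batch:
        batches.append(batch)
    return batches
```

A lone example longer than the budget still forms its own batch of one, which is the degenerate case of "reduce the batch size for a long passage".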
ghost commented 7 years ago

Thanks for your input and work. While R-Net does require a lot of GPU memory, if you implement it correctly it is possible to fit the whole model on a single GPU with a batch size of 50 or larger. Here are some implementation tips:

  1. Check the shape of the output at each layer (especially the attention-matching layers, as they are tricky to implement on top of gated recurrent units (GRUs)).
  2. Model parallelism is a very difficult topic, and most multi-GPU training is done with data parallelism. For R-Net, I think there is more to gain from using multiple GPUs in a "data-parallel" manner rather than a "model-parallel" one. R-Net can definitely fit on a single GPU (8GB).
  3. I'm not too sure about PyTorch, but in Tensorflow I used variable-length input to each RNN, so each training example only runs up to the length of its passage and the zero-padded elements are never computed.
  4. Keep in mind the shared weights. In the original paper, some of the weights are shared across layers.
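The data-parallel approach in tip 2 has a ready-made wrapper on the PyTorch side of this discussion. A minimal sketch with a toy stand-in model (not the actual R-Net):

```python
import torch
import torch.nn as nn

# Toy stand-in model; nn.DataParallel replicates it on each GPU and
# splits the input batch along dim 0, gathering the outputs afterwards.
model = nn.Sequential(nn.Linear(100, 75), nn.ReLU(), nn.Linear(75, 2))

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # data parallelism, not model parallelism
if torch.cuda.is_available():
    model = model.cuda()

x = torch.randn(32, 100)
if torch.cuda.is_available():
    x = x.cuda()
out = model(x)  # shape (32, 2) regardless of how many GPUs were used
```

On a single GPU (or CPU) the wrapper is skipped and the model runs unchanged, so the same script works in both settings.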
matthew-z commented 7 years ago

Thank you for the comments!

akaitsuki-ii commented 7 years ago

Hello, @minsangkim142 and @matthew-z. Sorry to comment under this closed issue. I am also working on an implementation of R-NET using PyTorch (https://github.com/akaitsuki-ii/r-net), and now I am suffering from terrible overfitting. The loss on the dev set never goes below about 6.0 after several epochs and EM/F1 stays around 10+/20+, while the training loss goes down to 0.6 and EM/F1 reaches 80+/90+ on the training set. Have you met such problems when training your models? What's more, @matthew-z, my PyTorch implementation does not cost so much GPU memory: a batch size of 32 is feasible on a single GTX 1070 (8GB). I just remove examples whose context has more than 300 tokens, as mentioned in some papers.

theSage21 commented 7 years ago

It might help to investigate what exactly the model learns on the training set. I've always found it useful to investigate what it overfits to.

akaitsuki-ii commented 7 years ago

Emmm... I think that is a little bit hard to figure out. I tried mixing the official training and dev sets and re-splitting the questions into new training/dev sets (so many passages are shared between training and dev, while the questions/answers differ), and EM/F1 was quite good when training on this dataset. So I guess the model finds it hard to do well on unseen passages; in other words, it overfits to the training passages.

matthew-z commented 7 years ago

@akaitsuki-ii Yes, I believe 8GB of memory would be enough after removing the long paragraphs; it suffices for most batches (size = 32).

matthew-z commented 7 years ago

@akaitsuki-ii

I just had a glance at your code.

I guess you didn't freeze the pre-trained word embedding, as nn.Parameter requires grad by default. Also, it seems you didn't use multiple layers or bidirectional RNNs in the embedding, pair encoding, self-attention, and output layers, so the model capacity may be dominated by the embedding parameters rather than the RNNs' parameters. This could cause overfitting, as the dataset is relatively small.

    if pretrained_word_embeddings is not None:
        word_embeddings = nn.Parameter(torch.from_numpy(pretrained_word_embeddings))
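A minimal sketch of the fix: nn.Parameter accepts requires_grad=False, and newer PyTorch versions can also build a frozen nn.Embedding directly. The array below is a stand-in for real GloVe vectors.

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for loaded pre-trained vectors; shape and dtype are illustrative.
pretrained_word_embeddings = np.zeros((1000, 300), dtype=np.float32)

# nn.Parameter defaults to requires_grad=True; pass False to freeze it.
word_embeddings = nn.Parameter(
    torch.from_numpy(pretrained_word_embeddings), requires_grad=False)

# Equivalent at the module level (PyTorch >= 0.4):
embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained_word_embeddings), freeze=True)
```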

Second, if you pad the input, h may become the hidden state of a pad element, while what you want is the hidden state of the last valid element. Therefore, this part is not reliable:

 _, h = self._charRNN(cx, h0)
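One way to make that reliable in PyTorch is to pack the padded batch so the RNN stops at each sequence's true length; then h really is the hidden state of the last valid element. A sketch with a plain GRU (names and sizes are illustrative, not from the repo):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
lengths = torch.tensor([5, 3])   # true lengths, sorted descending
cx = torch.randn(2, 5, 8)        # (batch, max_len, embed_dim)
cx[1, 3:] = 0.0                  # zero padding on the second example

gru = nn.GRU(8, 16, batch_first=True)

# Packing tells the GRU the true lengths, so h holds the state at the
# last valid step of each sequence, not at the pad positions.
packed = pack_padded_sequence(cx, lengths, batch_first=True)
_, h = gru(packed)               # h: (num_layers, batch, hidden)

# Sanity check: matches running the short sequence without its padding.
_, h1 = gru(cx[1:2, :3])
assert torch.allclose(h[:, 1], h1[:, 0], atol=1e-5)
```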
ghost commented 7 years ago

I just raised a new issue #12 where we can discuss how to solve the regularization problem. So we can keep this issue closed :)

akaitsuki-ii commented 7 years ago

@matthew-z Thanks for your advice! It is a bug: I forgot to add the fixed_embed flag here (it exists on another branch). I will fix it and start another experiment. As for the charRNN padding problem, I will re-check whether zero padding makes the result unreliable. @minsangkim142 Sorry for replying here again. If there is any update on the overfitting problem, I will go to https://github.com/minsangkim142/R-net/issues/12.