localminimum / R-net

A TensorFlow implementation of R-NET: Machine Reading Comprehension with Self-matching Networks
MIT License

Overfitting? #12

Closed: ghost closed this issue 7 years ago

ghost commented 7 years ago

This is probably going to be long and a bit confusing to read, but here are some things I noticed while training this model.

Issue #3 raised the problem of overfitting: the training EM/F1 gets very high while the dev loss and EM/F1 stop improving. I also noticed that the model overfits by a large margin when no regularization is used. The first screenshot below shows how well the model fits the training set; however, when I evaluate on the dev set, the EM/F1 is only about 30/40.

[screenshot from 2017-10-16 10-52-20: training curves]

I've found several ways to apply regularization in this model architecture:

  1. Normal feed-forward dropout between RNN layers, as suggested in the paper: doesn't work. The model doesn't train well and keeps a high training loss (around 2); dev set EM/F1 is still around 30/40.
  2. Recurrent dropout within RNN cells: there are several ways to do this (at least 3 verified ones). I tried two of them, described in A Theoretically Grounded Application of Dropout in Recurrent Neural Networks and Recurrent Dropout without Memory Loss. The first approach, sampling a dropout mask once and applying it to the hidden-state connection between time steps, doesn't work at all; the model learns nothing, possibly because of the large memory loss over long RNN sequences. The second approach works better than both feed-forward dropout and the first recurrent-state dropout, but still not well enough (see the second figure). Dev set EM/F1 is about 40/50.
  3. Zoneout (Regularizing RNNs by Randomly Preserving Hidden Activations): while searching for better regularization I came across this paper from MILA, co-authored by Yoshua Bengio. I implemented it here and it works remarkably better than any other regularization technique I tried, though it is still not quite there. I also decreased the hidden state size from 75 to 68 (see the second figure; a sketch of a zoneout wrapper is included further down in this comment). Dev set EM/F1 is about 50/60.
  4. Maybe something is wrong with the data? After finding that even zoneout doesn't fully solve the problem, I read Google's paper Understanding Deep Learning Requires Rethinking Generalization. It shows that, given enough parameters (more parameters than training examples), a network can fit any training data no matter how noisy it is; you can even use random noise as the training set and still reach zero training error (although dev and test performance will be no better than random guessing). This made me think that maybe my data labels are wrong (raised in issue #11), so I will work on fixing that now.

The screenshot below shows the difference between recurrent dropout and zoneout.

[screenshot from 2017-10-16 11-17-09: recurrent dropout vs. zoneout]
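For reference, here is a minimal TF 1.x sketch of the zoneout idea from point 3. It is an illustration of the technique rather than the exact code in this repo; the ZoneoutWrapper class and its details are just a sketch for cells whose state is a single tensor (e.g. GRU).

```python
import tensorflow as tf

class ZoneoutWrapper(tf.nn.rnn_cell.RNNCell):
    """Zoneout (Krueger et al. 2016): with probability `zoneout_prob`, each
    unit of the new hidden state is replaced by its value from the previous
    time step. Written for cells whose state is a single tensor (e.g. GRU)."""

    def __init__(self, cell, zoneout_prob, is_training=True):
        super(ZoneoutWrapper, self).__init__()
        self._cell = cell
        self._zoneout_prob = zoneout_prob
        self._is_training = is_training

    @property
    def state_size(self):
        return self._cell.state_size

    @property
    def output_size(self):
        return self._cell.output_size

    def __call__(self, inputs, state, scope=None):
        _, new_state = self._cell(inputs, state)
        if self._is_training:
            # Bernoulli mask: 1 -> keep the previous state, 0 -> take the update.
            keep_old = tf.cast(
                tf.random_uniform(tf.shape(new_state)) < self._zoneout_prob,
                tf.float32)
            new_state = keep_old * state + (1.0 - keep_old) * new_state
        else:
            # At test time, mix old and new states by their expected proportions.
            new_state = (self._zoneout_prob * state
                         + (1.0 - self._zoneout_prob) * new_state)
        # For a GRU the output is the hidden state, so return the zoned-out state.
        return new_state, new_state

# Example: a 75-unit GRU encoder cell with 10% zoneout during training.
cell = ZoneoutWrapper(tf.nn.rnn_cell.GRUCell(75), zoneout_prob=0.1, is_training=True)
```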

If you have any suggestions, similar problems, or anything else you noticed while training, please let me know; every contribution helps me learn and improves this repo. :)

sathishreddy commented 7 years ago

Hi @minsangkim142, thank you very much for sharing this work. Could you please share the hyperparameters you used to get a dev EM of 55? In params.py both zone_out and dropout are set to None, but you said you got EM 55 with zoneout. Thanks, Sathish

ghost commented 7 years ago

Hi @sathishreddy, shortly after writing this summary I obtained EM/F1 = 53/65 using SRU (Simple Recurrent Unit) cells, which reduce the number of parameters and speed up convergence. I also tried dropout = None, zoneout = 0.1, attn_size = 54, SRU = True and obtained similar results (EM/F1 = 50/63) with less training time. Please do try these and let us know if you find a better set of hyperparameters.
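For anyone trying to reproduce this, the second configuration above might look roughly like the snippet below in a params.py-style config. The flag names here are assumptions based on this thread and may not match the actual file exactly.

```python
# Hypothetical excerpt of a params.py-style config for the second run above
# (EM/F1 ~= 50/63). The real flag names and defaults in the repo may differ.
class Params:
    dropout = None    # no feed-forward dropout between layers
    zoneout = 0.1     # zoneout probability on the recurrent hidden state
    attn_size = 54    # attention size
    SRU = True        # use Simple Recurrent Units instead of GRUs
```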

sathishreddy commented 7 years ago

Thanks @minsangkim142.

matthew-z commented 7 years ago

Hi, I saw that you closed the issue. I was wondering if you have found a better solution to this overfitting problem?

FYI: I also ended up with a similar score (about 55/65) with my PyTorch implementation after 8 epochs.

ghost commented 7 years ago

Hi @matthew-z, I closed this issue because I hadn't received enough feedback on it. Ultimately I haven't gotten much better than EM/F1 = 55/67, and beyond that it seems difficult to break through the performance barrier. I believe some key implementation detail is missing that would be needed to reach the performance reported in the original paper. Also keep in mind that papers competing in benchmarks like SQuAD are likely to omit small implementation details that are essential to reproducing the original results. It is also possible that my implementation has bugs.

matthew-z commented 7 years ago

@minsangkim142

I see. Thanks a lot for your reply!

I will try to ask the authors next week (we are in the same building).

ghost commented 7 years ago

@matthew-z that would be awesome. Thanks!