Hi minsangkim1, thank you very much for sharing this work. Could you please share the hyperparameters used to get a dev EM of 55? In the params.py file, both zone_out and dropout are set to None, but you said you got EM 55 with zoneout. Thanks, Sathish
Hi @sathishreddy, shortly after writing this summary I obtained EM/F1 = 53/65 using the following hyperparameters. I used SRU (Simple Recurrent Unit) to reduce the number of parameters and speed up convergence. I also tried dropout = None, zoneout = 0.1, attn_size = 54, SRU = True, and obtained similar results (EM/F1 = 50/63) with less training time. Please do try it and let us know if you find a better set of hyperparameters.
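For anyone trying to reproduce the second run, the settings quoted above could be written down roughly as follows. This is only a sketch: the class name `Params` and the exact field names are assumptions for illustration and may not match the repo's params.py; only the four values come from the comment above.

```python
class Params:
    """Hypothetical settings container; names are illustrative, not the repo's actual params.py."""
    dropout = None   # no dropout in this run
    zoneout = 0.1    # zoneout probability
    attn_size = 54   # attention size quoted in the comment above
    SRU = True       # use the Simple Recurrent Unit cell


if __name__ == "__main__":
    # Print the settings so a run can be logged alongside its EM/F1 scores.
    for name in ("dropout", "zoneout", "attn_size", "SRU"):
        print(name, "=", getattr(Params, name))
```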
Thanks @minsangkim142.
Hi, I saw that you closed the issue. I was wondering if you found a better solution to this overfitting problem?
FYI: I also ended up with a similar score (about 55/65) with my PyTorch implementation after 8 epochs.
Hi @matthew-z, I closed this issue as I haven't received enough feedback on it. Ultimately I haven't gotten much better than EM/F1 = 55/67; beyond that point it seems difficult to get over the performance barrier. I believe some key implementation details are missing that are needed to achieve the performance reported in the original paper. Also keep in mind that papers competing in competitions (like SQuAD) are likely to omit some small implementation details that are essential to reproducing the original results. It is also possible that my implementation has bugs.
@minsangkim142
I see. Thanks a lot for your reply!
I will try to ask the authors next week (we are in the same building).
@matthew-z that would be awesome. Thanks!
This is probably going to be needlessly long and very confusing to read but here are some things I noticed while training this model.
Issue #3 raised the problem of overfitting, where the training EM/F1 is very high but the dev loss and EM/F1 are not improving. I also noticed that the model overfits by a large amount when no regularization is used. The figures below show how well the model fits the training set. However, when I run evaluation, the EM/F1 is only about 30/40.
I've found several ways to apply regularization in this model architecture.
The one below shows the difference between recurrent dropout and the zoneout technique.
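To make the comparison concrete, here is a minimal pure-Python sketch of the two regularizers applied to one recurrent step. It is not the code used in this repo. "Recurrent dropout" is taken here to mean inverted dropout on the new hidden state, which is an assumption, and the function names and the `rate` parameter are hypothetical. The main difference is that dropout zeroes a unit, while zoneout keeps the unit's value from the previous time step.

```python
import random


def zoneout(h_prev, h_new, rate, training=True, rng=random):
    """Zoneout: each hidden unit keeps its previous value with probability `rate`."""
    if not training:
        # At inference, use the expected value of the stochastic update.
        return [rate * p + (1.0 - rate) * n for p, n in zip(h_prev, h_new)]
    return [p if rng.random() < rate else n for p, n in zip(h_prev, h_new)]


def recurrent_dropout(h_new, rate, training=True, rng=random):
    """Inverted dropout on the new hidden state: each unit is zeroed with probability `rate`."""
    if not training or rate == 0.0:
        return list(h_new)
    keep = 1.0 - rate
    # Surviving units are scaled by 1/keep so the expected activation is unchanged.
    return [0.0 if rng.random() < rate else n / keep for n in h_new]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same masks
    h_prev = [1.0, 2.0, 3.0, 4.0]
    h_new = [10.0, 20.0, 30.0, 40.0]
    print("zoneout:          ", zoneout(h_prev, h_new, rate=0.1, rng=rng))
    print("recurrent dropout:", recurrent_dropout(h_new, rate=0.1, rng=rng))
```

With zoneout, every output unit equals either its old or its new value, so information from earlier time steps is carried forward unchanged through the "zoned out" units instead of being erased.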
If you have any suggestions, have run into similar problems, or found any other problems while training, please let me know, because every contribution helps me learn and improves this repo. :)