openai / finetune-transformer-lm

Code and model for the paper "Improving Language Understanding by Generative Pre-Training"
https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
MIT License

Cannot reproduce RACE score #11

Closed sugiyama-hiroaki closed 6 years ago

sugiyama-hiroaki commented 6 years ago

Hi,

We tried several settings based on your code and paper, but unfortunately we cannot reproduce the RACE score (the training loss decreases, but dev accuracy only reaches 0.26). Could you share any tips on the parameters or code modifications needed to reach the reported performance?

Thanks

hardik2396 commented 6 years ago

Any updates @sugiyama-hiroaki? I am also trying to reproduce the RACE results.

sugiyama-hiroaki commented 6 years ago

It works! I had a simple bug in converting the problems to tensor format.
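For anyone hitting the same class of bug: the paper formats each multiple-choice problem as one sequence per answer option, with the classifier reading the final token. A minimal sketch of that packing, assuming hypothetical special-token ids and helper names (the real code uses the repo's BPE vocabulary and `transform_*` functions):

```python
import numpy as np

# Hypothetical special-token ids; the real values come from the repo's encoder.
START, DELIM, CLF = 40478, 40479, 40480
N_CTX = 512

def race_to_tensor(passage_ids, question_ids, option_ids_list):
    """Pack one RACE problem into a (n_options, n_ctx) id tensor plus mask.

    Each option becomes its own sequence:
        [start] passage [delim] question + option [clf]
    The classifier head reads the hidden state at [clf], so truncation must
    drop passage tokens, never the tail containing question, option, and [clf].
    """
    n_opt = len(option_ids_list)
    x = np.zeros((n_opt, N_CTX), dtype=np.int32)
    mask = np.zeros((n_opt, N_CTX), dtype=np.float32)
    for i, opt in enumerate(option_ids_list):
        tail = question_ids + opt + [CLF]
        keep = N_CTX - len(tail) - 2          # leave room for [start] and [delim]
        seq = [START] + passage_ids[:keep] + [DELIM] + tail
        x[i, :len(seq)] = seq
        mask[i, :len(seq)] = 1.0
    return x, mask
```

A common bug here is truncating from the end of the sequence, which silently cuts off the `[clf]` token and leaves the classifier reading padding.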

hardik2396 commented 6 years ago

How much accuracy did you get on RACE @sugiyama-hiroaki? I am getting 53.

sugiyama-hiroaki commented 6 years ago

I got 57.2. This is slightly lower than the paper's score, but I think it's a reasonable level. In our experiments it depended on the batch size: we got 57.2 with a batch size of 32 and 54.8 with a batch size of 4.

hardik2396 commented 6 years ago

I am trying, but with so few GPUs I cannot reproduce the results. I would be very grateful if you could provide your pre-trained weights. @sugiyama-hiroaki

hardik2396 commented 6 years ago

Can you share the code for RACE? @sugiyama-hiroaki

sugiyama-hiroaki commented 6 years ago

I'm sorry, but I cannot share my weights because I rewrote the code to adapt it to RACE. Besides, I'm a company researcher, so I don't have permission to share my code.

If you don't have enough GPUs, you can still train with a large effective batch size by skipping the parameter update and accumulating gradients until the accumulated examples reach the target batch size.
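The trick described above is gradient accumulation. A minimal sketch on a toy linear model in NumPy (the repo's actual training loop is TensorFlow, so all names here are illustrative only):

```python
import numpy as np

# Gradient accumulation: 8 micro-batches of 4 give an effective batch of 32.
rng = np.random.default_rng(0)
w = np.zeros(3)                          # toy model parameters
lr, micro_bs, accum_steps = 0.1, 4, 8

grad_sum = np.zeros_like(w)
n_updates = 0
for step in range(16):                   # 16 micro-batches -> 2 updates
    X = rng.normal(size=(micro_bs, 3))
    y = X @ np.array([1.0, -2.0, 0.5])   # synthetic regression targets
    err = X @ w - y
    # Divide by the full effective batch so accumulated gradients average
    # exactly as if all 32 examples were processed at once.
    grad_sum += X.T @ err / (micro_bs * accum_steps)
    if (step + 1) % accum_steps == 0:
        w -= lr * grad_sum               # one update per effective batch
        grad_sum[:] = 0.0
        n_updates += 1
```

Since the loss gradient is linear in the examples, summing per-micro-batch gradients and stepping once is mathematically equivalent to one large-batch step, at the cost of a single micro-batch's worth of GPU memory.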

chengchingwen commented 6 years ago

@sugiyama-hiroaki Thanks for the tips! I'm also trying to reimplement the model for RACE, but I can't get above 35% accuracy on the dev and test sets (58% on the training set). I use batch_size=4 and leave everything else at the defaults. Could you tell me which hyperparameters you chose for training?

ap229997 commented 6 years ago

@sugiyama-hiroaki Can you tell whether the test-set scores fluctuated a lot or were stable (e.g., rising smoothly) across epochs? Also, how many epochs did you train the model for?