google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Impossible to reproduce GLUE result for MNLI-m. Did I do something wrong? #56

Closed · seongwook-ham closed this issue 4 years ago

seongwook-ham commented 4 years ago

When fine-tuning the ALBERT v2 base model on the MNLI-m task, dev accuracy is expected to be around 84.6, but I got a mean accuracy of 83.6 over 3 runs. I used the ALBERT v2 tar file to fine-tune on MNLI-m instead of the tf_hub module, running run_classifier.py with the following arguments, which match Appendix A.2 of the ALBERT paper:

--do_lower_case=True --max_seq_length=512 --optimizer=adamw --task_name=MNLI --warmup_step=1000 --learning_rate=3e-5 --train_step=10000 --save_checkpoints_steps=100 --train_batch_size=128 --use_tpu=True --do_train=True --do_eval=True --do_predict=False

Did I do something wrong? On SQuAD 1.1, SQuAD 2.0, and SST-2 I was able to reproduce the reported results.
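For anyone reproducing this, a full invocation might look like the sketch below. The quoted flags are the ones from the issue; the paths are placeholders, and the input/output flags (--data_dir, --output_dir, --init_checkpoint, --albert_config_file, --spm_model_file) are my assumption of what run_classifier.py additionally needs, not something stated in the issue:

```bash
# Sketch only: paths are placeholders, and flags beyond those quoted in the
# issue are assumptions about what run_classifier.py additionally requires.
# With --use_tpu=True you would also pass the TPU address flags (omitted here).
python -m albert.run_classifier \
  --data_dir=/path/to/glue/MNLI \
  --output_dir=/path/to/mnli_output \
  --init_checkpoint=/path/to/albert_base_v2/model.ckpt-best \
  --albert_config_file=/path/to/albert_base_v2/albert_config.json \
  --spm_model_file=/path/to/albert_base_v2/30k-clean.model \
  --task_name=MNLI \
  --do_lower_case=True \
  --max_seq_length=512 \
  --optimizer=adamw \
  --learning_rate=3e-5 \
  --warmup_step=1000 \
  --train_step=10000 \
  --save_checkpoints_steps=100 \
  --train_batch_size=128 \
  --use_tpu=True \
  --do_train=True \
  --do_eval=True \
  --do_predict=False
```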

Danny-Google commented 4 years ago

Do you use early stopping (evaluate a checkpoint every 100 steps and report the best result on dev)? If so, the number I reported may simply happen to be higher than the mean. I only ran it once and reported the best dev result across all saved checkpoints.
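Concretely, that selection protocol amounts to evaluating every saved checkpoint on dev and keeping the best one. A minimal offline sketch, assuming the standard model.ckpt-<step> file layout; the `evaluate` callable is a caller-supplied assumption, not something the repo exports:

```python
import glob
import os

def best_dev_checkpoint(output_dir, evaluate):
    """Return the saved checkpoint with the highest dev accuracy.

    `evaluate` maps a checkpoint prefix to a dev-set accuracy (e.g. by
    running run_classifier.py's eval step against that checkpoint); it is
    an assumption of this sketch, not part of the ALBERT repo.
    """
    # Checkpoints are saved as model.ckpt-<step>.{index,meta,data-*};
    # strip the file suffix so each step is evaluated exactly once.
    prefixes = sorted(set(
        path.rsplit(".", 1)[0]
        for path in glob.glob(os.path.join(output_dir, "model.ckpt-*"))
    ))
    scores = {ckpt: evaluate(ckpt) for ckpt in prefixes}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

With save_checkpoints_steps=100 and train_step=10000 this is up to 100 evaluations per run, which is why a best-of-run number can sit noticeably above the mean of several runs' final results.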

seongwook-ham commented 4 years ago

Yes, I use early stopping with save_checkpoints_steps=100.

MichaelZhouwang commented 4 years ago

I've followed the hyperparameters presented in the paper but only got an accuracy of ~83.0. Have you figured out how to get the reported result? Thanks!

Danny-Google commented 4 years ago

Can you try again with this tutorial (https://github.com/google-research/albert/blob/master/ALBERT_GLUE_fine_tuning_tutorial.ipynb)? I ran it yesterday and got 84.3. Please make sure you set max_seq_length to 512.