google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Has anyone reproduced the SQuAD 1.1 score (90.2/83.2) on albert-base V2? #132

Closed: YJYJLee closed this issue 4 years ago

YJYJLee commented 4 years ago

Hi, I downloaded the pre-trained ALBERT base V2 model from the link in README.md and tried to fine-tune it on the SQuAD 1.1 dataset without using the albert hub module. However, I got f1=16.14 and exact match=7.34 as my final result, which is significantly lower than the scores (90.2/83.2) reported in README.md.

Here is the command that I used for fine-tuning:

python -m run_squad_v1 \
  --albert_config_file="${ALBERT_ROOT}/albert_config.json" \
  --output_dir=./output_base_v2/SQUAD \
  --train_file="$SQUAD_DIR/train-v1.1.json" \
  --predict_file="$SQUAD_DIR/dev-v1.1.json" \
  --train_feature_file="$SQUAD_DIR/train.tfrecord" \
  --predict_feature_file="$SQUAD_DIR/dev.tfrecord" \
  --predict_feature_left_file="$SQUAD_DIR/pred_left_file.pkl" \
  --init_checkpoint="" \
  --spm_model_file="${ALBERT_ROOT}/30k-clean.model" \
  --do_lower_case \
  --max_seq_length=384 \
  --doc_stride=128 \
  --max_query_length=64 \
  --do_train=true \
  --do_predict=true \
  --train_batch_size=48 \
  --predict_batch_size=8 \
  --learning_rate=5e-5 \
  --num_train_epochs=2.0 \
  --warmup_proportion=.1 \
  --save_checkpoints_steps=5000 \
  --n_best_size=20 \
  --max_answer_length=30

seongwook-ham commented 4 years ago

You should set init_checkpoint to ${ALBERT_ROOT}/<checkpoint_file_name>. With the setting above, you are not using the pretrained model at all; you are training the model on SQuAD from scratch.
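Concretely, that means replacing the empty --init_checkpoint flag in the command above with the path to the downloaded checkpoint. A minimal sketch; model.ckpt-best is the checkpoint file name confirmed in the follow-up below:

  --init_checkpoint="${ALBERT_ROOT}/model.ckpt-best" \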

YJYJLee commented 4 years ago

Thank you so much! I fixed the problem.

I got an F1 score of 90.29634675 on SQuAD 1.1. This is the final command for fine-tuning on SQuAD:

python -m run_squad_v1 \
  --albert_config_file="${ALBERT_ROOT}/albert_config.json" \
  --output_dir=./output_base_v2/SQUAD \
  --train_file="/home/yejin/squad_data_v1.1/train-v1.1.json" \
  --predict_file="/home/yejin/squad_data_v1.1/dev-v1.1.json" \
  --train_feature_file="/home/yejin/squad_data_v1.1/train.tfrecord" \
  --predict_feature_file="/home/yejin/squad_data_v1.1/dev.tfrecord" \
  --predict_feature_left_file="/home/yejin/squad_data_v1.1/pred_left_file.pkl" \
  --init_checkpoint="${ALBERT_ROOT}/model.ckpt-best" \
  --spm_model_file="${ALBERT_ROOT}/30k-clean.model" \
  --do_lower_case \
  --max_seq_length=384 \
  --doc_stride=128 \
  --max_query_length=64 \
  --do_train=true \
  --do_predict=true \
  --train_batch_size=48 \
  --predict_batch_size=8 \
  --learning_rate=5e-5 \
  --num_train_epochs=2.0 \
  --warmup_proportion=.1 \
  --save_checkpoints_steps=5000 \
  --n_best_size=20 \
  --max_answer_length=30

wenhuchen commented 4 years ago

@YJYJLee May I ask what the memory size of your GPU is? My RTX with 24GB of memory can only train the model with batch_size=16, which leads to a compromised F1 of 77, significantly lower than yours.

YuHengKit commented 4 years ago

@YJYJLee Is it possible to just evaluate the model without training it on the SQuAD dataset?

YJYJLee commented 4 years ago

@wenhuchen Sorry for the late response. My GPU is a Quadro GV100 with 32GB of memory.

YJYJLee commented 4 years ago

@YuHengKit You can evaluate the model, but it will produce terrible results without training. You need to fine-tune the model with the above script to get the 90.29634675 F1 score.
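For a prediction-only run, the same script can be invoked with training disabled and init_checkpoint pointed at an already fine-tuned checkpoint. A minimal sketch, assuming a fine-tuned checkpoint named model.ckpt-best exists under the output directory (the exact checkpoint name is an assumption; substitute whatever your training run saved):

python -m run_squad_v1 \
  --albert_config_file="${ALBERT_ROOT}/albert_config.json" \
  --output_dir=./output_base_v2/SQUAD \
  --predict_file="$SQUAD_DIR/dev-v1.1.json" \
  --predict_feature_file="$SQUAD_DIR/dev.tfrecord" \
  --predict_feature_left_file="$SQUAD_DIR/pred_left_file.pkl" \
  --init_checkpoint="./output_base_v2/SQUAD/model.ckpt-best" \
  --spm_model_file="${ALBERT_ROOT}/30k-clean.model" \
  --do_lower_case \
  --max_seq_length=384 \
  --doc_stride=128 \
  --max_query_length=64 \
  --do_train=false \
  --do_predict=true \
  --predict_batch_size=8 \
  --n_best_size=20 \
  --max_answer_length=30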

YuHengKit commented 4 years ago

@YJYJLee Thanks for the reply. Did you try SQuAD v2 with the same configuration? I tried on Colab with a TPU and obtained only f1=75.19.

YJYJLee commented 4 years ago

I fine-tuned SQuAD v2 with the script below and got 84.420.

python3 -m run_squad_v2 \
  --albert_config_file="${ALBERT_ROOT}/albert_config.json" \
  --output_dir="gs://albert_finetuning/SQUAD2_large_v2_output" \
  --train_file="/home/who/squad_data_v2.0/train-v2.0.json" \
  --predict_file="/home/who/squad_data_v2.0/dev-v2.0.json" \
  --train_feature_file="gs://albert_finetuning/squad_data_v2.0/albert_large_v2/train.tfrecord" \
  --predict_feature_file="gs://albert_finetuning/squad_data_v2.0/albert_large_v2/dev.tfrecord" \
  --predict_feature_left_file="gs://albert_finetuning/squad_data_v2.0/albert_large_v2/pred_left_file.pkl" \
  --init_checkpoint="gs://albert_finetuning/albert_large/model.ckpt-best" \
  --spm_model_file="${ALBERT_ROOT}/30k-clean.model" \
  --do_lower_case \
  --task_name="squad2" \
  --max_seq_length=512 \
  --doc_stride=128 \
  --max_query_length=64 \
  --do_train \
  --do_predict \
  --train_batch_size=48 \
  --predict_batch_size=8 \
  --learning_rate=5e-5 \
  --num_train_epochs=2.0 \
  --warmup_proportion=.1 \
  --save_checkpoints_steps=5000 \
  --n_best_size=20 \
  --max_answer_length=30 \
  --use_tpu \
  --tpu_name=$tpu_name \
  --num_tpu_cores=8 \
  --tpu_zone=us-central1-b
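As a sanity check on that number, the predictions can also be scored with the official SQuAD v2 evaluation script (evaluate-v2.0.py from the SQuAD website). A sketch, assuming the fine-tuning run wrote predictions.json and null_odds.json into the output directory, which is how the BERT-style SQuAD scripts typically name their outputs:

python3 evaluate-v2.0.py dev-v2.0.json predictions.json --na-prob-file null_odds.json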