Open ahnjaewoo opened 3 years ago
We tried to reproduce the reported NLVR2 baseline, but our results fall short of it by a noticeable margin: about 0.8 points on the dev set and about 1.4 points on the test-P set.
Hardware Specifications
Graphics card: Quadro RTX 6000
CUDA version: 10.1
Command Given
python oscar/run_nlvr.py \
    -j 4 \
    --img_feature_dim 2054 \
    --max_img_seq_length 40 \
    --data_dir vinvl/datasets/nlvr2 \
    --model_type bert \
    --model_name_or_path vinvl/model_ckpts/vqa/base/checkpoint-2000000 \
    --task_name nlvr \
    --do_lower_case \
    --max_seq_length 55 \
    --per_gpu_eval_batch_size 64 \
    --per_gpu_train_batch_size 72 \
    --learning_rate 3e-05 \
    --num_train_epochs 20 \
    --img_feature_type faster_r-cnn \
    --data_label_type all \
    --train_data_type all \
    --eval_data_type all \
    --loss_type xe \
    --save_epoch -1 \
    --seed 88 \
    --evaluate_during_training \
    --logging_steps -1 \
    --drop_out 0.3 \
    --do_train \
    --do_eval \
    --do_test \
    --weight_decay 0.05 \
    --classifier mlp \
    --cls_hidden_scale 3 \
    --num_choice 2 \
    --use_pair \
    --warmup_steps 10000 \
Evaluation Result (Dev set)
[
  {"epoch": 0, "eval_score": 0.6765969636207391, "best_score": 0.6765969636207391},
  {"epoch": 1, "eval_score": 0.7268690919507305, "best_score": 0.7268690919507305},
  {"epoch": 2, "eval_score": 0.7399026067029505, "best_score": 0.7399026067029505},
  {"epoch": 3, "eval_score": 0.7731309080492695, "best_score": 0.7731309080492695},
  {"epoch": 4, "eval_score": 0.7857347464909767, "best_score": 0.7857347464909767},
  {"epoch": 5, "eval_score": 0.7959037525064452, "best_score": 0.7959037525064452},
  {"epoch": 6, "eval_score": 0.8027785734746491, "best_score": 0.8027785734746491},
  {"epoch": 7, "eval_score": 0.8040676024061874, "best_score": 0.8040676024061874},
  {"epoch": 8, "eval_score": 0.8037811515325122, "best_score": 0.8040676024061874},
  {"epoch": 9, "eval_score": 0.8057863076482383, "best_score": 0.8057863076482383},
  {"epoch": 10, "eval_score": 0.8014895445431108, "best_score": 0.8057863076482383},
  {"epoch": 11, "eval_score": 0.8082211400744772, "best_score": 0.8082211400744772},
  {"epoch": 12, "eval_score": 0.810369521627041, "best_score": 0.810369521627041},
  {"epoch": 13, "eval_score": 0.8125179031796047, "best_score": 0.8125179031796047},
  {"epoch": 14, "eval_score": 0.8089372672586651, "best_score": 0.8125179031796047},
  {"epoch": 15, "eval_score": 0.812088226869092, "best_score": 0.8125179031796047},
  {"epoch": 16, "eval_score": 0.8109424233743913, "best_score": 0.8125179031796047},
  {"epoch": 17, "eval_score": 0.8077914637639645, "best_score": 0.8125179031796047},
  {"epoch": 18, "eval_score": 0.8073617874534518, "best_score": 0.8125179031796047},
  {"epoch": 19, "eval_score": 0.810369521627041, "best_score": 0.8125179031796047}
]
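For anyone comparing runs, the per-epoch log can be summarized with a few lines of Python. This is only a sketch: `summarize` is a hypothetical helper name, not part of the Oscar code base, and the sample below contains just three entries copied from the dev-set log rather than the full output.

```python
import json


def summarize(log_json):
    """Return (epoch, eval_score) for the highest-scoring epoch in the log."""
    entries = json.loads(log_json)
    best = max(entries, key=lambda e: e["eval_score"])
    return best["epoch"], best["eval_score"]


# Three entries copied from the dev-set log (not the full log).
sample = """[
  {"epoch": 12, "eval_score": 0.810369521627041, "best_score": 0.810369521627041},
  {"epoch": 13, "eval_score": 0.8125179031796047, "best_score": 0.8125179031796047},
  {"epoch": 14, "eval_score": 0.8089372672586651, "best_score": 0.8125179031796047}
]"""

epoch, score = summarize(sample)
print(f"best epoch: {epoch}, dev accuracy: {score * 100:.2f}")
```

On the full log the same call points at epoch 13, after which the dev score stays flat for the remaining six epochs.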
Evaluation Result (Test-P set)
{"best_score": 0.8169944021817138}
Our best score is 81.25 on the dev set (reached at epoch 13) and 81.70 on the test-P set, whereas the reported baseline for this task is 82.05 on the dev set and 83.08 on the test-P set.
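The size of the gap can be checked with a small calculation. The sketch below assumes the logged scores are accuracies in the range 0 to 1 and the reported baseline numbers are percentages; `gap_in_points` is a hypothetical helper name used only for illustration.

```python
def gap_in_points(ours_fraction, reported_percent):
    """Gap between the reported baseline and our run, in percentage points."""
    return round(reported_percent - ours_fraction * 100, 2)


# Our scores are taken from the logs; baseline numbers are the reported ones.
dev_gap = gap_in_points(0.8125179031796047, 82.05)
test_p_gap = gap_in_points(0.8169944021817138, 83.08)

print(f"dev gap: {dev_gap} points, test-P gap: {test_p_gap} points")
```

This gives a gap of roughly 0.8 points on dev and 1.38 points on test-P.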
Thanks in advance!