microsoft / Oscar

Oscar and VinVL
MIT License

Unable to Reproduce the Baseline results for NLVR2 task #130

Open ahnjaewoo opened 3 years ago

ahnjaewoo commented 3 years ago

We tried to reproduce the baseline results for the NLVR2 task, but our scores fall short of the reported numbers by a noticeable margin.

Hardware Specifications

Graphics card: Quadro RTX 6000
CUDA version: 10.1

Command Given

python oscar/run_nlvr.py \
    -j 4 \
    --img_feature_dim 2054 \
    --max_img_seq_length 40 \
    --data_dir vinvl/datasets/nlvr2 \
    --model_type bert \
    --model_name_or_path vinvl/model_ckpts/vqa/base/checkpoint-2000000 \
    --task_name nlvr \
    --do_lower_case \
    --max_seq_length 55 \
    --per_gpu_eval_batch_size 64 \
    --per_gpu_train_batch_size 72 \
    --learning_rate 3e-05 \
    --num_train_epochs 20 \
    --img_feature_type faster_r-cnn \
    --data_label_type all \
    --train_data_type all \
    --eval_data_type all \
    --loss_type xe \
    --save_epoch -1 \
    --seed 88 \
    --evaluate_during_training \
    --logging_steps -1 \
    --drop_out 0.3 \
    --do_train \
    --do_eval \
    --do_test \
    --weight_decay 0.05 \
    --classifier mlp \
    --cls_hidden_scale 3 \
    --num_choice 2 \
    --use_pair \
    --warmup_steps 10000

Evaluation Result (Dev set)

[
{"epoch": 0, "eval_score": 0.6765969636207391, "best_score": 0.6765969636207391},
{"epoch": 1, "eval_score": 0.7268690919507305, "best_score": 0.7268690919507305},
{"epoch": 2, "eval_score": 0.7399026067029505, "best_score": 0.7399026067029505},
{"epoch": 3, "eval_score": 0.7731309080492695, "best_score": 0.7731309080492695},
{"epoch": 4, "eval_score": 0.7857347464909767, "best_score": 0.7857347464909767},
{"epoch": 5, "eval_score": 0.7959037525064452, "best_score": 0.7959037525064452},
{"epoch": 6, "eval_score": 0.8027785734746491, "best_score": 0.8027785734746491},
{"epoch": 7, "eval_score": 0.8040676024061874, "best_score": 0.8040676024061874},
{"epoch": 8, "eval_score": 0.8037811515325122, "best_score": 0.8040676024061874},
{"epoch": 9, "eval_score": 0.8057863076482383, "best_score": 0.8057863076482383},
{"epoch": 10, "eval_score": 0.8014895445431108, "best_score": 0.8057863076482383},
{"epoch": 11, "eval_score": 0.8082211400744772, "best_score": 0.8082211400744772},
{"epoch": 12, "eval_score": 0.810369521627041, "best_score": 0.810369521627041},
{"epoch": 13, "eval_score": 0.8125179031796047, "best_score": 0.8125179031796047},
{"epoch": 14, "eval_score": 0.8089372672586651, "best_score": 0.8125179031796047},
{"epoch": 15, "eval_score": 0.812088226869092, "best_score": 0.8125179031796047},
{"epoch": 16, "eval_score": 0.8109424233743913, "best_score": 0.8125179031796047},
{"epoch": 17, "eval_score": 0.8077914637639645, "best_score": 0.8125179031796047},
{"epoch": 18, "eval_score": 0.8073617874534518, "best_score": 0.8125179031796047},
{"epoch": 19, "eval_score": 0.810369521627041, "best_score": 0.8125179031796047}
]

Evaluation Result (Test-P set)

{"best_score": 0.8169944021817138}

Our best score is 81.25 on the dev set (reached at epoch 13) and 81.70 on the Test-P set, whereas the reported baseline for this task is 82.05 on the dev set and 83.08 on the Test-P set. That leaves a gap of roughly 0.80 points on dev and 1.38 points on Test-P.
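
For anyone who wants to double-check the comparison, here is a minimal sketch of how the gap can be computed from the logged scores. It only uses values copied from this issue (a five-epoch excerpt of the dev log and the Test-P score); the helper names `best_entry` and `gap_in_points` are ours for illustration and are not part of the Oscar code base.

```python
# Dev-set entries copied from the log above (epochs 11-15 only, for brevity).
dev_log = [
    {"epoch": 11, "eval_score": 0.8082211400744772},
    {"epoch": 12, "eval_score": 0.810369521627041},
    {"epoch": 13, "eval_score": 0.8125179031796047},
    {"epoch": 14, "eval_score": 0.8089372672586651},
    {"epoch": 15, "eval_score": 0.812088226869092},
]
test_p_score = 0.8169944021817138

# Baseline numbers as quoted in this issue, in percent.
REPORTED = {"dev": 82.05, "test-p": 83.08}


def best_entry(log):
    """Return the log entry with the highest eval_score (hypothetical helper)."""
    return max(log, key=lambda entry: entry["eval_score"])


def gap_in_points(reported_pct, ours_fraction):
    """Difference in percentage points between a reported score (in percent)
    and our score (a fraction in [0, 1]), rounded to two decimals."""
    return round(reported_pct - 100.0 * ours_fraction, 2)


best = best_entry(dev_log)
dev_gap = gap_in_points(REPORTED["dev"], best["eval_score"])
test_gap = gap_in_points(REPORTED["test-p"], test_p_score)

print(f"best dev epoch: {best['epoch']}, score: {100 * best['eval_score']:.2f}")
print(f"gap to reported baseline: dev {dev_gap} points, test-p {test_gap} points")
```

The same two helpers can be pointed at the full 20-epoch log if it is loaded with `json.load`; the excerpt here is only to keep the example short.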

Thanks in advance!