[Open] janepan9917 opened this issue 3 years ago
Hope you've been having a great day!
I've been trying to replicate the results of your paper "How Optimal is Greedy Decoding for Extractive Question Answering?", specifically the performance using T5-base and regular greedy decoding. I have not modified the code. I'm noticing higher performance than what is reported in the paper's appendix. I've attached a photo of the results we found, showing the average ± standard deviation of the F1 scores across random seeds 0-4. The numbers in red denote the experiments where the paper's result fell more than three standard deviations below our replicated result.
I was wondering whether I'm doing something wrong, possibly in my hyperparameter choices. Below is the command with the hyperparameters I'm using:
python src/model.py \
  --batch_size=2 \
  --accumulate_grad_batches=16 \
  --gpus=1 \
  --seed=0 \
  --splinter_data='./data' \
  --cache_dir='./cache' \
  --tags=comma_sep_nuptune_tags \
  --max_steps=256 \
  --optimizer='adafactor_const' \
  --check_val_every_n_steps=0 \
  --train_samples=${train_samples} \
  --exp_name="${dataset}_train_samples_${train_samples}" \
  --val_batch_size=16 \
  --model_name=google/t5-v1_1-base \
  --tokenizer=google/t5-v1_1-base \
  --log_every_n_steps=16 \
  --lr=5e-05 \
  --pattern='Text: <context> Question: <question> Answer:<mask>.' \
  --dataset=${dataset} \
  --check_val_every_n_epoch=9999 \
  --results_path="results/results_${dataset}_train_samples_${train_samples}.jsonl" \
  --test_samples=-1 \
  --num_nodes=1 \
  --decode_greedy
I really appreciate your help!

Hi @janepan9917, thanks for showing interest in the project!
For base I used a different configuration: 8 GPUs, batch size 2, DDP (for multi-GPU training), accumulate_grad_batches 2 for >16 train samples, and, most importantly, 512 steps rather than your 256. The improved results you're getting might suggest that the base model overfits with 512 steps, which does not happen (or happens to a lesser extent) with 256 steps. When performing the hyperparameter search for T5-base (as for T5-large), I used the SQuAD dataset only, for methodological reasons, as described in Section 4 of the paper. It is very likely that there is a lot of room for improvement in T5-base performance on this few-shot task; this was in fact demonstrated in other related work.
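For reference, below is a rough sketch of the command above adapted to the configuration described in this reply (8 GPUs, batch size 2, DDP, accumulate_grad_batches 2, 512 steps). All flags are copied from the original command except the changed settings; the `--accelerator=ddp` flag is an assumption about how this repo exposes PyTorch Lightning's DDP mode and may have a different name, or DDP may be selected automatically when --gpus > 1.

# Sketch of the reply's configuration; --accelerator=ddp is an assumed flag name.
# Effective batch size per optimizer step stays at 32: 8 GPUs x batch_size 2 x accumulate_grad_batches 2.
python src/model.py \
  --batch_size=2 \
  --accumulate_grad_batches=2 \
  --gpus=8 \
  --accelerator=ddp \
  --seed=0 \
  --splinter_data='./data' \
  --cache_dir='./cache' \
  --tags=comma_sep_nuptune_tags \
  --max_steps=512 \
  --optimizer='adafactor_const' \
  --check_val_every_n_steps=0 \
  --train_samples=${train_samples} \
  --exp_name="${dataset}_train_samples_${train_samples}" \
  --val_batch_size=16 \
  --model_name=google/t5-v1_1-base \
  --tokenizer=google/t5-v1_1-base \
  --log_every_n_steps=16 \
  --lr=5e-05 \
  --pattern='Text: <context> Question: <question> Answer:<mask>.' \
  --dataset=${dataset} \
  --check_val_every_n_epoch=9999 \
  --results_path="results/results_${dataset}_train_samples_${train_samples}.jsonl" \
  --test_samples=-1 \
  --num_nodes=1 \
  --decode_greedy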