microsoft / Oscar

Oscar and VinVL

Image caption task: reproduced results are weird #174

Open · liuheng92 opened this issue 2 years ago

liuheng92 commented 2 years ago

Model downloaded from: https://biglmdiag.blob.core.windows.net/oscar/exp/coco_caption/base/checkpoint.zip
Data downloaded from: https://biglmdiag.blob.core.windows.net/oscar/datasets/coco_caption.zip
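
For anyone reproducing this, a minimal Python sketch for fetching and unpacking the two archives (the target directories are placeholders, and since the archives are large a dedicated downloader such as azcopy may be more practical):

import urllib.request
import zipfile

# Placeholder target directories; point these wherever you keep models and data.
downloads = {
    "https://biglmdiag.blob.core.windows.net/oscar/exp/coco_caption/base/checkpoint.zip": "/data/model/Oscar",
    "https://biglmdiag.blob.core.windows.net/oscar/datasets/coco_caption.zip": "/data/image_cap",
}

for url, target_dir in downloads.items():
    archive, _ = urllib.request.urlretrieve(url)  # download to a temp file
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)  # unpack the checkpoint / dataset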

CUDA_VISIBLE_DEVICES=2 python oscar/run_captioning.py  --do_test  --do_eval  --test_yaml test.yaml  --per_gpu_eval_batch_size 64  --num_beams 5  --max_gen_length 20  --eval_model_dir /data/model/Oscar/checkpoint-29-66420/  --data_dir /data/image_cap/coco_caption/

results log

loading annotations into memory...
0:00:00.041951
creating index...
index created!
Loading and preparing results...     
DONE (t=0.01s)
creating index...
index created!
tokenization...
PTBTokenizer tokenized 307086 tokens at 1218855.64 tokens per second
Dec 10, 2021 11:49:34 AM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 〜 (U+301C, decimal: 12316)
PTBTokenizer tokenized 79431 tokens at 442892.03 tokens per second.
setting up scorers...
computing Bleu score...
{'testlen': 74351, 'reflen': 60492, 'guess': [74351, 69351, 64351, 59369], 'correct': [10, 0, 0, 0]}
ratio: 1.2291046749983265
Bleu_1: 0.000
Bleu_2: 0.000
Bleu_3: 0.000
Bleu_4: 0.000
computing METEOR score...
METEOR: 0.003
computing Rouge score...
ROUGE_L: 0.000
computing CIDEr score...
CIDEr: 0.000
computing SPICE score...
Parsing reference captions
Parsing test captions
SPICE evaluation took: 5.848 s
SPICE: 0.001
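
For reference, the near-zero BLEU follows directly from the stats dict in the log: only 10 of the 74,351 generated unigrams match a reference, and since ratio > 1 the brevity penalty is 1, so BLEU_1 is just that unigram precision. A minimal check:

# Recompute BLEU_1 from the stats dict printed by the scorer above.
stats = {'testlen': 74351, 'reflen': 60492,
         'guess': [74351, 69351, 64351, 59369],
         'correct': [10, 0, 0, 0]}

# Modified unigram precision: matching unigrams / generated unigrams.
bleu1 = stats['correct'][0] / stats['guess'][0]
print(f"BLEU_1 ~= {bleu1:.5f}")  # ~0.00013, which rounds to the reported 0.000

In other words, the generated captions share almost no vocabulary with the references, which points at broken generation rather than a misconfigured scorer.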

Did I do something wrong?
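
A quick way to tell whether the problem is in generation or in scoring is to print a few rows of the prediction TSV that run_captioning.py writes under the eval_model_dir (the filename below is a guess; adjust it to the .tsv that actually appears there):

import json

# Hypothetical prediction file name; use the .tsv actually produced
# under eval_model_dir by run_captioning.py.
pred_file = "/data/model/Oscar/checkpoint-29-66420/pred.test.beam5.max20.tsv"

with open(pred_file) as f:
    for i, line in enumerate(f):
        img_key, preds = line.rstrip("\n").split("\t")
        # Each row is expected to hold a JSON list like [{"caption": ..., "conf": ...}].
        print(img_key, json.loads(preds)[0]["caption"])
        if i == 4:  # five examples are enough for a sanity check
            break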

liuheng92 commented 2 years ago

When I fine-tuned the model with the command below, I got better results (B@4: 35.7, M: 29.5, C: 121.2, S: 22.4), but I still can't reproduce the results in the README.

CUDA_VISIBLE_DEVICES=2 python oscar/run_captioning.py --model_name_or_path /data/model/Oscar/checkpoint-29-66420 --do_train --do_lower_case --evaluate_during_training --add_od_labels --learning_rate 0.000005 --per_gpu_train_batch_size 64 --num_train_epochs 5 --save_steps 2000 --output_dir /data/model/Oscar/finetuning/ --train_yaml /data/data/image_cap/coco_caption/train.yaml --data_dir /data/data/image_cap/coco_caption/ --val_yaml /data/data/image_cap/coco_caption/val.yaml
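
As a rough sanity check on that schedule (assuming the standard Karpathy split, about 113,287 training images with 5 captions each; these counts are an assumption, not taken from the log):

# Back-of-the-envelope training length for the fine-tuning command above.
num_images = 113287        # Karpathy train split size (assumed)
captions_per_image = 5     # typical COCO caption count per image (assumed)
batch_size = 64            # --per_gpu_train_batch_size on one GPU
epochs = 5                 # --num_train_epochs

steps_per_epoch = num_images * captions_per_image // batch_size  # ~8,850
total_steps = steps_per_epoch * epochs                           # ~44,250
print(total_steps // 2000)  # --save_steps 2000 -> about 22 checkpoints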