Why are my opt-6.7B results like this?

This is my script:

CUDA_VISIBLE_DEVICES=2 python src/run_autoreg.py \
    --do_predict \
    --predict_with_generate \
    --evaluation_strategy "no" \
    --model_name_or_path "facebook/opt-6.7b" \
    --max_source_length 1024 \
    --max_target_length 64 \
    --generation_max_length 64 \
    --max_num_instances_per_task 100 \
    --max_num_instances_per_eval_task 100 \
    --add_task_name False \
    --add_task_definition True \
    --num_pos_examples 0 \
    --num_neg_examples 0 \
    --add_explanation False \
    --tk_instruct False \
    --data_dir data/splits/default \
    --task_dir data/tasks \
    --output_dir $output_dir \
    --overwrite_output_dir \
    --cache_dir ./cache/ \
    --per_device_eval_batch_size 16 \
    --icil True \
    --demo_path demos/ICIL/ICIL_seed1.json \
    --report_to tensorboard \
    --adaptive True

And this is my result:

======== Overall Metrics ========
all_rougeL 18.3453
all_EM 13.7172

======== Metrics per category ========
exact_match_for_textual_entailment 25.75
exact_match_for_cause_effect_classification 32.4286
exact_match_for_coreference_resolution 13.2143
exact_match_for_dialogue_act_recognition 17.7143
exact_match_for_answerability_classification 26.6923
exact_match_for_word_analogy 1.5
rougeL_for_overlap_extraction 11.9681
rougeL_for_keyword_tagging 11.9758
rougeL_for_question_rewriting 2.8998
rougeL_for_title_generation 8.5752
rougeL_for_data_to_text 12.3683
rougeL_for_grammar_error_correction 19.3007

However, Figure 1 in the paper shows an ICIL result of 25.75 for opt-6.7B. How was this value of 25.75 obtained? Should Rouge-L and EM be averaged?

seonghyeonye / TAPP

Why are my opt-6.7B results like this? #4