seonghyeonye / TAPP

[AAAI 2024] Investigating the Effectiveness of Task-Agnostic Prefix Prompt for Instruction Following
https://arxiv.org/abs/2302.14691
MIT License

Why are my opt-6.7B results like this? #4

Open li-aolong opened 10 months ago

li-aolong commented 10 months ago

This is my script:

CUDA_VISIBLE_DEVICES=2 python src/run_autoreg.py \
    --do_predict \
    --predict_with_generate \
    --evaluation_strategy "no" \
    --model_name_or_path "facebook/opt-6.7b" \
    --max_source_length 1024 \
    --max_target_length 64 \
    --generation_max_length 64 \
    --max_num_instances_per_task 100 \
    --max_num_instances_per_eval_task 100 \
    --add_task_name False \
    --add_task_definition True \
    --num_pos_examples 0 \
    --num_neg_examples 0 \
    --add_explanation False \
    --tk_instruct False \
    --data_dir data/splits/default \
    --task_dir data/tasks \
    --output_dir $output_dir \
    --overwrite_output_dir \
    --cache_dir ./cache/ \
    --per_device_eval_batch_size 16 \
    --icil True \
    --demo_path demos/ICIL/ICIL_seed1.json \
    --report_to tensorboard \
    --adaptive True

And this is my result:

======== Overall Metrics ========
all_rougeL 18.3453
all_EM 13.7172

======== Metrics per category ========
exact_match_for_textual_entailment 25.75
exact_match_for_cause_effect_classification 32.4286
exact_match_for_coreference_resolution 13.2143
exact_match_for_dialogue_act_recognition 17.7143
exact_match_for_answerability_classification 26.6923
exact_match_for_word_analogy 1.5
rougeL_for_overlap_extraction 11.9681
rougeL_for_keyword_tagging 11.9758
rougeL_for_question_rewriting 2.8998
rougeL_for_title_generation 8.5752
rougeL_for_data_to_text 12.3683
rougeL_for_grammar_error_correction 19.3007

However, Figure 1 in the paper reports an ICIL result of 25.75 for OPT-6.7B. How was this value of 25.75 obtained? Should ROUGE-L and EM be averaged?

seonghyeonye commented 10 months ago

Hi, we found that the uploaded code had a slight issue with input processing (a blank space was included in the input string, which led to performance degradation).
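
For a rough illustration of the kind of fix involved (a minimal sketch only; the function and variable names below are placeholders, not the actual repository code):

    # Sketch of prompt construction; names are placeholders, not the repo's actual code.
    def build_prompt(demos, definition, task_input):
        # Join the ICIL demonstrations, the task definition, and the current input.
        parts = list(demos) + [definition, task_input]
        # Strip each piece so no leading/trailing blank space leaks into the prompt;
        # a stray space before the input string can degrade generation quality.
        return "\n\n".join(p.strip() for p in parts)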

Sorry for the inconvenience; we have updated the code. With the revised code, you should get an all_rougeL value of 25.59 and an all_EM value of 17.14 for the first seed (ICIL_seed1.json). Note that the result in Figure 1 of the paper is the average of the individual scores of all tasks across three different seeds.
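
Roughly, that aggregation looks like the following (a sketch only; the metric-file names and layout below are placeholders, not the repository's actual output format):

    import json
    import statistics

    # Hypothetical per-seed metric dumps containing per-category scores like those printed above.
    seed_files = ["ICIL_seed1_metrics.json", "ICIL_seed2_metrics.json", "ICIL_seed3_metrics.json"]

    per_seed_means = []
    for path in seed_files:
        with open(path) as f:
            metrics = json.load(f)
        # Keep only the per-category scores (EM for classification, ROUGE-L for generation).
        scores = [v for k, v in metrics.items()
                  if k.startswith("exact_match_for_") or k.startswith("rougeL_for_")]
        per_seed_means.append(statistics.mean(scores))

    # Figure 1 reports the mean over the three seeds.
    print(round(statistics.mean(per_seed_means), 2))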

Let us know if there are additional issues.