This is the source code for the paper GPTScore: Evaluate as You Desire.

GPTScore is a novel evaluation framework that utilizes the emergent abilities (e.g., zero-shot instruction) of Generative Pre-Trained models to Score generated texts. The GPTScore evaluation framework supports multifaceted, customized, and training-free evaluation.

To design GPTScore, we explored 19 pre-trained language models (PLMs) ranging in size from 80M (FLAN-T5-small) to 175B (GPT3). The PLMs studied in this paper are listed below:
| Model | Parameter | Evaluator Name | Model | Parameter | Evaluator Name |
|---|---|---|---|---|---|
| GPT3 | | | OPT | | |
| text-ada-001 | 350M | gpt3_score | OPT350M | 350M | opt350m_score |
| text-babbage-001 | 1.3B | gpt3_score | OPT-1.3B | 1.3B | opt1_3B_score |
| text-curie-001 | 6.7B | gpt3_score | OPT-6.7B | 6.7B | opt6_7B_score |
| text-davinci-001 | 175B | gpt3_score | OPT-13B | 13B | opt13B_score |
| text-davinci-003 | 175B | gpt3_score | OPT-66B | 66B | opt66B_score |
| FLAN-T5 | | | GPT2 | | |
| FT5-small | 80M | flan_small_score | GPT2-M | 355M | gpt2_medium_score |
| FT5-base | 250M | flan_base_score | GPT2-L | 774M | gpt2_large_score |
| FT5-L | 770M | flan_large_score | GPT2-XL | 1.5B | gpt2_xl_score |
| FT5-XL | 3B | flan_xl_score | GPT-J-6B | 6B | gptJ6B_score |
| FT5-XXL | 11B | flan_xxl_score | | | |
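
All of these evaluators share the same underlying idea: a hypothesis is scored by how likely the PLM is to generate it when conditioned on an evaluation prompt. The snippet below is a minimal sketch of that idea for a decoder-only PLM using Hugging Face transformers; the `gpt2` checkpoint and the prompt wording are placeholders for illustration, not the templates used in this repository.

```python
# Minimal sketch (not the repo's implementation): score a hypothesis by its
# average token log-likelihood under a causal LM, conditioned on a prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def avg_loglik(model, tokenizer, prefix: str, hypothesis: str) -> float:
    """Average log p(hypothesis tokens | prefix) under a causal LM.
    Assumes the prefix tokenization is unchanged when the hypothesis is appended."""
    n_prefix = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prefix + hypothesis, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                     # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], -1)  # predicts tokens 1..L-1
    token_lp = log_probs.gather(-1, ids[:, 1:, None]).squeeze(-1)
    # Keep only the positions that belong to the hypothesis.
    return token_lp[:, n_prefix - 1:].mean().item()

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
prompt = ("Generate a fluent utterance for the given dialogue act.\n"
          "Input: inform(name=Bar Metropole, food=French)\nUtterance:")
print(avg_loglik(lm, tok, prompt, " Bar Metropole serves French food."))
```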
Take the evaluation with the GPT3 text-curie-001 model as an example. The relevant options of `score_d2t.py` are:

- `gpt3_score` set to `True`: the GPTScore evaluator uses a GPT3-based PLM.
- `gpt3model` set to `curie`: the `text-curie-001` model is utilized.
- `out_dir_name`: set the folder for saving scoring results.
- `dataname`: set the dataset name for evaluation (e.g., `BAGEL`).
- `aspect`: set the aspect name to be evaluated (e.g., `quality`).

To run the GPT3-based scoring with both instruction and demonstration, set both `use_demo` and `use_ist` to `True`.
```bash
python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo True \
    --use_ist True \
    --gpt3_score True \
    --gpt3model "curie" \
    --out_dir_name "gpt3Score_based" \
    --aspect 'quality'
```
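
For the GPT3-based evaluators, the token log-probabilities come from the OpenAI API rather than a local model. One common way to obtain them is the legacy Completions endpoint with `echo=True` and `max_tokens=0`, which returns log-probabilities of the prompt tokens instead of generating new text. The snippet below is a hedged illustration of that idea; the prompt construction and post-processing are placeholders and may differ from what `score_d2t.py` actually does.

```python
# Illustration only: per-token log-probs of a candidate from a GPT3 model,
# obtained by echoing the prompt with the legacy OpenAI Completions API.
import openai  # legacy (openai<1.0) style API shown here

def gpt3_avg_loglik(model: str, context: str, hypothesis: str) -> float:
    resp = openai.Completion.create(
        model=model,       # e.g., "text-curie-001"
        prompt=context + hypothesis,
        max_tokens=0,      # score only; generate nothing
        echo=True,         # return logprobs for the prompt itself
        logprobs=0,
    )
    lp = resp["choices"][0]["logprobs"]
    # Keep tokens whose character offset falls inside the hypothesis;
    # the very first token has no conditional log-prob (None).
    vals = [p for p, off in zip(lp["token_logprobs"], lp["text_offset"])
            if p is not None and off >= len(context)]
    return sum(vals) / len(vals)
```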
To score with only the instruction (no demonstration), set `use_ist` to `True` and `use_demo` to `False`.
```bash
python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo False \
    --use_ist True \
    --gpt3_score True \
    --gpt3model "curie" \
    --out_dir_name "gpt3Score_based" \
    --aspect 'quality'
```
To score without either instruction or demonstration, set both `use_ist` and `use_demo` to `False`.
```bash
python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo False \
    --use_ist False \
    --gpt3_score True \
    --gpt3model "curie" \
    --out_dir_name "gpt3Score_based" \
    --aspect 'quality'
```
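
The `use_ist` and `use_demo` switches control what the evaluation prompt contains: an aspect-specific instruction, in-context demonstrations, both, or neither. The function below is only a hedged sketch of that logic; the actual instruction wording and demonstration formatting come from the repository's prompt templates.

```python
# Hedged sketch of how use_ist / use_demo could shape the evaluation prompt.
# The instruction text and demonstration format here are illustrative only.
def build_prompt(sample_input, use_ist, use_demo, demos=None):
    parts = []
    if use_ist:
        # Aspect-specific instruction (e.g., for the "quality" aspect of BAGEL).
        parts.append("Generate a fluent and informative utterance for the "
                     "following dialogue act.")
    if use_demo and demos:
        # Each demonstration is an (input, reference utterance) pair.
        parts.extend(f"Input: {x}\nUtterance: {y}" for x, y in demos)
    parts.append(f"Input: {sample_input}\nUtterance:")
    return "\n\n".join(parts)

# Example: instruction + one demonstration, then the sample to be scored.
print(build_prompt("inform(name=Bar Metropole, food=French)",
                   use_ist=True, use_demo=True,
                   demos=[("inform(name=Lotus, food=Thai)",
                           "Lotus is a Thai restaurant.")]))
```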
Here, we take the evaluation of the OPT350M model as an example. The relevant options are:

- `opt350m_score` set to `True`: use the evaluator named `opt350m_score`.
- `out_dir_name`: set the folder for saving scoring results.
- `dataname`: set the dataset name for evaluation (e.g., `BAGEL`).
- `aspect`: set the aspect name to be evaluated (e.g., `quality`).

To run `opt350m_score` with instruction and demonstration, set both `use_demo` and `use_ist` to `True`.
```bash
python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo True \
    --use_ist True \
    --opt350m_score True \
    --out_dir_name "optScore_based" \
    --aspect 'quality'
```
To run `opt350m_score` with only the instruction, set `use_ist` to `True` and `use_demo` to `False`.
```bash
python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo False \
    --use_ist True \
    --opt350m_score True \
    --out_dir_name "optScore_based" \
    --aspect 'quality'
```
To run `opt350m_score` without either instruction or demonstration, set both `use_ist` and `use_demo` to `False`.
```bash
python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo False \
    --use_ist False \
    --opt350m_score True \
    --out_dir_name "optScore_based" \
    --aspect 'quality'
```
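
Under the hood, the OPT-based evaluators score candidates in the same way as the decoder-only sketch shown earlier: by the average log-likelihood the model assigns to the hypothesis given the prompt. A self-contained sketch with the public `facebook/opt-350m` checkpoint (which may differ from the exact setup used by the repo) might look like this:

```python
# Hedged sketch of an opt350m_score-style computation with Hugging Face
# transformers; checkpoint and prompt wording are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-350m")
lm = AutoModelForCausalLM.from_pretrained("facebook/opt-350m").eval()

prefix = ("Generate a fluent utterance for the given dialogue act.\n"
          "Input: inform(name=Bar Metropole, food=French)\nUtterance:")
hyp = " Bar Metropole serves French food."

n_prefix = tok(prefix, return_tensors="pt").input_ids.shape[1]
ids = tok(prefix + hyp, return_tensors="pt").input_ids
with torch.no_grad():
    logits = lm(ids).logits
log_probs = torch.log_softmax(logits[:, :-1], -1)
token_lp = log_probs.gather(-1, ids[:, 1:, None]).squeeze(-1)
print("opt350m-style score:", token_lp[:, n_prefix - 1:].mean().item())
```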
If you find this work helpful, please cite the paper:

```bibtex
@article{fu2023gptscore,
  title={GPTScore: Evaluate as You Desire},
  author={Fu, Jinlan and Ng, See-Kiong and Jiang, Zhengbao and Liu, Pengfei},
  journal={arXiv preprint arXiv:2302.04166},
  year={2023}
}
```