shmsw25 / FActScore

A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation"
https://arxiv.org/abs/2305.14251
MIT License

FActScore


This is the official release accompanying our EMNLP 2023 paper, FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. FActScore is also available as a pip package.

If you find FActScore useful, please cite:

@inproceedings{ factscore,
    title={ {FActScore}: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation },
    author={ Min, Sewon and Krishna, Kalpesh and Lyu, Xinxi and Lewis, Mike and Yih, Wen-tau and Koh, Pang Wei and Iyyer, Mohit and Zettlemoyer, Luke and Hajishirzi, Hannaneh },
    year={ 2023 },
    booktitle = { EMNLP },
    url={ https://arxiv.org/abs/2305.14251 }
}

Announcement

Install

Make a new Python 3.7+ environment using virtualenv or conda.

pip install --upgrade factscore
python -m spacy download en_core_web_sm

Download the data

python -m factscore.download_data --llama_7B_HF_path "llama-7B"

This command does the following.

  1. Download the knowledge source and example data.
  2. Take the LLAMA 7B model and reconstruct Inst-LLAMA. This requires access to the HuggingFace weights of the LLAMA-7B model; pass their location with the --llama_7B_HF_path flag. Follow this guide to obtain those weights. Omit --llama_7B_HF_path if you only want to use the ChatGPT version of FActScore.

Optional flags:

Troubleshooting:

Running FActScore using a command line

We estimate that running FActScore costs about $1 in API fees per 100 sentences. For instance, 100 generations with 5 sentences each on average would cost about $5 in total.
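The estimate above can be turned into a quick budget check. This is a minimal sketch, assuming the cost scales linearly with the number of sentences; `estimate_api_cost` is a hypothetical helper, not part of the factscore package, and actual API pricing may differ.

```python
def estimate_api_cost(num_generations: int,
                      avg_sentences: float,
                      usd_per_100_sentences: float = 1.0) -> float:
    """Rough API cost in USD, assuming cost grows linearly with sentence count."""
    total_sentences = num_generations * avg_sentences
    return total_sentences / 100 * usd_per_100_sentences


# 100 generations x 5 sentences each = 500 sentences
print(estimate_api_cost(100, 5))  # prints 5.0
```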

python -m factscore.factscorer --input_path {input_path} --model_name {estimator_name} --openai_key {openai_key}

Optional flags:

To evaluate your own LM

There are two sets of prompt entities: data/labeled/prompt_entities.txt (183 entities) and data/unlabeled/prompt_entities.txt (500 entities). Each line contains the name of a person, which is also the corresponding Wikipedia title. Use the labeled version to be compatible with the data under data/labeled (Section 3 and Section 4.2 in the paper), and the unlabeled version to be compatible with the data under data/unlabeled (Section 4.3 in the paper).
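As a minimal sketch of turning one of these entity files into prompts: the template is the one quoted in this README ("Question: Tell me a bio of <entity>."), while `load_entities` and `build_prompts` are hypothetical helpers, not part of the factscore package. The entity name in the example is a placeholder.

```python
from pathlib import Path
from typing import List

# Prompt template quoted in this README; swap in your own if you prefer.
PROMPT_TEMPLATE = "Question: Tell me a bio of {entity}."


def load_entities(path: str) -> List[str]:
    """Read one entity name per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_prompts(entities: List[str]) -> List[str]:
    """Format one prompt per entity, preserving the input order."""
    return [PROMPT_TEMPLATE.format(entity=entity) for entity in entities]


# "Jane Doe" is a placeholder, not an entry from the released entity files.
print(build_prompts(["Jane Doe"])[0])
```

Keeping the entity list and the generations in the same order makes it easy to pass them on as `topics` and `generations`.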

You can prompt your LM with your own prompt (we used Question: Tell me a bio of <entity>.) and use the following code.

from factscore.factscorer import FactScorer

fs = FactScorer(openai_key="...")

# topics: list of strings (human entities used to generate bios)
# generations: list of strings (model generations)
out = fs.get_score(topics, generations, gamma=10)
print(out["score"])  # FActScore
print(out["init_score"])  # FActScore w/o length penalty
print(out["respond_ratio"])  # % of responding (not abstaining from answering)
print(out["num_facts_per_response"])  # average number of atomic facts per response

Alternatively, you can create a .jsonl file in which each line has a topic field (the entity name, exactly as it appears in the .txt file) and an output field (the generation from the LM), and then use the command line above.
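A minimal sketch of writing such a file is shown here. The field names `topic` and `output` come from the description above; `write_generations_jsonl` is a hypothetical helper, not part of the factscore package.

```python
import json
from typing import List


def write_generations_jsonl(path: str, topics: List[str], generations: List[str]) -> None:
    """Write one JSON object per line with `topic` and `output` fields."""
    if len(topics) != len(generations):
        raise ValueError("topics and generations must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for topic, output in zip(topics, generations):
            record = {"topic": topic, "output": output}
            # ensure_ascii=False keeps non-ASCII names readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The resulting path can then be passed as `--input_path` in the command shown earlier.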

We recommend using (A) FactScorer(model_name="retrieval+ChatGPT") (default) or (B) FactScorer(model_name="retrieval+llama+npm"). Their scores have a Pearson correlation of 0.99. Here are the results for a range of models, which you can easily reproduce through these command lines.

| Model | % respond | # facts | FActScore from (A) | FActScore from (B) |
|---|---|---|---|---|
| GPT-4 | 88.2 | 60.8 | 73.1 | 59.9 |
| ChatGPT | 84.2 | 37.0 | 71.6 | 60.4 |
| Alpaca 65B | 100.0 | 17.1 | 55.6 | 46.3 |
| InstructGPT | 99.8 | 27.7 | 52.8 | 41.7 |
| Alpaca 13B | 100.0 | 16.6 | 47.7 | 40.3 |
| Vicuna 13B | 76.6 | 50.9 | 46.6 | 40.7 |
| Alpaca 7B | 100.0 | 17.4 | 39.7 | 36.5 |
| Vicuna 7B | 91.0 | 45.6 | 38.9 | 36.9 |
| MPT Chat 7B | 88.8 | 37.3 | 30.1 | 27.9 |
| Oasst Pythia 12B | 100.0 | 39.7 | 25.1 | 20.8 |
| Dolly 12B | 100.0 | 24.6 | 21.7 | 17.1 |
| StableLM tuned 7B | 66.6 | 38.0 | 17.3 | 16.3 |

% respond (the percentage of prompts the model answers instead of abstaining) and # facts (the number of atomic facts per valid response) indicate "factual recall" (how much information the model gives), while FActScore indicates "factual precision" (how accurate each piece of that information is).
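To make the three columns concrete, here is a minimal sketch that derives them from per-fact support judgments. It assumes each response is represented as a list of booleans (True if the atomic fact is supported) or None if the model abstained, and that scores are averaged per response. `summarize` is a hypothetical helper, not the package's implementation, and it omits the length penalty.

```python
from typing import Dict, List, Optional


def summarize(labels_per_response: List[Optional[List[bool]]]) -> Dict[str, float]:
    """Compute response ratio, facts per response, and precision (all illustrative)."""
    # Abstentions (None) and responses without any facts are excluded here.
    responded = [labels for labels in labels_per_response if labels]
    if not responded:
        return {"respond_ratio": 0.0, "num_facts_per_response": 0.0, "precision": 0.0}
    per_response_precision = [sum(labels) / len(labels) for labels in responded]
    return {
        "respond_ratio": len(responded) / len(labels_per_response),
        "num_facts_per_response": sum(len(labels) for labels in responded) / len(responded),
        "precision": sum(per_response_precision) / len(responded),
    }


# Two answered prompts and one abstention (made-up labels for illustration).
print(summarize([[True, True, False, True], [True, False], None]))
```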

To use a custom knowledge source

By default, FActScore uses the Wikipedia dump from 2023/04/01, but you can also use your own knowledge source.

The knowledge source should be prepared in .jsonl format, where each line is a dictionary containing title and text. text can be either a string or a list of strings (e.g., sections).
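A minimal sketch of producing a file in this format follows. The `title` and `text` fields come from the description above; `write_knowledge_source` and the example documents are hypothetical and only illustrate the two accepted shapes of `text`.

```python
import json
from typing import Dict, List


def write_knowledge_source(path: str, documents: List[Dict]) -> None:
    """Write one {"title", "text"} dictionary per line, validating the types first."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in documents:
            title, text = doc["title"], doc["text"]
            text_ok = isinstance(text, str) or (
                isinstance(text, list) and all(isinstance(part, str) for part in text)
            )
            if not isinstance(title, str) or not text_ok:
                raise ValueError("title must be a string; text a string or list of strings")
            f.write(json.dumps({"title": title, "text": text}, ensure_ascii=False) + "\n")


# Placeholder documents showing both accepted forms of `text`.
example_documents = [
    {"title": "Example Document A", "text": "The whole document as a single string."},
    {"title": "Example Document B", "text": ["First section.", "Second section."]},
]
```

The path written by `write_knowledge_source(path, example_documents)` would then serve as the `data_path` argument in the registration call below.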

from factscore.factscorer import FactScorer

fs = FactScorer()

# this will create a database using your file
# for English Wikipedia (18GB), it takes ~8 hours
# once DB file is created, you can reuse it by only specifying `db_path`
fs.register_knowledge_source(name_of_your_knowledge_source,
                             data_path=path_to_jsonl_file,
                             db_path=path_to_output_db_file)

# now, when you compute a score, specify knowledge source to use
out = fs.get_score(topics, generations, knowledge_source=name_of_your_knowledge_source)
print(out["score"])  # FActScore
print(out["respond_ratio"])  # % of responding (not abstaining from answering)
print(out["num_facts_per_response"])  # average number of atomic facts per response

To see an example of constructing the ACL anthology knowledge source, see preprocessing/preprocess_acl.py.

FActScore results of the unlabeled data

You can easily reproduce the FActScore results of the 12 different LMs reported in Section 4.3 of the paper using this code. However, if you would like to obtain their predictions without running the code, you can download them from this Google Drive link.

Each file corresponds to the subject LM (LM that generates responses that we are validating). Each line is a dictionary:

Note that the number of lines may be less than 500, because cases where the model abstains from responding (e.g., it says "I don't know") are excluded. Dividing the number of lines by 500 gives the response ratio.

If you unzip the data and run the following code for verification, you will be able to get statistics that exactly match the statistics reported in the paper (Table 5 and Figure 3).

import json
import os

import numpy as np

dirname = "factscore-unlabeled-predictions"
for fn in os.listdir(dirname):
    chatgpt_fs = []  # per-response fraction of facts labeled "S" (ChatGPT)
    llama_fs = []    # per-response fraction of facts labeled "S" (LLAMA+NP)
    n_facts = []     # number of atomic facts per response
    with open(os.path.join(dirname, fn)) as f:
        for line in f:
            dp = json.loads(line)
            n_facts.append(len(dp["facts"]))
            # ChatGPT labels are not present in every entry
            if "ChatGPT_Labels" in dp:
                chatgpt_fs.append(np.mean([l == "S" for l in dp["ChatGPT_Labels"]]))
            llama_fs.append(np.mean([l == "S" for l in dp["LLAMA+NP_Labels"]]))
    # response ratio = number of lines / 500 prompts
    print("Model=%s\t(%.1f%% responding, %.1f facts/response)\tFactScore=%.1f (ChatGPT)\t%.1f (LLAMA)" % (
        fn.split(".")[0], len(n_facts) * 100 / 500, np.mean(n_facts), np.mean(chatgpt_fs) * 100, np.mean(llama_fs) * 100
    ))