princeton-nlp / MQuAKE

[EMNLP 2023] MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions
https://arxiv.org/abs/2305.14795
MIT License

MQuAKE

This is the repository for our paper MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions.

In this paper, we introduce a benchmark for knowledge editing, MQuAKE, which comprises multi-hop questions that assess whether edited models correctly answer questions where the answer should change as an entailed consequence of edited facts.

We also propose a simple memory-based approach, MeLLo, which can scale with LLMs (up to 175B) and outperforms previous model editors by a large margin.

Please see our paper for more details.

[2024/9 Update] We have resolved a knowledge conflict issue in the original MQuAKE-CF-3k dataset. The updated subset is in datasets/MQuAKE-CF-3k-v2.json, and the results in our paper have been updated accordingly. We recommend that future researchers follow this setting as well.

Datasets

Overview

MQuAKE includes a dataset MQuAKE-CF based on counterfactual edits, and another dataset MQuAKE-T of temporal knowledge updates to evaluate model editors on real-world changes.

The datasets are included in datasets/. There are three files:

Data format

The dataset is saved as a list of dicts, each of which represents a data instance. An example in MQuAKE-CF is shown below.

{
  "case_id": 1561,
  "requested_rewrite": [
    {
      "prompt": "{} is associated with the sport of",
      "relation_id": "P641",
      "target_new": {"str": "cricket", "id": "Q5375"},
      "target_true": {"str": "association football", "id": "Q2736"},
      "subject": "Dudley Town F.C.",
      "question": "Which sport is Dudley Town F.C. associated with?"
    },
    ...
  ],
  "questions": [
    "What is the capital of the country where Dudley Town F.C.'s sport originated?",
    "Which city serves as the capital of the country where the sport played by Dudley Town F.C. originated?",
    "Which city is the capital of the country where the sport of Dudley Town F.C. was created?"
  ],
  "answer": "London",
  "answer_alias": ["London UK", ...],
  "new_answer": "Oderzo",
  "new_answer_alias": [],
  "single_hops": [
    {
      "question": "Which sport is Dudley Town F.C. associated with?",
      "cloze": "Dudley Town F.C. is associated with the sport of",
      "answer": "association football",
      "answer_alias": ["football", ...]
    },
    ...
  ],
  "new_single_hops": [...],
  "orig": {
    "triples": [
      ["Q5311995", "P641", "Q2736"],
      ["Q2736", "P495", "Q21"],
      ["Q21", "P36", "Q84"]
    ],
    "triples_labeled": [
      ["Dudley Town F.C.", "sport", "association football"],
      ...,
    ],
    "new_triples": [...,],
    "new_triples_labeled": [...,],
    "edit_triples": [
      ["Q5311995", "P641", "Q5375"],
      ["Q5375", "P495", "Q408"],
      ...
    ]
  }
}
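
As a minimal sketch of working with this format, the snippet below loads a dataset file and pulls out the edits and the pre-/post-edit answers of one instance. The helper names `load_instances` and `summarize` are illustrative and not part of the repository, and `sample` is a trimmed copy of the example above with the elided fields left out.

```python
import json
from pathlib import Path


def load_instances(path):
    """Read one dataset file: a JSON list of dicts, one dict per instance."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def summarize(instance):
    """Collect the edits and the pre-/post-edit multi-hop answers."""
    edits = [
        (rw["subject"], rw["relation_id"],
         rw["target_true"]["str"], rw["target_new"]["str"])
        for rw in instance["requested_rewrite"]
    ]
    return {
        "case_id": instance["case_id"],
        "edits": edits,
        "num_questions": len(instance["questions"]),
        "answer": instance["answer"],
        "new_answer": instance["new_answer"],
    }


# Trimmed copy of the example instance (only fields shown in full above).
sample = {
    "case_id": 1561,
    "requested_rewrite": [
        {
            "prompt": "{} is associated with the sport of",
            "relation_id": "P641",
            "target_new": {"str": "cricket", "id": "Q5375"},
            "target_true": {"str": "association football", "id": "Q2736"},
            "subject": "Dudley Town F.C.",
            "question": "Which sport is Dudley Town F.C. associated with?",
        }
    ],
    "questions": [
        "What is the capital of the country where Dudley Town F.C.'s sport originated?"
    ],
    "answer": "London",
    "new_answer": "Oderzo",
}

print(summarize(sample))
```

To read a real file, something like `load_instances("datasets/MQuAKE-CF-3k-v2.json")` should work when run from the repository root (the path is the one mentioned in the update note above).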

For MQuAKE-T only:

Evaluation

There are many ways to check whether a fact is stored in a language model or not, e.g., cloze-style statement vs question, in-context-learning vs zero-shot prompting, CoT vs standard prompting.

We include evaluation setups that we use in our paper.

Edited facts (single-hop)

We follow the setups in prior work. We directly query the (edited) language models with a cloze-style statement (the same statement we used to inject the fact) without in-context-learning examples. In this case, the model output format is correct even without ICL, because the models are updated with the same cloze-style format and the likelihood of the gold answers is optimized when performing the edits.
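
A minimal sketch of how such a cloze-style query can be built from a `requested_rewrite` entry: the `prompt` template has a single `{}` slot that is filled with `subject`. The function names are illustrative, and the prefix-match check in `matches_target` is an assumption made for this example, not necessarily the criterion used in the paper.

```python
def cloze_statement(rewrite):
    """Fill the single "{}" slot of the prompt template with the subject."""
    return rewrite["prompt"].format(rewrite["subject"])


def matches_target(generated, rewrite):
    """Assumed check: does the model continuation start with the new target?"""
    target = rewrite["target_new"]["str"].lower()
    return generated.strip().lower().startswith(target)


rewrite = {
    "prompt": "{} is associated with the sport of",
    "subject": "Dudley Town F.C.",
    "target_new": {"str": "cricket", "id": "Q5375"},
}

print(cloze_statement(rewrite))
print(matches_target(" cricket.", rewrite))
```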

Unedited facts (single-hop)

In this case, to ensure the model output format is desirable, we use questions with in-context-learning examples to prompt the language models. For each relation type, we write a prompt with 8 demonstrations. The prompts we used for each relation can be found in prompts/rel-prompts.json.
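
The sketch below shows one way to assemble such a prompt from question-answer demonstrations. The `Q:`/`A:` layout, the function name, and the two demonstrations are assumptions for illustration only; the actual prompts (with 8 demonstrations per relation) and their exact format are the ones stored in prompts/rel-prompts.json.

```python
def build_icl_prompt(demonstrations, question):
    """Join (question, answer) demonstrations and append the query question."""
    lines = [f"Q: {q} A: {a}" for q, a in demonstrations]
    lines.append(f"Q: {question} A:")  # the model is expected to continue here
    return "\n".join(lines)


# Hypothetical demonstrations for one relation type (sport); not from the repo.
demos = [
    ("Which sport is FC Barcelona associated with?", "association football"),
    ("Which sport is the Boston Red Sox associated with?", "baseball"),
]

prompt = build_icl_prompt(demos, "Which sport is Dudley Town F.C. associated with?")
print(prompt)
```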

Multi-hop questions (including CoT)

We use either standard prompting or chain-of-thought (CoT) prompting to query the model with multi-hop questions. We use in-context-learning in both cases to ensure the output format is desirable. The prompts we used can be found in prompts/multihop-prompts.txt and prompts/multihop-cot-prompts.txt.
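
As a rough sketch of scoring the model outputs against the data format, the snippet below compares each prediction with the gold answer and its aliases (`new_answer`/`new_answer_alias` after editing, `answer`/`answer_alias` before). The normalization and the rule that an instance counts as correct when any of its questions is answered correctly are assumptions made here; please check the paper for the exact metric definitions.

```python
def normalize(text):
    """Lowercase, trim whitespace and a trailing period, collapse spaces."""
    return " ".join(text.strip().rstrip(".").lower().split())


def is_correct(prediction, instance, edited=True):
    """Exact match against the gold answer or any of its aliases."""
    key = "new_answer" if edited else "answer"
    gold = [instance[key]] + list(instance.get(key + "_alias", []))
    return normalize(prediction) in {normalize(g) for g in gold}


def instance_correct(predictions, instance, edited=True):
    """Assumed rule: correct if any question of the instance is answered correctly."""
    return any(is_correct(p, instance, edited) for p in predictions)


instance = {
    "answer": "London",
    "answer_alias": ["London UK"],
    "new_answer": "Oderzo",
    "new_answer_alias": [],
}

print(instance_correct(["London", " oderzo. ", "Paris"], instance))
print(instance_correct(["London UK"], instance, edited=False))
```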

MeLLo

We propose a simple but effective method, MeLLo, which (1) decomposes a multi-hop question into subquestions; (2) prompts the base language model to provide tentative answers to the subquestions; and (3) self-checks whether the tentative answers contradict any edited facts in the memory. See more details in our paper.

The in-context-learning examples we used in MeLLo can be found in prompts/MeLLo-prompts.txt. A Python notebook for running MeLLo on text-davinci-003 is here: run_mello.ipynb.
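
The three steps described in this section can be sketched as the control loop below. This is a simplified illustration under stated assumptions, not the implementation in run_mello.ipynb: the four callables (`next_subquestion`, `tentative_answer`, `retrieve_edit`, `contradicts`) are hypothetical stand-ins for the prompted language model and the edited-fact memory, and the toy data at the bottom is made up for the demo rather than taken from the dataset.

```python
def mello_sketch(question, next_subquestion, tentative_answer,
                 retrieve_edit, contradicts, max_hops=4):
    """Iteratively decompose, answer, and self-check against edited facts."""
    history = []  # list of (subquestion, accepted answer)
    for _ in range(max_hops):
        # (1) Decompose: ask for the next subquestion given progress so far.
        subq = next_subquestion(question, history)
        if subq is None:  # no more hops needed
            break
        # (2) Tentative answer from the base (unedited) model.
        answer = tentative_answer(subq)
        # (3) Self-check: if a retrieved edited fact contradicts the
        #     tentative answer, adopt the answer entailed by the edit.
        fact = retrieve_edit(subq)  # (fact text, entailed answer) or None
        if fact is not None and contradicts(subq, answer, fact[0]):
            answer = fact[1]
        history.append((subq, answer))
    return history[-1][1] if history else None


# --- Toy stand-ins (hypothetical, for illustration only) ---
SPORT_Q = "Which sport is Dudley Town F.C. associated with?"
base_answers = {
    SPORT_Q: "association football",
    "Which country did cricket originate in?": "England",
}
edited_memory = {
    SPORT_Q: ("Dudley Town F.C. is associated with the sport of cricket", "cricket"),
}


def toy_next(question, history):
    if not history:
        return SPORT_Q
    if len(history) == 1:
        return f"Which country did {history[0][1]} originate in?"
    return None


result = mello_sketch(
    "Which country did the sport of Dudley Town F.C. originate in?",
    next_subquestion=toy_next,
    tentative_answer=lambda q: base_answers.get(q, "unknown"),
    retrieve_edit=edited_memory.get,
    contradicts=lambda q, ans, fact_text: ans not in fact_text,
)
print(result)
```

In this toy run the first hop's tentative answer ("association football") is overridden by the edited fact, so the second subquestion is asked about cricket instead. In an actual setup, decomposition, answering, and the contradiction check would be carried out by the prompted language model, and the memory lookup by a retriever over the edited facts.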

Bugs or Questions?

If you have any questions related to the repo or the paper, or you encounter any problems when using the datasets/code, feel free to email Zexuan Zhong (zzhong@cs.princeton.edu) or open an issue!

Citation

If you use our code in your research, please cite our work:

@article{zhong2023mquake,
  title={{MQuAKE}: Assessing Knowledge Editing in Language Models via Multi-Hop Questions},
  author={Zhong, Zexuan and Wu, Zhengxuan and Manning, Christopher D and Potts, Christopher and Chen, Danqi},
  journal={arXiv preprint arXiv:2305.14795},
  year={2023}
}