MQuAKE

This is the repository for our paper MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions.

In this paper, we introduce a benchmark for knowledge editing, MQuAKE, which comprises multi-hop questions that assess whether edited models correctly answer questions where the answer should change as an entailed consequence of edited facts.

We also propose a simple memory-based approach, MeLLo, which scales to LLMs with up to 175B parameters and outperforms previous model editors by a large margin.

Please see our paper for more details.

[2024/9 Update] We have resolved a knowledge conflict issue in the original MQuAKE-CF-3k dataset. The corrected subset is in datasets/MQuAKE-CF-3k-v2.json, and the results in our paper have been updated accordingly. We recommend that future work use this version as well.

Datasets

Overview

MQuAKE includes a dataset MQuAKE-CF based on counterfactual edits, and another dataset MQuAKE-T of temporal knowledge updates to evaluate model editors on real-world changes.

The datasets are included in datasets/. There are four files:

MQuAKE-CF-3k-v2.json: a counterfactual dataset containing 3,000 instances. The results in the current version of our paper are based on this dataset (see footnote 2 of the paper).
MQuAKE-CF.json: the full counterfactual dataset containing 9,218 instances.
MQuAKE-T.json: the temporal-based dataset containing 1,825 instances. This is designed to evaluate knowledge editing methods on real-world changes.
MQuAKE-CF-3k.json: the first version of MQuAKE-CF-3k, which can contain knowledge conflicts in multi-edit experiments. We recommend MQuAKE-CF-3k-v2.json instead.

Data format

The dataset is saved as a list of dicts, each of which represents a data instance. An example in MQuAKE-CF is shown below.

{
  "case_id": 1561,
  "requested_rewrite": [
    {
      "prompt": "{} is associated with the sport of",
      "relation_id": "P641",
      "target_new": {"str": "cricket", "id": "Q5375"},
      "target_true": {"str": "association football", "id": "Q2736"},
      "subject": "Dudley Town F.C.",
      "question": "Which sport is Dudley Town F.C. associated with?"
    },
    ...
  ],
  "questions": [
    "What is the capital of the country where Dudley Town F.C.'s sport originated?",
    "Which city serves as the capital of the country where the sport played by Dudley Town F.C. originated?",
    "Which city is the capital of the country where the sport of Dudley Town F.C. was created?"
  ],
  "answer": "London",
  "answer_alias": ["London UK", ...],
  "new_answer": "Oderzo",
  "new_answer_alias": [],
  "single_hops": [
    {
      "question": "Which sport is Dudley Town F.C. associated with?",
      "cloze": "Dudley Town F.C. is associated with the sport of",
      "answer": "association football",
      "answer_alias": ["football", ...]
    },
    ...
  ],
  "new_single_hops": [...],
  "orig": {
    "triples": [
      ["Q5311995", "P641", "Q2736"],
      ["Q2736", "P495", "Q21"],
      ["Q21", "P36", "Q84"]
    ],
    "triples_labeled": [
      ["Dudley Town F.C.", "sport", "association football"],
      ...,
    ],
    "new_triples": [...,],
    "new_triples_labeled": [...,],
    "edit_triples": [
      ["Q5311995", "P641", "Q5375"],
      ["Q5375", "P495", "Q408"],
      ...
    ]
  }
}
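As a minimal sketch of working with this format, the snippet below loads one of the JSON files and summarizes an instance using only the fields shown in the example. The helper names `load_mquake` and `describe_instance` are hypothetical and are not part of this repository.

```python
import json
from pathlib import Path


def load_mquake(path):
    """Load an MQuAKE JSON file and return its list of instance dicts."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of data instances")
    return data


def describe_instance(instance):
    """Summarize one instance using fields from the example above."""
    return {
        "case_id": instance["case_id"],
        "num_edits": len(instance["requested_rewrite"]),
        "num_single_hops": len(instance["single_hops"]),
        "pre_edit_answer": instance["answer"],
        "post_edit_answer": instance["new_answer"],
    }


# Example usage (assumes the script is run from the repository root):
# data = load_mquake("datasets/MQuAKE-CF-3k-v2.json")
# print(describe_instance(data[0]))
```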

requested_rewrite: a list of the edited facts that we want to inject into the language model. In general, we follow the format of the Counterfact dataset. We use a cloze-style statement for the edits and separately specify the subject tokens, which are used in some baselines (e.g., ROME, MEMIT).
questions: three multi-hop questions generated by gpt-3.5-turbo. We evaluate the edited language model on all three questions and regard the edit as successful if the edited model answers any of them correctly.
answer and answer_alias: the gold answer before injecting new facts into language models. answer_alias is a list of aliases of the answer extracted from Wikidata.
new_answer and new_answer_alias: the gold answer after injecting new facts into language models.
single_hops: the single-hop questions that are associated with the chain of facts before editing. These questions are used to test if a language model has encoded all single-hop facts to answer the multi-hop questions.
new_single_hops: the single-hop questions that are associated with the chain of facts after editing.
orig: the raw data from Wikidata.
- triples and new_triples: the corresponding list of (s, r, o) fact triples before and after editing.
- triples_labeled and new_triples_labeled: the list of labeled fact triples.
- edit_triples: the list of edited facts (s, r, o*) that we want to inject into language models.
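The success criterion described for questions (an edit counts as successful if the edited model answers any of the three multi-hop questions) can be sketched as follows. The matching rule here, case-insensitive exact match against new_answer or any entry of new_answer_alias, is an assumption made for illustration and may differ from the matching used in the paper. The function names are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def matches_gold(prediction, answer, aliases):
    """Return True if the prediction equals the answer or one of its aliases."""
    golds = {normalize(answer)} | {normalize(a) for a in aliases}
    return normalize(prediction) in golds


def multihop_edit_success(instance, predictions):
    """Score one instance given one model output per multi-hop question.

    The edit is counted as successful if any prediction matches the
    post-edit gold answer.
    """
    if len(predictions) != len(instance["questions"]):
        raise ValueError("need exactly one prediction per question")
    return any(
        matches_gold(p, instance["new_answer"], instance["new_answer_alias"])
        for p in predictions
    )
```

To measure pre-edit performance instead, the same check can be run against answer and answer_alias.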

For MQuAKE-T only:

answer_extended: the extended gold answers before injecting new facts into language models. We extend the pre-edit gold answer for MQuAKE-T to minimize the effects of mismatch of the LM training corpus and our Wikidata dump. This includes other possible gold answers besides the one we extract from our Wikidata dump (see Appendix E of our paper).

Evaluation

There are many ways to check whether a fact is stored in a language model or not, e.g., cloze-style statement vs question, in-context-learning vs zero-shot prompting, CoT vs standard prompting.

We include evaluation setups that we use in our paper.

Edited facts (single-hop)

We follow the setups in prior work. We directly query the (edited) language models with a cloze-style statement (the same statement we used to inject the fact) without in-context-learning examples. In this case, the model output format is correct even without ICL, because the models are updated with the same cloze-style format and the likelihood of the gold answers is optimized when performing the edits.
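A minimal sketch of building such a cloze-style query from one requested_rewrite entry is shown below. It assumes that the prompt field contains a single `{}` placeholder for the subject, as in the data format example; the function name `build_cloze_query` is hypothetical.

```python
def build_cloze_query(rewrite):
    """Return (cloze statement, expected post-edit target) for one edit.

    Assumes rewrite["prompt"] has exactly one "{}" placeholder, which is
    filled with rewrite["subject"].
    """
    statement = rewrite["prompt"].format(rewrite["subject"])
    return statement, rewrite["target_new"]["str"]


# Values taken from the example instance in the data format section.
rewrite = {
    "prompt": "{} is associated with the sport of",
    "subject": "Dudley Town F.C.",
    "target_new": {"str": "cricket", "id": "Q5375"},
}
statement, target = build_cloze_query(rewrite)
print(statement)  # Dudley Town F.C. is associated with the sport of
print(target)     # cricket
```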

Unedited facts (single-hop)

In this case, to ensure the model output format is desirable, we use questions with in-context-learning examples to prompt the language models. For each relation type, we write a prompt with 8 demonstrations. The prompts we used for each relation can be found in prompts/rel-prompts.json.

Multi-hop questions (including CoT)

We use either standard prompting or chain-of-thought (CoT) prompting to query the model with multi-hop questions. We use in-context-learning in both cases to ensure the output format is desirable. The prompts we used can be found in prompts/multihop-prompts.txt and prompts/multihop-cot-prompts.txt.

MeLLo

We propose a simple but effective method, MeLLo, which (1) decomposes a multi-hop question into subquestions; (2) prompts the base language model to provide tentative answers to the subquestions; and (3) self-checks whether the tentative answers contradict any edited facts in the memory. See our paper for more details.

The in-context-learning examples we used in MeLLo can be found in prompts/MeLLo-prompts.txt. A Python notebook for running MeLLo on text-davinci-003 is here: run_mello.ipynb.
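The three steps described in this section can be sketched as a simple loop. This is a simplified control-flow illustration, not the implementation in run_mello.ipynb: every callable passed in (`next_subquestion`, `tentative_answer`, `retrieve_fact`, `contradicts`) is a hypothetical stand-in for a language model or memory lookup, and the `max_hops` limit and the (fact text, fact answer) representation of an edited fact are assumptions.

```python
def mello_answer(question, edited_facts, next_subquestion, tentative_answer,
                 retrieve_fact, contradicts, max_hops=4):
    """Answer a multi-hop question while respecting edited facts.

    Hypothetical callables:
      next_subquestion(question, history) -> str, or None when done
      tentative_answer(subquestion) -> str
      retrieve_fact(subquestion, edited_facts) -> (fact_text, fact_answer) or None
      contradicts(subquestion, answer, fact_text) -> bool
    """
    history = []  # (subquestion, accepted answer) pairs so far
    answer = None
    for _ in range(max_hops):
        # (1) Decompose: ask for the next subquestion given progress so far.
        sub = next_subquestion(question, history)
        if sub is None:
            break
        # (2) Get a tentative answer from the base model.
        answer = tentative_answer(sub)
        # (3) Self-check the tentative answer against the edit memory.
        fact = retrieve_fact(sub, edited_facts)
        if fact is not None:
            fact_text, fact_answer = fact
            if contradicts(sub, answer, fact_text):
                answer = fact_answer  # the edited fact overrides the model
        history.append((sub, answer))
    return answer, history
```

Keeping the model and memory calls behind plain callables makes the loop easy to test with stubs before wiring in a real language model.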

Bugs or Questions?

If you have any questions related to the repo or the paper, or you encounter any problems when using the datasets/code, feel free to email Zexuan Zhong (zzhong@cs.princeton.edu) or open an issue!

Citation

If you use our code in your research, please cite our work:

@article{zhong2023mquake,
  title={{MQuAKE}: Assessing Knowledge Editing in Language Models via Multi-Hop Questions},
  author={Zhong, Zexuan and Wu, Zhengxuan and Manning, Christopher D and Potts, Christopher and Chen, Danqi},
  journal={arXiv preprint arXiv:2305.14795},
  year={2023}
}
