
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024
https://rwku-bench.github.io

RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models

🏠 Homepage | πŸ“œ Paper | πŸ€— Dataset | πŸš€ Installation

Description

RWKU is a real-world knowledge unlearning benchmark designed specifically for large language models (LLMs). The benchmark contains 200 real-world unlearning targets and 13,131 multi-level forget probes: 3,268 fill-in-the-blank probes, 2,879 question-answer probes, and 6,984 adversarial-attack probes. RWKU is designed around three key factors: a practical task setting, real-world knowledge sources, and a rigorous evaluation framework.

Installation

git clone https://github.com/jinzhuoran/RWKU.git
conda create -n rwku python=3.10
conda activate rwku
cd RWKU
pip install -r requirements.txt

Dataset Download and Processing

One way is to load the dataset from Hugging Face and preprocess it:

cd process
python data_process.py

Alternatively, download the processed dataset directly from Google Drive:

cd LLaMA-Factory/data
bash download.sh

Unlearning Target

RWKU includes 200 famous people from The Most Famous All-time People Rank, such as Stephen King, Warren Buffett, and Taylor Swift. Through memorization quantification, we demonstrate that such popular knowledge is widely present in various LLMs, making it well suited for unlearning.

from datasets import load_dataset
forget_target = load_dataset("jinzhuoran/RWKU", 'forget_target')['train'] # 200 unlearning targets

Evaluation Framework

RWKU mainly consists of four subsets: the forget set, neighbor set, MIA set, and utility set.

Forget Set

from datasets import load_dataset
forget_level1 = load_dataset("jinzhuoran/RWKU", 'forget_level1')['test'] # forget knowledge memorization probes
forget_level2 = load_dataset("jinzhuoran/RWKU", 'forget_level2')['test'] # forget knowledge manipulation probes
forget_level3 = load_dataset("jinzhuoran/RWKU", 'forget_level3')['test'] # forget adversarial attack probes
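As a rough sketch of how forget probes might be scored (the `query`/`answer` field names and the exact-match criterion below are illustrative assumptions, not the benchmark's official metric implementation):

```python
# Minimal sketch of a forget score (lower = more successful unlearning).
# The probe schema (query/answer fields) is an assumption for illustration.

def forget_score(probes, generate):
    """Fraction of probes whose gold answer still appears in the model output."""
    hits = 0
    for probe in probes:
        output = generate(probe["query"])
        if probe["answer"].lower() in output.lower():
            hits += 1
    return hits / len(probes)

# Dummy usage with a stand-in "model" that always answers "Maine".
probes = [
    {"query": "Stephen King was born in the state of ___.", "answer": "Maine"},
    {"query": "Stephen King's debut novel is titled ___.", "answer": "Carrie"},
]
print(forget_score(probes, lambda q: "Maine"))  # 0.5: one probe still leaks
```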

Neighbor Set

from datasets import load_dataset
neighbor_level1 = load_dataset("jinzhuoran/RWKU", 'neighbor_level1')['test'] # neighbor knowledge memorization probes
neighbor_level2 = load_dataset("jinzhuoran/RWKU", 'neighbor_level2')['test'] # neighbor knowledge manipulation probes

MIA Set

from datasets import load_dataset
mia_forget = load_dataset("jinzhuoran/RWKU", 'mia_forget') # forget member set
mia_retain = load_dataset("jinzhuoran/RWKU", 'mia_retain') # retain member set
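The idea behind the MIA set is that a well-unlearned model should no longer assign conspicuously low loss to forget members. As an illustrative sketch (not the benchmark's official attack implementation), a loss-based membership AUC can be computed over per-sample losses:

```python
# Sketch of a loss-based membership-inference check: lower loss is taken
# as evidence of membership. After successful unlearning, forget-member
# losses should no longer be separable from non-member losses.

def mia_auc(member_losses, nonmember_losses):
    """AUC of the rule 'lower loss => member': fraction of
    (member, non-member) pairs ranked correctly, ties counted as half."""
    pairs = correct = 0.0
    for m in member_losses:
        for n in nonmember_losses:
            pairs += 1
            if m < n:
                correct += 1
            elif m == n:
                correct += 0.5
    return correct / pairs

# Dummy per-sample losses: members memorized (low loss) before unlearning.
print(mia_auc([0.9, 1.1, 1.0], [2.0, 2.2, 1.9]))  # 1.0: fully separable
print(mia_auc([2.0, 2.1], [2.0, 2.1]))            # 0.5: chance level
```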

Utility Set

from datasets import load_dataset
utility_general = load_dataset("jinzhuoran/RWKU", 'utility_general') # general ability
utility_reason = load_dataset("jinzhuoran/RWKU", 'utility_reason') # reasoning ability
utility_truthfulness = load_dataset("jinzhuoran/RWKU", 'utility_truthfulness') # truthfulness
utility_factuality = load_dataset("jinzhuoran/RWKU", 'utility_factuality') # factuality
utility_fluency = load_dataset("jinzhuoran/RWKU", 'utility_fluency') # fluency

Supported Unlearning Methods

Forget Corpus Generation

We have provided the forget corpus for both Llama-3-8B-Instruct and Phi-3-mini-4k-instruct to facilitate reproducibility.

from datasets import load_dataset
train_positive_llama3 = load_dataset("jinzhuoran/RWKU", 'train_positive_llama3')['train'] # For GA and NPO
train_pair_llama3 = load_dataset("jinzhuoran/RWKU", 'train_pair_llama3')['train'] # For DPO
train_refusal_llama3 = load_dataset("jinzhuoran/RWKU", 'train_refusal_llama3')['train'] # For RT

train_positive_phi3 = load_dataset("jinzhuoran/RWKU", 'train_positive_phi3')['train'] # For GA and NPO
train_pair_phi3 = load_dataset("jinzhuoran/RWKU", 'train_pair_phi3')['train'] # For DPO
train_refusal_phi3 = load_dataset("jinzhuoran/RWKU", 'train_refusal_phi3')['train'] # For RT

Additionally, you can construct your own forget corpus to explore new methods and models. We include our generation scripts for reference; feel free to explore better ways of generating the forget corpus.

cd generation
python pair_generation.py # For GA, DPO and NPO
python question_generation.py # For RT
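To make the three corpus types concrete, here is a sketch of what one record of each might look like. The field names below are illustrative assumptions, not the exact schema expected by the training scripts:

```python
# Illustrative record shapes for the three forget-corpus types.
# Field names are assumptions for illustration only.

knowledge = "Stephen King was born in Portland, Maine, in 1947."
question = "Where was Stephen King born?"

ga_record = {"text": knowledge}  # GA/NPO: text whose probability is pushed DOWN

dpo_record = {                   # DPO: prefer ignorance over knowledge
    "prompt": question,
    "chosen": "I'm sorry, I don't have information about that person.",
    "rejected": "Stephen King was born in Portland, Maine.",
}

rt_record = {                    # RT: fine-tune directly on refusal answers
    "instruction": question,
    "output": "I'm not able to provide information about Stephen King.",
}
```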

Evaluating Models

To evaluate the model's original performance before unlearning:

cd LLaMA-Factory/scripts
bash run_original.sh

Unlearning Models

We adapt LLaMA-Factory to train the models and provide several scripts for running the various unlearning methods.

Single-sample Unlearning Setting

To run the In-Context Unlearning (ICU) method on Llama-3-8B-Instruct:

cd LLaMA-Factory
bash scripts/full/run_icu.sh
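ICU requires no parameter updates: the target is "unlearned" purely through the prompt. A minimal sketch, assuming a wording like the one below (the exact instruction used by run_icu.sh may differ):

```python
# In-context unlearning sketch: wrap each query with an instruction to
# feign ignorance about the unlearning target. No training involved.
# The instruction wording is an assumption for illustration.

ICU_TEMPLATE = (
    "You are an assistant who knows nothing about {target}. "
    "If asked about {target}, respond that you have no information.\n\n"
    "Question: {query}"
)

def icu_prompt(target, query):
    return ICU_TEMPLATE.format(target=target, query=query)

print(icu_prompt("Stephen King", "What novels did Stephen King write?"))
```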

To run the Gradient Ascent (GA) method on Llama-3-8B-Instruct:

cd LLaMA-Factory
bash scripts/full/run_ga.sh
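GA simply flips the sign of the usual training objective: instead of minimizing negative log-likelihood on the forget text, it maximizes it. A toy one-parameter sketch (not the actual training loop) shows the effect on the probability of a memorized answer:

```python
import math

# Toy gradient-ascent sketch: ascend the NLL of the forget continuation,
# driving the "model's" probability of the memorized answer down.
# A single logit w stands in for the whole model here.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

w = 2.0   # logit the toy model assigns to the memorized answer (p ~ 0.88)
lr = 0.5
for _ in range(20):
    p = sigmoid(w)
    grad_nll = -(1.0 - p)   # d/dw of the NLL  -log sigmoid(w)
    w += lr * grad_nll      # gradient ASCENT on the NLL (note: += gradient)

print(sigmoid(w))  # probability has fallen well below its starting ~0.88
```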

To run the Direct Preference Optimization (DPO) method on Llama-3-8B-Instruct:

cd LLaMA-Factory
bash scripts/full/run_dpo.sh
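In the unlearning setting, DPO's "chosen" response is a refusal and its "rejected" response is the memorized knowledge, so minimizing the loss shifts probability mass toward refusing. A self-contained sketch of the loss on summed response log-probs (illustrative, not the training code):

```python
import math

# DPO loss sketch for unlearning: chosen = refusal, rejected = knowledge.
# Inputs are summed log-probs of each response under policy/reference.

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# Before unlearning the policy matches the reference: loss = -log(0.5).
print(dpo_loss(-10.0, -2.0, -10.0, -2.0))  # ~0.693
# After training, refusal log-prob up / knowledge log-prob down: loss shrinks.
print(dpo_loss(-6.0, -8.0, -10.0, -2.0))   # ~0.313
```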

To run the Negative Preference Optimization (NPO) method on Llama-3-8B-Instruct:

cd LLaMA-Factory
bash scripts/full/run_npo.sh
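NPO only pushes down on the forget response, but tempers the update with a reference model so it saturates rather than diverging the way plain GA can. A sketch of the per-sample loss (illustrative, not the training code):

```python
import math

# NPO loss sketch: -(2/beta) * log sigmoid(-beta * log(pi_theta / pi_ref)),
# evaluated on the forget response's summed log-probs.

def npo_loss(policy_logp, ref_logp, beta=0.1):
    log_ratio = policy_logp - ref_logp
    sigma = 1.0 / (1.0 + math.exp(beta * log_ratio))  # sigmoid(-beta * log_ratio)
    return -(2.0 / beta) * math.log(sigma)

# Policy still matches the reference on the forget text: loss = (2/beta)*log 2.
print(npo_loss(-5.0, -5.0))  # ~13.863
# Forget text already suppressed below the reference: loss decreases.
print(npo_loss(-9.0, -5.0))  # ~10.261
```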

To run the Rejection Tuning (RT) method on Llama-3-8B-Instruct:

cd LLaMA-Factory
bash scripts/full/run_rt.sh

Batch-sample Unlearning Setting

To run the In-Context Unlearning (ICU) method on Llama-3-8B-Instruct:

cd LLaMA-Factory
bash scripts/batch/run_icu.sh

To run the Gradient Ascent (GA) method on Llama-3-8B-Instruct:

cd LLaMA-Factory
bash scripts/batch/run_ga.sh

To run the Direct Preference Optimization (DPO) method on Llama-3-8B-Instruct:

cd LLaMA-Factory
bash scripts/batch/run_dpo.sh

To run the Negative Preference Optimization (NPO) method on Llama-3-8B-Instruct:

cd LLaMA-Factory
bash scripts/batch/run_npo.sh

To run the Rejection Tuning (RT) method on Llama-3-8B-Instruct:

cd LLaMA-Factory
bash scripts/batch/run_rt.sh

LoRA Unlearning Setting

Please set --finetuning_type lora and --lora_target q_proj,v_proj.

Partial-layer Unlearning Setting

Please set --train_layers 0-4.

Experimental Results

Results of the main experiment on LLaMA3-Instruct (8B).

Results of the main experiment on Phi-3 Mini-4K-Instruct (3.8B).

Citation

If you find our codebase and dataset beneficial, please cite our work:

@misc{jin2024rwku,
    title={RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models},
    author={Zhuoran Jin and Pengfei Cao and Chenhao Wang and Zhitao He and Hongbang Yuan and Jiachun Li and Yubo Chen and Kang Liu and Jun Zhao},
    year={2024},
    eprint={2406.10890},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Other Related Projects