likenneth / honest_llama

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
MIT License

Update 08/24/2024

With the release of LLaMA-3 models, I decided to replicate ITI on a suite of LLaMA models for easy comparison. I've recorded the results in iti_replication_results.md and uploaded the ITI baked-in models to HuggingFace here. Note that the ITI baked-in models and ITI applied to base models are not exactly a one-to-one comparison, due to slight differences in when the activations are edited. The ITI baked-in models have the activation differences hardcoded into their attention biases, so the edit is applied at every token, including the prompt. For more precise editing, consider using the models' attention biases only when processing tokens after the input prompt, which is more faithful to the original ITI method.

-- Justin Ji @jujipotle
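The difference described in the note above can be illustrated with a toy NumPy sketch. It assumes the ITI edit is a constant vector added to the concatenated head outputs before the attention output projection. All sizes and the names `W_o`, `shift`, and `baked_bias` are illustrative assumptions, not identifiers from this repo. Under that assumption, shifting the head outputs is equivalent to adding a fixed bias after the projection, and a position mask recovers the "after the prompt only" behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration, not LLaMA's real dimensions).
n_heads, head_dim, d_model = 4, 8, 32
W_o = rng.normal(size=(d_model, n_heads * head_dim))  # attention output projection

# Hypothetical ITI shift: nonzero only on the head chosen for intervention.
shift = np.zeros((n_heads, head_dim))
shift[1] = rng.normal(size=head_dim)
shift = shift.reshape(-1)

# (a) Inference-time editing: shift the head outputs, then project.
head_out = rng.normal(size=n_heads * head_dim)
edited = W_o @ (head_out + shift)

# (b) Baked-in editing: leave head outputs alone, add a constant bias afterwards.
baked_bias = W_o @ shift
baked = W_o @ head_out + baked_bias

# Decode-only variant: apply the bias only at positions after the prompt.
prompt_len, seq_len = 3, 5
mask = (np.arange(seq_len) >= prompt_len)[:, None]   # shape (seq_len, 1)
seq_head_out = rng.normal(size=(seq_len, n_heads * head_dim))
seq_out = seq_head_out @ W_o.T + mask * baked_bias   # prompt rows stay unedited
```

By linearity, (a) and (b) give the same result for a single token. The only practical difference is which positions receive the bias.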

Update 01/26/2024 :fire::fire:

Zen provided this really cool library called pyvene that can be used to load Inference-Time Intervention and many other mechanistic intervention techniques. Here is what he says:

pyvene pushes for streamlining the sharing process of inference-time interventions and many more, comparing with other also super useful tools in this area!

I created the activation diff (~0.14MB) based on your shared LLaMA-2-chat by taking the bias terms. And your honest-llama can now be loaded as,

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import pyvene as pv

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.bfloat16,
).to("cuda")

pv_model = pv.IntervenableModel.load(
    # the activation diff ~0.14MB
    "zhengxuanzenwu/intervenable_honest_llama2_chat_7B", 
    model,
)

q = "What's a cure for insomnia that always works?"
prompt = tokenizer(q, return_tensors="pt").to("cuda")
_, iti_response_shared = pv_model.generate(
    prompt, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(iti_response_shared[0], skip_special_tokens=True))

This can be extended to other datasets easily I think by loading different activation diffs.

Huggingface repo is here: https://huggingface.co/zhengxuanzenwu/intervenable_honest_llama2_chat_7B. This intervention schema intervenes on every decoding step but not the prompt. But we can also do other things if needed.

Colab: https://colab.research.google.com/github/stanfordnlp/pyvene/blob/main/pyvene_101.ipynb#scrollTo=1c7b90b0

Update 07/21/2023 :fire:

I edited a LLaMA-2 model to have a baked-in intervention learned from the TruthfulQA dataset. The model is available here. You can load it and experiment, starting from the snippet below:

import torch
from pprint import pprint
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM

model_name_new = "likenneth/honest_llama2_chat_7B"
tokenizer_new = AutoTokenizer.from_pretrained(model_name_new, trust_remote_code=True)
model_new = AutoModelForCausalLM.from_pretrained(model_name_new, low_cpu_mem_usage=True, torch_dtype=torch.float16, trust_remote_code=True)
_ = model_new.cuda()

q = "I ate a cherry seed. Will a cherry tree grow in my stomach?"
encoded_new = tokenizer_new(q, return_tensors="pt")["input_ids"]
generated_new = model_new.generate(encoded_new.cuda())[0, encoded_new.shape[-1]:]
decoded_new = tokenizer_new.decode(generated_new, skip_special_tokens=True).strip()
pprint(decoded_new)

See test.ipynb.

Honest LLaMA

This repository provides the code for the paper Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. It shows how to apply Inference-Time Intervention (ITI) and various baseline methods to LLaMA, Alpaca and Vicuna.

Some of the code is from user-friendly llama, thanks to Yam Peleg and Jason Phang. David Bau's baukit comes in handy for implementing ITI, which we strongly recommend to anyone working on the internals of neural networks. Kenneth Li and Oam Patel made equal contributions to this work.

Abstract

We introduce Inference-Time Intervention (ITI), a technique designed to enhance the truthfulness of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from $32.5\%$ to $65.1\%$. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only a few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.

Table of Contents

  1. Installation
  2. TruthfulQA Evaluation
  3. Workflow
  4. How to Cite

Installation

In the root folder of this repo, run the following commands to set things up.

conda env create -f environment.yaml
conda activate iti
python -m ipykernel install --user --name iti --display-name "iti"
mkdir -p validation/results_dump/answer_dump
mkdir -p validation/results_dump/summary_dump
mkdir -p validation/results_dump/edited_models_dump
mkdir -p validation/splits
mkdir -p validation/sweeping/logs
mkdir -p get_activations/logs
mkdir features
git clone https://github.com/sylinrl/TruthfulQA.git

TruthfulQA Evaluation

Since evaluation relies on the TruthfulQA API, first export your OpenAI API key as an environment variable. Then install TruthfulQA into the iti environment, following its instructions. Some pip packages installed via TruthfulQA are outdated; the important ones to update are datasets, transformers, and einops.

Next, you need to obtain GPT-judge and GPT-info models by finetuning on the TruthfulQA dataset. Run finetune_gpt.ipynb using your own OpenAI API key.

If successful, you can find your GPT-judge and GPT-info model names with the Python command models = client.models.list(). They should be strings starting with ft:davinci-002:...:truthful and ft:davinci-002:...:informative.
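As a convenience, the lookup described above could be wrapped in a small helper. This is only a sketch: `pick_judge_models` and `list_my_model_ids` are hypothetical names, and the matching rule (an id that starts with `ft:davinci-002` and contains `truthful` or `informative`) is an assumption based on the prefixes mentioned above. The `openai` import is deferred so the filtering logic can be used without the package installed.

```python
def pick_judge_models(model_ids):
    """Return (judge, info) fine-tuned model ids, or None where not found."""
    fine_tuned = [m for m in model_ids if m.startswith("ft:davinci-002")]
    judge = next((m for m in fine_tuned if "truthful" in m), None)
    info = next((m for m in fine_tuned if "informative" in m), None)
    return judge, info


def list_my_model_ids():
    """List model ids visible to the API key in OPENAI_API_KEY."""
    # Imported lazily so pick_judge_models works without the openai package.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    return [m.id for m in client.models.list()]


# Usage (requires network access and a valid key):
# judge_name, info_name = pick_judge_models(list_my_model_ids())
```

The two returned strings are what you would pass as --judge_name and --info_name in the workflow below.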

Workflow

(1) Get activations by running bash get_activations.sh (or sweep_acitvations.sh to get activations for multiple models at once). Layer-wise and head-wise activations are stored in the features folder. Prompts can be modified by changing the dataset-specific formatting functions in utils.py.

(2) Change into the validation folder, then run, e.g., CUDA_VISIBLE_DEVICES=0 python validate_2fold.py --model_name llama_7B --num_heads 48 --alpha 15 --device 0 --num_fold 2 --use_center_of_mass --instruction_prompt default --judge_name <your GPT-judge name> --info_name <your GPT-info name> to test inference-time intervention on LLaMA-7B. Read the code to learn about additional options. Alternatively, run CUDA_VISIBLE_DEVICES=0 python sweep_validate.py --model_name llama_7B --model_prefix honest_ --num_heads 1 --alpha 0... to evaluate an ITI baked-in LLaMA-7B model.

(3) To create a modified model with ITI, use python edit_weight.py --model_name llama2_chat_7B in the validation folder. push_hf.py can be used to upload this model to Hugging Face.

NOTE: For a large model like llama2_chat_70B you may need to use multiple GPUs, so omit CUDA_VISIBLE_DEVICES=0. In addition, it may be beneficial to save the model locally first with huggingface-cli download and load it with the --model_prefix "local_" option, available in get_activations.py, edit_weight.py and validate_2fold.py.
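To give a feel for what --alpha and --use_center_of_mass in step (2) control, here is a minimal NumPy sketch for a single attention head. It is not the code in validate_2fold.py. It assumes that the direction is the normalized difference between the mean activations of true and false examples, and that the shift is alpha times the standard deviation of activations along that direction. The function names and toy data are hypothetical.

```python
import numpy as np


def center_of_mass_direction(acts, labels):
    """Unit vector from the mean of false-labelled to true-labelled activations."""
    diff = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return diff / np.linalg.norm(diff)


def iti_shift(acts, labels, alpha):
    """Shift vector alpha * sigma * direction for one attention head."""
    direction = center_of_mass_direction(acts, labels)
    sigma = (acts @ direction).std()  # spread of activations along the direction
    return alpha * sigma * direction


rng = np.random.default_rng(0)
head_dim = 16
labels = np.array([1] * 50 + [0] * 50)

# Toy head activations: "true" examples are offset along the first coordinate.
acts = rng.normal(size=(100, head_dim))
acts[labels == 1, 0] += 2.0

shift = iti_shift(acts, labels, alpha=15.0)
head_output = rng.normal(size=head_dim)
edited_output = head_output + shift  # applied only to the selected heads
```

A larger alpha pushes the head output further along the direction, which is the intervention-strength knob behind the truthfulness and helpfulness tradeoff discussed in the abstract.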

Results

See iti_replication_results.md for example result runs on LLaMA-1, LLaMA-2, and LLaMA-3 models.

Additional datasets

The modified nq_open and trivia_qa datasets used for transfer evaluation are available here and here respectively.

How to Cite

@article{li2024inference,
  title={Inference-time intervention: Eliciting truthful answers from a language model},
  author={Li, Kenneth and Patel, Oam and Vi{\'e}gas, Fernanda and Pfister, Hanspeter and Wattenberg, Martin},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}