likenneth / honest_llama

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
MIT License

Cleaned up replication of ITI on Llama3 #38

Closed jujipotle closed 3 months ago

jujipotle commented 3 months ago

A couple of things to note:

  1. I updated README.md with newer instructions for finetuning davinci via the OpenAI API, since the old instructions for curie were deprecated. I also created a new notebook, finetune_gpt.ipynb, that streamlines the finetuning process.
  2. I updated environment.yaml with newer packages so that ITI can run on H100s. I verified that validate_2fold.py works with the new environment, but I haven't extensively tested every script against it.
  3. The model architecture of baffo32/decapoda-research-llama-7B-hf differs from the meta-llama architectures, so I extract attention activations from different locations depending on which model is being run. I'm not 100% sure this is the most effective location to extract from.
  4. I made "instruction_prompt" a configurable hyperparameter, since the default prompt (from Ouyang et al., 2022) reduces the model's informativeness. The llama3_tuning.md results were obtained with a modified instruction prompt.
  5. I created a validation/sweeping directory with bash scripts to make hyperparameter sweeping easier, censoring sensitive information such as API keys and SLURM account info.
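
The architecture-dependent extraction in point 3 can be done with PyTorch forward hooks. Below is a minimal sketch of that approach using a toy module in place of the real attention blocks; the module and attribute names are illustrative stand-ins, not the PR's actual code, and the real attribute paths differ between the decapoda and meta-llama checkpoints:

```python
import torch
import torch.nn as nn

# Toy stand-in for an attention block. On the real models the hook would be
# registered on the per-layer attention submodule, whose attribute path
# depends on the architecture (decapoda vs. meta-llama naming).
class ToyAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(x)

activations = {}

def make_hook(name):
    # Forward hook: record this module's output every time it runs.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model = nn.Sequential(ToyAttention(8), ToyAttention(8))
for i, layer in enumerate(model):
    layer.register_forward_hook(make_hook(f"layer_{i}"))

x = torch.randn(2, 8)
model(x)  # after this call, activations holds one tensor per hooked layer
```

The same pattern works regardless of where the attention module lives in the model tree: only the attribute path passed to the hook registration changes per architecture, which is what the PR's branching logic selects.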