This repository contains the official code and released models for the paper Self-Play Preference Optimization for Language Model Alignment.
Authors: Yue Wu*, Zhiqing Sun*, Huizhuo Yuan*, Kaixuan Ji, Yiming Yang, Quanquan Gu
We propose a new self-play framework dubbed SPPO for language model alignment and a new learning objective (called SPPO loss) derived from the self-play framework to fine-tune large language models efficiently.
AlpacaEval 2.0 leaderboard results: normal and length-controlled (LC) win rates in percent (%). Mistral-7B-SPPO can outperform larger models, and Mistral-7B-SPPO (best-of-16) can outperform proprietary models such as GPT-4 (6/13). Llama-3-8B-SPPO exhibits even better performance.
SPPO can significantly enhance the performance of an LLM without strong external signals such as responses or preferences from GPT-4. It can outperform models trained with iterative direct preference optimization (DPO), among other methods. SPPO is theoretically grounded: the LLM can converge to the von Neumann winner (i.e., the Nash equilibrium) under general, potentially intransitive preferences. It is also empirically validated through extensive evaluations on multiple datasets.
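As a rough illustration of the SPPO loss mentioned above, the sketch below follows the squared-error form described in the paper: for a prompt x and a sampled response y, the log-ratio between the new policy and the current policy is pushed toward eta * (P(y beats the current policy | x) - 1/2). The function names, arguments, and the value of eta are illustrative assumptions and do not mirror the training code in this repository.

```python
def estimate_win_prob(pref_row):
    """Average the probabilities that one response beats each sample
    drawn from the current policy (hypothetical helper)."""
    return sum(pref_row) / len(pref_row)


def sppo_loss(logp_new, logp_old, win_prob, eta=1.0):
    """Squared-error objective for a single (prompt, response) pair.

    logp_new: log-probability of the response under the policy being trained
    logp_old: log-probability of the response under the current policy
    win_prob: estimated probability that the response beats the current policy
    eta:      step-size-like constant (the value here is an assumption)
    """
    log_ratio = logp_new - logp_old
    target = eta * (win_prob - 0.5)
    return (log_ratio - target) ** 2


if __name__ == "__main__":
    # Made-up preference probabilities of one response against 5 samples.
    p = estimate_win_prob([0.5, 0.9, 0.8, 0.7, 0.6])
    print(p, sppo_loss(logp_new=-10.0, logp_old=-10.2, win_prob=p))
```

A response with an estimated win probability above 1/2 gets a positive target, so minimizing the loss raises its likelihood relative to the current policy; a response below 1/2 is pushed down.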
For more details, see our paper.
Model | AlpacaEval 2.0 LC Win Rate (%) | AlpacaEval 2.0 Win Rate (%) |
---|---|---|
Mistral-7B-Instruct-v0.2 | 17.11 | 14.72 |
Mistral-7B-SPPO Iter1 | 24.79 | 23.51 |
Mistral-7B-SPPO Iter2 | 26.89 | 27.62 |
Mistral-7B-SPPO Iter3 | 28.53 | 31.02 |
Llama-3-8B-Instruct | 22.92 | 22.57 |
Llama-3-8B-SPPO Iter1 | 31.73 | 31.74 |
Llama-3-8B-SPPO Iter2 | 35.15 | 35.98 |
Llama-3-8B-SPPO Iter3 | 38.77 | 39.85 |
Gemma-2-9B-It | 45.08 | 35.62 |
Gemma-2-9B-SPPO Iter1 | 48.70 | 40.76 |
Gemma-2-9B-SPPO Iter2 | 50.93 | 44.64 |
Gemma-2-9B-SPPO Iter3 | 53.27 | 47.74 |
Our training code is based on the alignment-handbook codebase. We use vllm for generation and PairRM for ranking. Follow the steps below to set up your environment:
Create a Virtual Environment:
conda create -n sppo python=3.10
conda activate sppo
Install vllm for Generation:
pip install vllm
Install PairRM:
git clone https://github.com/yuchenlin/LLM-Blender.git
cd LLM-Blender
pip install -e .
Download and Install Training Dependencies:
git clone https://github.com/uclaml/SPPO.git
cd SPPO
pip install -e .
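After the installation steps, a quick check that the generation and ranking packages are visible to the active environment can save a failed run later. The snippet below is a minimal sketch that assumes the import names are vllm and llm_blender; the helper name missing_modules is hypothetical and not part of this repository.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be located on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for vllm and LLM-Blender (PairRM).
    missing = missing_modules(["vllm", "llm_blender"])
    print("missing:", missing if missing else "none")
```

An empty result only shows that the packages can be found; it does not verify GPU or CUDA compatibility.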
Execute the training scripts based on the base model you choose:
For Mistral-7B-Instruct-v0.2:
bash run_sppo_mistral.sh
For Llama-3-8B-Instruct:
bash run_sppo_llama-3.sh
These scripts manage the training iterations, generation, and PairRM ranking. Note that some scripts may attempt to push datasets to the Hugging Face Hub under the UCLA-AGI organization. Make sure you have write access, change the organization name, or comment out any push_to_hub commands. Detailed scripts for each component are listed as follows:
Generation:
python scripts/generate.py --model $MODEL --maxlen 2048 --output_dir $OUTPUT_DIR --prompts $PROMPTS
Main parameters:
model: Specifies the model used for generation. In the first iteration, the model should be either mistralai/Mistral-7B-Instruct-v0.2 or meta-llama/Meta-Llama-3-8B-Instruct.
maxlen: Sets the maximum number of tokens to generate.
pairs: Sets the number of samples generated per prompt (default: 5). Changing this number is not supported by the overall pipeline.
output_dir: Specifies the directory where intermediate results are saved.
prompts: Defines the set of prompts used for generation.
frac_len: Enables running vllm on multiple GPUs by dividing the prompts into fractions; frac_len is the number of prompts in each fraction. For usage examples, see generate.sh.
data_frac: Used together with frac_len in multi-GPU setups; data_frac indicates which fraction of the data the current GPU is processing. Refer to generate.sh for more details.
Ranking:
python scripts/rank.py --output_dir $OUTPUT_DIR --prompts $PROMPTS
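rank.py accepts the same frac_len and data_frac options described for generation. The sharding they describe can be sketched as follows, assuming 0-indexed fractions taken as contiguous slices and assuming that a non-positive frac_len means no sharding; the function name select_fraction is hypothetical, and the actual scripts may slice differently.

```python
def select_fraction(prompts, frac_len, data_frac):
    """Return the slice of prompts handled by one GPU.

    frac_len:  number of prompts in each fraction
    data_frac: index of the fraction assigned to the current GPU (0-indexed)
    """
    if frac_len <= 0:
        # Assumption: a non-positive frac_len disables sharding.
        return list(prompts)
    start = data_frac * frac_len
    return list(prompts[start:start + frac_len])


if __name__ == "__main__":
    prompts = [f"prompt-{i}" for i in range(10)]
    # With frac_len=4, three fractions cover the 10 prompts.
    for data_frac in range(3):
        print(data_frac, select_fraction(prompts, 4, data_frac))
```

Under these assumptions, the last fraction may be shorter than frac_len when the number of prompts is not a multiple of it.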
Main Parameters:
output_dir: Specifies the directory where intermediate results are saved. Note that the default script attempts to push datasets to Hugging Face under the UCLA-AGI organization. You may need to change this to your own organization, obtain write access for UCLA-AGI, or disable the push_to_hub command.
pairs: Sets the number of generated samples per prompt (default: 5). Other values are not supported by the overall pipeline.
frac_len: Enables running PairRM on multiple GPUs by dividing the prompts into fractions; frac_len is the number of prompts in each fraction. For usage examples, refer to generate.sh.
data_frac: Like frac_len, this option is used for running PairRM on multiple GPUs. It specifies which fraction of the data the current GPU is processing. See generate.sh for examples.
prompts: Defines the set of prompts used for generation.
gpu: Indicates the GPU index used for ranking; it should match the data_frac parameter.
Training:
bash scripts/pipeline.sh --model $MODEL --iter $ITER --dataset $DATASET --output_dir $OUTPUT_DIR --num 1
Main Parameters:
We adhere to the established guidelines for evaluation and utilize the following repositories:
We provide the model configurations used for AlpacaEval 2 in the models_configs directory. Please note that after the initial release of our model, we retrained it using a slightly modified prompt. The win rates observed after retraining are comparable to the original results.
For questions related to the paper, please contact the authors via email. If you encounter any issues with the code or wish to report a bug, feel free to open an issue on our GitHub repository.
@article{wu2024self,
title={Self-play preference optimization for language model alignment},
author={Wu, Yue and Sun, Zhiqing and Yuan, Huizhuo and Ji, Kaixuan and Yang, Yiming and Gu, Quanquan},
year={2024}
}
We thank the authors of The Alignment Handbook for their foundational contributions to the training code. We also acknowledge the use of PairRM for ranking and vllm for generation.