LLaMa 2 Fine-tuning data

RLHF data

We chose a binary comparison protocol over other schemes, mainly because it enables us to maximize the diversity of collected prompts
- other strategies are worth considering, which we leave for future work
- human annotators select which of two model outputs they prefer

Annotators...

Write a prompt
Choose between two sampled model responses based on provided criteria
- In order to maximize the diversity, the two responses to a given prompt are sampled from two different model variants and varying the temperature hyper-parameter.
Label the degree to which they prefer their chosen response over the alternative (4 scales)
- significantly better, better, slightly better, or negligibly better/ unsure

helpfulness
- Helpfulness refers to how well Llama 2-Chat responses fulfill users’ requests and provide requested information
safety
- safety refers to whether Llama 2-Chat’s responses are unsafe

In addition to preference, we also annotate "absolute safety" with three categories:

We do not include any examples from category 4, as we believe safer responses will also be better/preferred by humans.

Human annotations were collected every week (by batch)
As we collected more preference data, our reward models improved, and we were able to train progressively better versions for Llama 2-Chat.
- more data -> better model -> better output -> better data
- Llama 2-Chat improvement also shifted the model’s data distribution
It is important to gather new preference data using the latest Llama 2-Chat iterations before starting a new tuning iteration.
- This step helps keep the reward model on-distribution and maintain an accurate reward for the latest model
  Comparisons with other datasets
We collected a large dataset of over 1 million binary comparisons based on humans applying our specified guidelines
Note that the number of tokens in prompts and answers differs depending on the text domain.
- Summarization and online forum data generally have longer prompts
- Dialogue-style prompts are usually shorter
Meta data: more conversation turns and longer on average.

Open-source datasets were used to bootstrap our reward models while we were in the process of collecting preference annotation data.
In our experiments, we do not observe negative transfer from the open-source preference datasets. Thus, we have decided to keep them in our data mixture, as they could enable better generalization for the reward model and prevent reward hacking
Experimented with different mixing recipes for both Helpfulness and Safety reward models and best settings til now:
- Helpfulness reward model:
- All Meta Helpfulness data (50%)
- Remaining data: uniformly sampled from Meta Safety and from the open-source datasets (50%)
- Here is the original text for reference, as there are some ambiguous aspects.
  
  Helpfulness reward model is eventually trained on all Meta Helpfulness data, combined with an equal parts of the remaining data uniformly sampled from Meta Safety and from the open-source datasets.
- Safety reward model:
- All Meta Safety and Anthropic Harmless data (90%)
- Meta Helpfulness and open-source helpfulness data (10%)
- We found that the setting with 10% helpfulness data is especially beneficial for the accuracy on samples where both the chosen and rejected responses were deemed safe. (Category 2 in terms of "absolute safety" above)