kibitzing / awesome-llm-data

A repository of information about data used in training large language models (LLMs)

LLaMa 2 Fine-tuning data #2

Open kibitzing opened 5 months ago

kibitzing commented 5 months ago

SFT data

  1. Started the SFT stage with publicly available instruction tuning data (Chung et al., 2022)
  2. Fewer but higher-quality examples > millions of lower-quality examples (see the sketch below)

By setting aside millions of examples from third-party datasets and using fewer but higher-quality examples from our own vendor-based annotation efforts, our results notably improved.

We found that SFT annotations in the order of tens of thousands was enough to achieve a high-quality result. (stopped annotating SFT after 27,540 annotations)

  1. Note that we do not include any Meta user data.
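To make the "fewer but higher-quality" idea concrete, here is a minimal Python sketch of that kind of selection step. The `SFTExample` fields, the `quality` score, and the 0.9 threshold are all hypothetical illustrations (the paper reports the quality-over-quantity finding, not a scoring pipeline); only the ~27,540-example budget comes from the text above.

```python
from dataclasses import dataclass

@dataclass
class SFTExample:
    prompt: str
    response: str
    source: str       # e.g. "vendor" or "third_party" (hypothetical field)
    quality: float    # reviewer-assigned score in [0, 1] (hypothetical field)

def select_sft_data(examples: list[SFTExample],
                    min_quality: float = 0.9,
                    budget: int = 27_540) -> list[SFTExample]:
    """Prefer a small, high-quality, vendor-annotated subset over
    millions of lower-quality third-party examples."""
    vendor = [ex for ex in examples
              if ex.source == "vendor" and ex.quality >= min_quality]
    # Keep the best examples up to the annotation budget (tens of thousands).
    vendor.sort(key=lambda ex: ex.quality, reverse=True)
    return vendor[:budget]
```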

SFT data quality check

kibitzing commented 5 months ago

RLHF data

Reward modeling

Annotation procedure

Annotators...

  1. Write a prompt
  2. Choose between two sampled model responses based on provided criteria
    • To maximize diversity, the two responses to a given prompt are sampled from two different model variants, with the temperature hyper-parameter also varied (see the sketch after this list).
  3. Label the degree to which they prefer their chosen response over the alternative (a 4-point scale)
    • significantly better, better, slightly better, or negligibly better / unsure
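A rough sketch of step 2, assuming a placeholder `generate(model, prompt, temperature)` function and arbitrary temperature values; the only point it illustrates is that the two candidates come from two different model variants sampled at their own temperatures.

```python
import random

# Placeholder for whatever sampling API the chat models expose (hypothetical).
def generate(model, prompt: str, temperature: float) -> str:
    raise NotImplementedError

# The 4-point preference scale from step 3.
PREFERENCE_DEGREES = ("significantly better", "better",
                      "slightly better", "negligibly better / unsure")

def sample_comparison(prompt: str, model_variants: list,
                      temperatures=(0.7, 1.0)) -> dict:
    """Draw the two candidate responses from two *different* model variants,
    each with its own sampling temperature, to maximize diversity."""
    variant_a, variant_b = random.sample(model_variants, k=2)
    temp_a = random.choice(temperatures)
    temp_b = random.choice(temperatures)
    return {
        "prompt": prompt,
        "response_a": generate(variant_a, prompt, temp_a),
        "response_b": generate(variant_b, prompt, temp_b),
    }
```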

Two criteria: helpfulness and safety

In addition to the preference, we also annotate "absolute safety" with three categories:

  1. the preferred response is safe and the other response is not (18%)
  2. both responses are safe (47%)
  3. both responses are unsafe (35%)

The fourth possible combination, where the preferred response is unsafe and the other response is safe, does not occur (0%): we do not include any such examples, as we believe safer responses will also be better/preferred by humans.
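As a small illustration, here is one way to encode the three "absolute safety" categories and drop the excluded fourth case. The dictionary keys are hypothetical; the percentages in the comments are the proportions reported above.

```python
from typing import Optional

def safety_category(chosen_is_safe: bool, rejected_is_safe: bool) -> Optional[int]:
    """Map a labeled comparison onto the three "absolute safety" categories;
    return None for the excluded case (chosen unsafe, rejected safe)."""
    if chosen_is_safe and not rejected_is_safe:
        return 1  # preferred response safe, the other unsafe (18%)
    if chosen_is_safe and rejected_is_safe:
        return 2  # both responses safe (47%)
    if not chosen_is_safe and not rejected_is_safe:
        return 3  # both responses unsafe (35%)
    return None   # chosen unsafe but rejected safe: excluded (0%)

def filter_safety_comparisons(comparisons: list[dict]) -> list[dict]:
    """Drop comparisons that fall into the excluded fourth case."""
    kept = []
    for c in comparisons:
        cat = safety_category(c["chosen_is_safe"], c["rejected_is_safe"])
        if cat is not None:
            kept.append({**c, "safety_category": cat})
    return kept
```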

Human annotation collection process

Data Composition

  1. Open-source datasets were used to bootstrap our reward models while we were in the process of collecting preference annotation data.
  2. In our experiments, we do not observe negative transfer from the open-source preference datasets. Thus, we have decided to keep them in our data mixture, as they could enable better generalization for the reward model and prevent reward hacking.
  3. Experimented with different mixing recipes for both the Helpfulness and Safety reward models; the best settings found so far are below (see also the sketch after this list):
    • Helpfulness reward model:
      • All Meta Helpfulness data (50%)
      • Remaining data: uniformly sampled from Meta Safety and from the open-source datasets (50%)
      • Here is the original text for reference, as there are some ambiguous aspects:

        Helpfulness reward model is eventually trained on all Meta Helpfulness data, combined with an equal parts of the remaining data uniformly sampled from Meta Safety and from the open-source datasets.

    • Safety reward model:
      • All Meta Safety and Anthropic Harmless data (90%)
      • Meta Helpfulness and open-source helpfulness data (10%)
      • We found that the setting with 10% helpfulness data is especially beneficial for the accuracy on samples where both the chosen and rejected responses were deemed safe (category 2 in terms of "absolute safety" above).
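Putting the two recipes together, here is a minimal sketch of one possible reading of the mixing description. Since the quoted sentence is ambiguous, this is an interpretation rather than the authors' exact procedure; each argument stands for a list of preference examples, and the function names are hypothetical.

```python
import random

def helpfulness_mixture(meta_help, meta_safety, open_source, seed=0):
    """All Meta Helpfulness data plus an equal-sized remainder sampled
    uniformly from Meta Safety and the open-source preference datasets
    (one reading of the quoted sentence: a 50/50 split)."""
    rng = random.Random(seed)
    pool = meta_safety + open_source
    remainder = rng.sample(pool, k=min(len(meta_help), len(pool)))
    return meta_help + remainder

def safety_mixture(meta_safety, anthropic_harmless, meta_help, open_source_help, seed=0):
    """Meta Safety + Anthropic Harmless make up 90% of the mixture; Meta and
    open-source helpfulness data fill the remaining 10%."""
    rng = random.Random(seed)
    safety_part = meta_safety + anthropic_harmless
    n_help = round(len(safety_part) / 0.9 * 0.1)  # 10% of the final mixture size
    help_pool = meta_help + open_source_help
    help_part = rng.sample(help_pool, k=min(n_help, len(help_pool)))
    return safety_part + help_part
```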