magpie-ultra - a synthetic dataset for supervised fine-tuning using Llama 3.1 #870

Open ShellLM opened 1 month ago

Dataset Card for magpie-ultra-v0.1

Dataset Summary

magpie-ultra is a synthetically generated dataset for supervised fine-tuning using the new Llama 3.1 405B-Instruct model, together with other Llama models like Llama-Guard-3-8B and Meta-Llama-3.1-8B-Instruct.

The dataset contains challenging instructions and responses for a wide variety of tasks, such as Coding & Debugging, Math, Data analysis, Creative writing, Advice seeking, or Brainstorming.

Explore the dataset in Argilla.

Magpie Pipeline

As the name of the dataset indicates, we used the Magpie recipe to generate the instruction-response pairs.

The main differences with respect to the original Magpie release are that we used the new Llama 3.1 family of models, and that we generated substantially fewer instruction-response pairs for this first iteration: 50K vs 1M rows. The Magpie pipeline can be summarised as follows:

  1. Using meta-llama/Meta-Llama-3.1-405B-Instruct-FP8, we generate an instruction as described in the Magpie paper: we send the pre-query template <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n to the model. Because the LLM is autoregressive and has been fine-tuned on an SFT dataset, it generates a user instruction until it emits the <|eot_id|> token. After that, we send the generated instruction back to the LLM to get a response.
  2. Using the base model meta-llama/Meta-Llama-3.1-405B-FP8, we generate another response for the generated instruction. Later, we assign a score to the responses given by the instruct and base models with RLHFlow/ArmoRM-Llama3-8B-v0.1. If the score of the instruct model minus the score of the base model is positive, we can consider the response generated by the instruct model to be of higher quality.
  3. Using meta-llama/Meta-Llama-3.1-8B-Instruct, we assess the quality and the difficulty of the generated instructions, and we classify them into one or more of the following categories: Information seeking, Reasoning, Planning, Editing, Coding & Debugging, Math, Data analysis, Creative writing, Advice seeking, Brainstorming or Others. To ensure that the model outputs were valid JSON that we could easily parse, we used the structured output generation feature of distilabel.
  4. Using meta-llama/Llama-Guard-3-8B, we classified the generated instruction-response pairs as "safe" or "unsafe", also providing the hazard category from the MLCommons AI Safety taxonomy.
  5. Finally, using Alibaba-NLP/gte-large-en-v1.5 and Faiss, we generated embeddings for all the instructions and computed each instruction's nearest neighbour to ensure instruction diversity in the final dataset.
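Step 1 can be sketched in a few lines of Python. This is only an illustration of the pre-query trick, not the pipeline code used to build the dataset: `generate` is a hypothetical callable that takes a raw prompt string and returns the raw completion, and the assistant header used for the response prompt is assumed to follow the standard Llama 3.1 chat format.

```python
# Llama 3.1 special tokens, as quoted in step 1 above.
PRE_QUERY_TEMPLATE = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
EOT_TOKEN = "<|eot_id|>"


def extract_turn(raw_completion: str) -> str:
    """Keep only the text generated before the first end-of-turn token."""
    text, _, _ = raw_completion.partition(EOT_TOKEN)
    return text.strip()


def build_response_prompt(instruction: str) -> str:
    """Close the user turn and open an assistant turn (assumed chat format)."""
    return (
        f"{PRE_QUERY_TEMPLATE}{instruction}{EOT_TOKEN}"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


def magpie_pair(generate) -> dict:
    """Build one instruction-response pair.

    `generate` is a hypothetical stand-in for the inference backend:
    it receives a prompt string and returns the raw completion.
    """
    # The model "completes" the empty user turn, which yields an instruction.
    instruction = extract_turn(generate(PRE_QUERY_TEMPLATE))
    # The instruction is then sent back to obtain the response.
    response = extract_turn(generate(build_response_prompt(instruction)))
    return {"instruction": instruction, "response": response}
```

Any backend that accepts a raw, untemplated prompt could play the role of `generate`; the important detail is that no chat template is applied on top of the pre-query string.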

The dataset was generated using a single 8xH100 machine.

Dataset columns

The examples have the following structure per configuration:

| Column | Description |
| --- | --- |
| model_name_response_base | the name of the base model used to generate the response. |
| instruction | the instruction generated by the instruct model from the Magpie pre-query template. |
| response | the generated response for the instruction using the instruct model (Llama 3.1 405B Instruct FP8). |
| response_base | the generated response for the instruction using the base model (Llama 3.1 405B FP8). |
| intent | the intent of the user query or instruction column (generated with Llama 3.1 8B Instruct). |
| knowledge | the required knowledge to generate a response for the instruction column (generated with Llama 3.1 8B Instruct). |
| difficulty | the difficulty of the generated instruction (generated with Llama 3.1 8B Instruct). It can be very easy, easy, medium, hard or very hard. |
| model_name_difficulty | the name of the model used to generate the intent, knowledge and difficulty columns. |
| explanation | an assessment highlighting the strengths and/or weaknesses of the instruction (generated with Llama 3.1 8B Instruct). |
| quality | the quality of the generated instruction (generated with Llama 3.1 8B Instruct). It can be very poor, poor, average, good or excellent. |
| model_name_quality | the name of the model used to generate the explanation and quality columns. |
| primary_tag | the category of the instruction (generated with Llama 3.1 8B Instruct). It can be Information seeking, Reasoning, Planning, Editing, Coding & Debugging, Math, Data analysis, Creative writing, Advice seeking, Brainstorming or Others. |
| other_tags | other categories of the instruction (generated with Llama 3.1 8B Instruct). They can be Information seeking, Reasoning, Planning, Editing, Coding & Debugging, Math, Data analysis, Creative writing, Advice seeking, Brainstorming or Others. |
| model_name_classification | the name of the model used to assign a category to the instruction. |
| embedding | the sentence embedding generated for the instruction (generated with Alibaba NLP gte-large-en-v1.5). |
| model_name_embeddings | the name of the model used to generate the sentence embeddings. |
| score | the score given by the reward model (RLHFlow/ArmoRM-Llama3-8B-v0.1) for the column response. |
| score_base | the score given by the reward model (RLHFlow/ArmoRM-Llama3-8B-v0.1) for the column response_base. |
| distilabel_metadata | distilabel framework metadata containing information about the row. |
| nn_indices | the indices of the K nearest neighbours (K = 1). |
| nn_scores | the scores (cosine similarity) of the K nearest neighbours (K = 1). |
| guard | the raw response given by the model used to check the safety of the instruction-response pair (generated with Llama Guard 3 8B). |
| safe | whether the instruction-response pair is safe or not. |
| hazard_category | the hazard category from the MLCommons AI Safety taxonomy assigned by the guard model. |
| score_difference | the difference between score and score_base (score minus score_base). |
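
To make the nn_indices and nn_scores columns concrete, the following is a minimal sketch of a K = 1 nearest-neighbour search with cosine similarity. It is a brute-force, pure-Python stand-in for the Faiss index used in the pipeline, it assumes non-zero embedding vectors, and the `threshold` parameter of `near_duplicates` is a hypothetical cutoff that does not come from the dataset card.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nearest_neighbours(embeddings):
    """For each row, return (index, cosine similarity) of its closest other row."""
    unit = [normalize(v) for v in embeddings]
    results = []
    for i, a in enumerate(unit):
        best_idx, best_score = -1, -2.0  # cosine similarity is always >= -1
        for j, b in enumerate(unit):
            if i == j:
                continue  # a row is not its own neighbour
            # On unit vectors the dot product equals the cosine similarity.
            score = sum(x * y for x, y in zip(a, b))
            if score > best_score:
                best_idx, best_score = j, score
        results.append((best_idx, best_score))
    return results


def near_duplicates(embeddings, threshold=0.95):
    """Rows whose nearest neighbour is at least `threshold` similar (hypothetical cutoff)."""
    return [
        i
        for i, (_, score) in enumerate(nearest_neighbours(embeddings))
        if score >= threshold
    ]
```

A high nn_scores value flags an instruction that is close to another one in the dataset, which is the signal one would filter on to keep the instructions diverse.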

The instruction and response columns can be used for SFT. Depending on the value of score_difference, one can build a chosen/rejected pair that can be used for DPO: if score_difference is positive, we can select response as chosen and response_base as rejected; otherwise, the other way around.
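
The chosen/rejected rule can be sketched as follows. The column names come from the table above, while the output keys `prompt`, `chosen` and `rejected` are an assumed convention that is common in preference-tuning tooling, not something the dataset card prescribes. Skipping rows with a score difference of exactly zero is also a choice made for this sketch.

```python
def to_dpo_pair(row: dict) -> dict:
    """Pick chosen/rejected from response and response_base by score_difference."""
    if row["score_difference"] > 0:
        # The instruct model scored higher than the base model.
        chosen, rejected = row["response"], row["response_base"]
    else:
        chosen, rejected = row["response_base"], row["response"]
    return {"prompt": row["instruction"], "chosen": chosen, "rejected": rejected}


def build_dpo_pairs(rows):
    """Convert dataset rows to preference pairs, skipping exact ties."""
    return [to_dpo_pair(row) for row in rows if row["score_difference"] != 0]
```

For example, a row with score_difference = 0.5 yields response as chosen, while a row with score_difference = -0.2 yields response_base as chosen.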

Limitations

Suggested labels

{'label-name': 'instruction-response', 'label-description': 'A dataset containing instruction-response pairs for various tasks generated using LLMs.', 'gh-repo': 'argilla/magpie-ultra-v0.1', 'confidence': 62.25}
