magpie-ultra is a synthetically generated dataset for supervised fine-tuning using the new Llama 3.1 405B-Instruct model, together with other Llama models like Llama-Guard-3-8B and Meta-Llama-3.1-8B-Instruct.
The dataset contains challenging instructions and responses for a wide variety of tasks, such as Coding & Debugging, Math, Data analysis, Creative writing, Advice seeking, or Brainstorming.
Explore the dataset in Argilla.
Magpie Pipeline
As the name of the dataset indicates, we used the Magpie recipe to generate the instruction-response pairs.
The main differences with respect to the original Magpie release are that we used the new Llama 3.1 family of models, and that we generated substantially fewer instruction-response pairs for this first iteration: 50K vs 1M rows. The Magpie pipeline can be summarised as follows:
Using meta-llama/Meta-Llama-3.1-405B-Instruct-FP8, we generate an instruction as described in the Magpie paper: we send the pre-query template <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n to the model and, thanks to the autoregressive capabilities of the LLM and its having been fine-tuned on an SFT dataset, it generates a user instruction until it emits the <|eot_id|> token. After that, we send the generated instruction back to the LLM to get a response.
Using the base model (Llama 3.1 405B FP8), we generate another response for the generated instruction. We then score the responses given by the instruct and base models with RLHFlow/ArmoRM-Llama3-8B-v0.1. If the score of the instruct model minus the score of the base model is positive, we consider the response generated by the instruct model to be of higher quality.
Using meta-llama/Meta-Llama-3.1-8B-Instruct, we assess the quality and the difficulty of the generated instructions, and we classify them into one or more of the following categories: Information seeking, Reasoning, Planning, Editing, Coding & Debugging, Math, Data analysis, Creative writing, Advice seeking, Brainstorming or Others. To ensure that the outputs of the model were valid JSON that we could easily parse, we used the structured output generation feature of distilabel.
Using meta-llama/Llama-Guard-3-8B, we classified the generated instruction-response pairs as "safe" or "unsafe", also providing the hazard category from the MLCommons AI Safety taxonomy.
Finally, using Alibaba-NLP/gte-large-en-v1.5 and Faiss, we generated embeddings for all the instructions and computed the nearest neighbour of each one to ensure instruction diversity in the final dataset.
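The first step above can be sketched in a few lines of Python. This is only an illustration of the two-call idea, not the code used to build the dataset: `complete` is a hypothetical callable that takes a raw prompt and returns the model's continuation up to (and excluding) the `<|eot_id|>` token, and `fake_complete` is a stand-in so that the sketch runs without a model.

```python
from typing import Callable

# Llama 3 chat-format pieces. The pre-query template stops right where a user
# message would start.
PRE_QUERY_TEMPLATE = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
EOT = "<|eot_id|>"
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"


def magpie_pair(complete: Callable[[str], str]) -> dict:
    """Generate one instruction-response pair with two completion calls.

    `complete` is a hypothetical callable (an assumption of this sketch): it
    receives a raw prompt string and returns the continuation, stopping at
    the <|eot_id|> token.
    """
    # Call 1: the model continues the empty user turn, i.e. writes an instruction.
    instruction = complete(PRE_QUERY_TEMPLATE).strip()

    # Call 2: close the user turn, open the assistant turn, and complete again.
    response_prompt = PRE_QUERY_TEMPLATE + instruction + EOT + ASSISTANT_HEADER
    response = complete(response_prompt).strip()
    return {"instruction": instruction, "response": response}


def fake_complete(prompt: str) -> str:
    # Stand-in for a real model call, only here to make the sketch runnable.
    if prompt.endswith(ASSISTANT_HEADER):
        return "4"
    return "What is 2 + 2?"


pair = magpie_pair(fake_complete)
print(pair)
```

In a real run, `complete` would wrap whatever inference backend serves the instruct model, with `<|eot_id|>` configured as a stop token.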
The dataset was generated using a single 8xH100 machine:
Generating the instruction-response pairs took ~60 hours.
Generating the responses with the base model took ~27 hours.
Computing the embeddings, assessing the quality and difficulty, classifying the instructions into categories, and classifying the instructions into safe or unsafe took ~24 hours.
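The nearest-neighbour step of the pipeline can be illustrated with a small brute-force search. The dataset itself was built with Faiss. The pure-Python version below is a swapped-in stand-in that shows the same idea (one nearest neighbour per instruction, scored with cosine similarity) on toy two-dimensional vectors, and the function names are assumptions made for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(embeddings):
    """Return, for each vector, (index, similarity) of its closest other vector."""
    results = []
    for i, query in enumerate(embeddings):
        best_idx, best_score = -1, -math.inf
        for j, candidate in enumerate(embeddings):
            if i == j:
                continue  # a vector is not its own neighbour
            score = cosine(query, candidate)
            if score > best_score:
                best_idx, best_score = j, score
        results.append((best_idx, best_score))
    return results


# Toy embeddings: the first two point in almost the same direction.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
nn = nearest_neighbours(toy)
for i, (idx, score) in enumerate(nn):
    print(i, idx, round(score, 4))
```

A high similarity to the nearest neighbour suggests a near-duplicate instruction. Any cut-off used to drop such rows would be a choice made by the user, since none is stated here.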
Dataset columns
The examples have the following structure per configuration:
| Column | Description |
| --- | --- |
| model_name_response_base | the name of the base model used to generate the response. |
| instruction | the instruction generated by the instruct model from the Magpie pre-query template. |
| response | the generated response for the instruction using the instruct model (Llama 3.1 405B Instruct FP8). |
| response_base | the generated response for the instruction using the base model (Llama 3.1 405B FP8). |
| intent | the intent of the user query or instruction column (generated with Llama 3.1 8B Instruct). |
| knowledge | the required knowledge to generate a response for the instruction column (generated with Llama 3.1 8B Instruct). |
| difficulty | the difficulty of the generated instruction (generated with Llama 3.1 8B Instruct). It can be very easy, easy, medium, hard or very hard. |
| model_name_difficulty | the name of the model used to generate the intent, knowledge and difficulty columns. |
| explanation | an assessment highlighting the strengths and/or weaknesses of the instruction (generated with Llama 3.1 8B Instruct). |
| quality | the quality of the generated instruction (generated with Llama 3.1 8B Instruct). It can be very poor, poor, average, good or excellent. |
| model_name_quality | the name of the model used to generate the explanation and quality columns. |
| primary_tag | the category of the instruction (generated with Llama 3.1 8B Instruct). It can be Information seeking, Reasoning, Planning, Editing, Coding & Debugging, Math, Data analysis, Creative writing, Advice seeking, Brainstorming or Others. |
| other_tags | other categories of the instruction (generated with Llama 3.1 8B Instruct). They can be Information seeking, Reasoning, Planning, Editing, Coding & Debugging, Math, Data analysis, Creative writing, Advice seeking, Brainstorming or Others. |
| model_name_classification | the name of the model used to assign a category to the instruction. |
| embedding | the sentence embedding generated for the instruction (generated with Alibaba NLP gte-large-en-v1.5). |
| model_name_embeddings | the name of the model used to generate the sentence embeddings. |
| score | the score given by the reward model (RLHFlow/ArmoRM-Llama3-8B-v0.1) for the column response. |
| score_base | the score given by the reward model (RLHFlow/ArmoRM-Llama3-8B-v0.1) for the column response_base. |
| distilabel_metadata | distilabel framework metadata containing information about the row. |
| nn_indices | the indices of the K nearest neighbours (K = 1). |
| nn_scores | the score or distance of the K nearest neighbours (K = 1). Cosine similarity is used. |
| guard | the raw response given by the model used to check the safety of the instruction-response pair (generated with Llama Guard 3 8B). |
| safe | whether the instruction-response pair is safe or not. |
| hazard_category | the hazard category from the MLCommons AI Safety taxonomy assigned by the guard model. |
| score_difference | the difference between score and score_base. |
The instruction and response columns can be used for SFT. Depending on the value of score_difference, one can build a chosen/rejected pair that can be used for DPO: if score_difference is positive, response can be selected as chosen and response_base as rejected; otherwise, the roles are swapped.
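The rule for building preference pairs can be sketched as a small helper. The column names come from the table above. The function name, the output keys (`prompt`, `chosen`, `rejected`) and the handling of a score difference of exactly zero are assumptions made for this sketch, not part of the dataset.

```python
def to_dpo_pair(row: dict) -> dict:
    """Turn one dataset row into a prompt/chosen/rejected record."""
    if row["score_difference"] > 0:
        # The instruct model scored higher than the base model.
        chosen, rejected = row["response"], row["response_base"]
    else:
        # Assumption: ties (difference == 0) are treated like the negative case.
        chosen, rejected = row["response_base"], row["response"]
    return {"prompt": row["instruction"], "chosen": chosen, "rejected": rejected}


# Toy rows with made-up values, only to show the two branches.
rows = [
    {"instruction": "q1", "response": "a", "response_base": "b", "score_difference": 0.2},
    {"instruction": "q2", "response": "c", "response_base": "d", "score_difference": -0.1},
]
pairs = [to_dpo_pair(r) for r in rows]
print(pairs)
```

One might prefer to drop rows whose score difference is close to zero, since the reward model expresses little preference between the two responses in that case.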
Limitations
This is an unfiltered version of the dataset; a smaller, filtered version will be released soon.
The dataset is probably unbalanced (we will fix this in upcoming iterations).
Suggested labels
{'label-name': 'instruction-response', 'label-description': 'A dataset containing instruction-response pairs for various tasks generated using LLMs.', 'gh-repo': 'argilla/magpie-ultra-v0.1', 'confidence': 62.25}