[ ] argilla/magpie-ultra-v0.1 · Datasets at Hugging Face

Dataset Card for magpie-ultra-v0.1

Dataset Summary

magpie-ultra is a synthetic dataset for supervised fine-tuning, generated with the new Llama 3.1 405B-Instruct model together with other Llama models such as Llama-Guard-3-8B and Meta-Llama-3.1-8B-Instruct.

The dataset contains challenging instructions and responses for a wide variety of tasks, such as Coding & Debugging, Math, Data analysis, Creative writing, Advice seeking, or Brainstorming.

Explore the dataset in Argilla.

Magpie Pipeline

As the name of the dataset indicates, we used the Magpie recipe to generate the instruction-response pairs:

Paper: Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Magpie HF Org: Magpie-Align

The main differences with respect to the original Magpie release are that we used the new Llama 3.1 family of models and that we generated substantially fewer instruction-response pairs for this first iteration: 50K vs 1M rows. The Magpie pipeline can be summarised as follows:

Using meta-llama/Meta-Llama-3.1-405B-Instruct-FP8, we generate an instruction as described in the Magpie paper: we send the model the pre-query template <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n and, because the LLM is autoregressive and was fine-tuned on an SFT dataset, it generates a user instruction until it emits the <|eot_id|> token. We then send the generated instruction back to the LLM to get a response.
Using the base model (Llama 3.1 405B FP8), we generate another response for the generated instruction. We then score the responses from the instruct and base models with RLHFlow/ArmoRM-Llama3-8B-v0.1. If the instruct model's score minus the base model's score is positive, we can consider the instruct model's response to be of higher quality.
Using meta-llama/Meta-Llama-3.1-8B-Instruct, we assess the quality and the difficulty of the generated instructions, and we classify them into one or more of the following categories: Information seeking, Reasoning, Planning, Editing, Coding & Debugging, Math, Data analysis, Creative writing, Advice seeking, Brainstorming or Others. To ensure that the model outputs are valid JSON that we can easily parse, we used the structured output generation feature of distilabel.
Using meta-llama/Llama-Guard-3-8B, we classify the generated instruction-response pairs as "safe" or "unsafe", also recording the hazard category from the MLCommons AI Safety.
Finally, using Alibaba-NLP/gte-large-en-v1.5 and Faiss, we generate embeddings for all the instructions and compute each instruction's nearest neighbour to ensure instruction diversity in the final dataset.
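
The first step of the pipeline can be sketched in a few lines. This is a minimal illustration of the pre-query trick, not the code used to build the dataset: `generate` and `fake_generate` are hypothetical stand-ins for a real completion call, and the assistant header string is assumed from the Llama 3 chat format rather than taken from this card.

```python
from typing import Callable

# Pre-query template quoted in the pipeline description: the chat prompt stops
# right after the user header, so the model writes the user turn itself.
PRE_QUERY_TEMPLATE = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
EOT = "<|eot_id|>"
# Assumption: assistant header as defined by the Llama 3 chat format.
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"


def magpie_pair(generate: Callable[[str], str]) -> dict:
    """Build one instruction-response pair.

    `generate` is a hypothetical completion function: prompt in, raw text out.
    """
    # Step 1: the model completes the empty user turn, i.e. invents an instruction.
    instruction = generate(PRE_QUERY_TEMPLATE).split(EOT)[0].strip()
    # Step 2: close the user turn and open the assistant turn to get a response.
    prompt = PRE_QUERY_TEMPLATE + instruction + EOT + ASSISTANT_HEADER
    response = generate(prompt).split(EOT)[0].strip()
    return {"instruction": instruction, "response": response}


def fake_generate(prompt: str) -> str:
    # Stand-in for a real LLM call so the sketch runs offline.
    if prompt.endswith(ASSISTANT_HEADER):
        return "Use sorted(items)." + EOT
    return "How do I sort a list in Python?" + EOT


pair = magpie_pair(fake_generate)
```

In a real run, `generate` would call the instruct model with sampling enabled, so that repeated calls on the same pre-query template yield different instructions.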

The dataset was generated using a single 8xH100 machine:

Generating the instruction-response pairs took ~60 hours.
Generating the responses with the base model took ~27 hours.
Computing the embeddings, assessing the quality and difficulty, classifying the instructions into categories, and classifying the instructions into safe or unsafe took ~24 hours.

Dataset columns

The examples have the following structure per configuration:

Column	Description
`model_name_response_base`	the name of the base model used to generate the response.
`instruction`	the instruction generated by the instruct model from the Magpie pre-query template.
`response`	the generated response for the instruction using the instruct model (Llama 3.1 405B Instruct FP8).
`response_base`	the generated response for the instruction using the base model (Llama 3.1 405B FP8).
`intent`	the intent of the user query or instruction column (generated with Llama 3.1 8B Instruct).
`knowledge`	the required knowledge to generate a response for the instruction column (generated with Llama 3.1 8B Instruct).
`difficulty`	the difficulty of the generated instruction (generated with Llama 3.1 8B Instruct). It can be very easy, easy, medium, hard or very hard.
`model_name_difficulty`	the name of the model used to generate the intent, knowledge and difficulty columns.
`explanation`	an assessment, highlighting the strengths and/or weaknesses of the instruction (generated with Llama 3.1 8B Instruct).
`quality`	the quality of the generated instruction (generated with Llama 3.1 8B Instruct). It can be very poor, poor, average, good or excellent.
`model_name_quality`	the name of the model used to generate the explanation and quality columns.
`primary_tag`	the category of the instruction (generated with Llama 3.1 8B Instruct). It can be Information seeking, Reasoning, Planning, Editing, Coding & Debugging, Math, Data analysis, Creative writing, Advice seeking, Brainstorming or Others.
`other_tags`	other categories of the instruction (generated with Llama 3.1 8B Instruct). It can be Information seeking, Reasoning, Planning, Editing, Coding & Debugging, Math, Data analysis, Creative writing, Advice seeking, Brainstorming or Others.
`model_name_classification`	the name of the model used to assign a category to the instruction.
`embedding`	the sentence embedding generated for the instruction (generated with Alibaba NLP gte-large-en-v1.5).
`model_name_embeddings`	the name of the model used to generate the sentence embeddings.
`score`	the score given by the reward model (RLHFlow/ArmoRM-Llama3-8B-v0.1) for the column response.
`score_base`	the score given by the reward model (RLHFlow/ArmoRM-Llama3-8B-v0.1) for the column response_base.
`distilabel_metadata`	distilabel framework metadata containing information about the row.
`nn_indices`	the indices of the K (1) nearest neighbours.
`nn_scores`	the score or distance of the K (1) nearest neighbours, computed with cosine similarity.
`guard`	the raw response given by the model used to check the safety of the instruction-response pair (generated with Llama Guard 3 8B).
`safe`	whether the instruction-response pair is safe or not.
`hazard_category`	the assigned hazard category from the MLCommons AI Safety by the guard model.
`score_difference`	the difference between the score and score_base.
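
To make the `nn_indices` and `nn_scores` columns concrete, here is a toy reconstruction with K = 1. It is a sketch under stated assumptions: a brute-force pure-Python search replaces Faiss, and the two-dimensional vectors are made-up stand-ins for the real gte-large-en-v1.5 embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(embeddings):
    """For each vector, return (index, similarity) of its closest other vector."""
    results = []
    for i, query in enumerate(embeddings):
        best = max(
            ((j, cosine(query, other)) for j, other in enumerate(embeddings) if j != i),
            key=lambda pair: pair[1],
        )
        results.append(best)
    return results


# Toy embeddings: rows 0 and 1 are near-duplicates, row 2 is unrelated.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
neighbours = nearest_neighbours(toy)
```

A similarity close to 1 flags a near-duplicate instruction; the cut-off used to drop such rows is left to the user, since the card does not state one.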

The instruction and response columns can be used for SFT. Depending on the value of score_difference, one can generate a chosen/rejected pair for DPO: if score_difference is positive, select response as chosen and response_base as rejected; if it is negative, do the opposite.
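
The rule for building preference pairs could look like the following sketch. The output keys `prompt`, `chosen` and `rejected` are an assumed naming convention, and skipping rows with a zero difference is an assumption as well; the card does not say how ties are handled.

```python
def to_dpo_pair(row):
    """Turn one dataset row into a preference pair, or None when the scores tie."""
    diff = row["score_difference"]  # score - score_base
    if diff == 0:
        return None  # assumption: a tie carries no preference signal
    if diff > 0:
        chosen, rejected = row["response"], row["response_base"]
    else:
        chosen, rejected = row["response_base"], row["response"]
    return {"prompt": row["instruction"], "chosen": chosen, "rejected": rejected}


# Toy rows that only carry the columns the function needs.
rows = [
    {"instruction": "q1", "response": "a", "response_base": "b", "score_difference": 0.5},
    {"instruction": "q2", "response": "c", "response_base": "d", "score_difference": -0.25},
    {"instruction": "q3", "response": "e", "response_base": "f", "score_difference": 0.0},
]
pairs = [p for p in map(to_dpo_pair, rows) if p is not None]
```

One may also want to require a minimum absolute difference before trusting the preference, since small gaps in reward-model scores can be noise.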

Limitations

This is an unfiltered version of the dataset; we will soon release a smaller, filtered version.
The dataset is probably unbalanced (we will fix this in upcoming iterations).
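
Until a filtered release is available, the annotation columns allow a rough filter and a balance check of one's own. The sketch below is only an illustration: the criteria (safe rows rated good or excellent) are an assumption, not the authors' planned filter, and the toy rows carry just the columns used here.

```python
from collections import Counter

# Assumption: keep only the two highest values of the `quality` column.
KEEP_QUALITY = {"good", "excellent"}


def filter_rows(rows):
    """Keep rows flagged safe whose instruction quality is high enough."""
    return [r for r in rows if r["safe"] and r["quality"] in KEEP_QUALITY]


def tag_distribution(rows):
    """Count rows per `primary_tag` to see how balanced the categories are."""
    return Counter(r["primary_tag"] for r in rows)


toy_rows = [
    {"safe": True, "quality": "good", "primary_tag": "Math"},
    {"safe": True, "quality": "poor", "primary_tag": "Math"},
    {"safe": False, "quality": "excellent", "primary_tag": "Planning"},
    {"safe": True, "quality": "excellent", "primary_tag": "Editing"},
]
kept = filter_rows(toy_rows)
distribution = tag_distribution(kept)
```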

Suggested labels

{'label-name': 'instruction-response', 'label-description': 'A dataset containing instruction-response pairs for various tasks generated using LLMs.', 'gh-repo': 'argilla/magpie-ultra-v0.1', 'confidence': 62.25}

irthomasthomas / undecidability

magpie-ultra - a synthetic dataset for supervised fine-tuning using Llama 3.1 #870
