From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning
(NAACL'24)
Chinese Version: [知乎]
This is the repo for the Cherry Data Selection project, which introduces a self-guided methodology for LLMs to autonomously discern and select cherry samples from vast open-source datasets, effectively minimizing the manual curation and cost required to instruction-tune an LLM.
The repo contains the code for computing IFD scores and selecting cherry data, together with links to the selected data and trained model checkpoints.
(Feel free to email Ming (Homepage, Email) for any questions or feedback.)
Our study puts forth a method for autonomously sifting through expansive open-source datasets to discover the most impactful training samples. We call these samples "cherry data", designating the data fragments that hold the potential to substantially enhance LLM instruction tuning. At the heart of our research is the hypothesis that, during preliminary training on a small amount of carefully chosen instruction data, LLMs can develop an intrinsic capability to discern instructions. This foundational understanding equips them to assess the quality of broader datasets, making it possible to estimate instruction-following difficulty in a self-guided manner.
Initially, the model is familiarized with a fraction of the target dataset during the "Learning from Brief Experience" phase. This preliminary knowledge paves the way for the subsequent "Evaluating Based on Experience" phase, where we meticulously evaluate the model's response generation. To estimate the difficulty of a given example, we propose a novel metric called the Instruction-Following Difficulty (IFD) score, in which the model's capability to generate a response to a given instruction and its capability to generate that response directly are measured and compared. By calculating IFD scores, we quantify the challenge each sample presents to the model. Harnessing these insights, the "Retraining from Self-Guided Experience" phase uses the cherry data with standout IFD scores to hone the model, culminating in our superior cherry models. The net result is a model that aligns more adeptly with instructions, ensuring enhanced performance.
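For intuition, below is a minimal, hedged sketch of how an IFD-style score can be computed with a HuggingFace causal LM: the response tokens are scored once conditioned on the instruction (wrapped in whatever prompt template is used) and once on their own, and the two average losses are divided. The helper names are illustrative and the context length is approximated; data_analysis.py is the authoritative implementation.

```python
# A minimal, hedged sketch of an IFD-style score with a HuggingFace causal LM.
# In practice the instruction is wrapped in the chosen prompt template first.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def avg_response_loss(model, tokenizer, context: str, response: str) -> float:
    """Average cross-entropy over the response tokens, optionally conditioned on `context`."""
    ids = tokenizer(context + response, return_tensors="pt").input_ids
    # Approximate the context length by tokenizing it separately (fine for a sketch).
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1] if context else 0
    labels = ids.clone()
    labels[:, :ctx_len] = -100                      # score only the response tokens
    return model(ids, labels=labels).loss.item()

def ifd_score(model, tokenizer, instruction: str, response: str) -> float:
    """IFD = loss of the response given the instruction / loss of the response alone."""
    conditioned = avg_response_loss(model, tokenizer, instruction, response)
    direct = avg_response_loss(model, tokenizer, "", response)
    return conditioned / direct

tokenizer = AutoTokenizer.from_pretrained("<your_path_to_hf_converted_llama_ckpt_and_tokenizer>")
model = AutoModelForCausalLM.from_pretrained("<your_path_to_hf_converted_llama_ckpt_and_tokenizer>").eval()
print(ifd_score(model, tokenizer, "Name three primary colors.\n", "Red, blue, and yellow."))
```

A low ratio means the instruction makes the response much easier to predict (the sample is already easy for the model), while a ratio close to or above 1 means the instruction barely helps, i.e. the sample is hard to follow.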
Install the dependencies with pip install -r requirements.txt
Note: This requirements.txt originates from Stanford Alpaca. If you are using a different code base with PyTorch already installed, you do not need to install from requirements.txt; simply install the packages below manually:
pip install tqdm
pip install scikit-learn
python cherry_seletion/data_analysis.py \
--data_path data/alpaca_data.json \
--save_path alpaca_data_pre.pt \
--model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
--max_length 512 \
--prompt alpaca \
--mod pre
--data_path: The targeted dataset in the Alpaca format.
--save_path: The path to save the .pt file containing embeddings or scores.
--prompt: The prompt type used for training and selecting data; choose between alpaca and wiz.
--mod: pre is used for getting the embeddings or scores needed for selecting pre-experienced samples; cherry is used for getting the scores needed for selecting cherry samples.
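For intuition, the .pt file produced in pre mode holds the per-sample features that the clustering step below consumes. Its exact contents are defined by data_analysis.py; the snippet below is only a hedged sketch of one plausible feature, a mean-pooled last-hidden-state embedding of the instruction, reusing a model and tokenizer loaded as in the IFD sketch above.

```python
import torch

@torch.no_grad()
def embed_instruction(model, tokenizer, text: str, max_length: int = 512) -> torch.Tensor:
    """One plausible instruction embedding: mean-pooled last hidden states of a causal LM."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1][0]  # (seq_len, dim)
    return hidden.mean(dim=0)                                                 # (dim,)
```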
python cherry_seletion/data_by_cluster.py \
--pt_data_path alpaca_data_pre.pt \
--json_data_path data/alpaca_data.json \
--json_save_path alpaca_data_pre.json \
--sample_num 10 \
--kmeans_num_clusters 100 \
--low_th 25 \
--up_th 75
--pt_data_path: The .pt file from the previous step containing the needed embeddings or scores.
--json_data_path: The targeted dataset in the Alpaca format.
--json_save_path: The path to save the selected pre-experienced samples.
--sample_num: How many samples will be selected in each cluster.
--kmeans_num_clusters: How many clusters will be generated by K-Means.
--low_th and --up_th: The lower and upper thresholds for selecting samples within each cluster.
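To make the selection concrete, here is a hedged sketch of what this clustering step does, assuming embeddings (an N x D NumPy array) and per-sample scores (a length-N NumPy array) were loaded from the .pt file; data_by_cluster.py is the authoritative implementation and may differ in detail, e.g. in how samples are picked inside the thresholded band.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_pre_experienced(embeddings, scores, sample_num=10, num_clusters=100,
                           low_th=25, up_th=75, seed=0):
    """Pick up to `sample_num` samples per K-Means cluster whose score lies between
    the cluster's low_th-th and up_th-th percentiles."""
    labels = KMeans(n_clusters=num_clusters, random_state=seed, n_init=10).fit_predict(embeddings)
    rng = np.random.default_rng(seed)
    selected = []
    for c in range(num_clusters):
        idx = np.where(labels == c)[0]
        lo, hi = np.percentile(scores[idx], [low_th, up_th])
        band = idx[(scores[idx] >= lo) & (scores[idx] <= hi)]   # keep mid-range samples only
        selected.extend(rng.choice(band, size=min(sample_num, len(band)), replace=False))
    return sorted(int(i) for i in selected)
```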
Train Pre-Experienced Model
Select Cherry Data
python cherry_seletion/data_analysis.py \
--data_path data/alpaca_data.json \
--save_path alpaca_data_cherry.pt \
--model_name_or_path <your_path_pre_experienced_model> \
--max_length 512 \
--prompt alpaca \
--mod cherry
python cherry_seletion/data_by_IFD.py \
--pt_data_path alpaca_data_cherry.pt \
--json_data_path data/alpaca_data.json \
--json_save_path alpaca_data_cherry.json \
--max_length 512 \
--sample_rate 0.06 \
--prompt alpaca
--sample_rate: The proportion of cherry samples to select. You can also use --sample_number to set the exact number of samples.
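For intuition, a hedged sketch of the selection logic at this step: discard samples whose IFD score is not below 1 (the instruction did not make the response easier to predict), rank the rest by IFD score, and keep the top sample_rate fraction. The ifd field and the intermediate file name are illustrative; data_by_IFD.py is the authoritative implementation.

```python
import json

def select_cherry(data, sample_rate=0.06):
    # Keep only samples whose IFD score is below 1, then take the highest-IFD fraction.
    usable = [d for d in data if d["ifd"] < 1.0]          # "ifd" is an illustrative field name
    usable.sort(key=lambda d: d["ifd"], reverse=True)     # hardest-to-follow instructions first
    return usable[: int(len(data) * sample_rate)]

with open("alpaca_data_with_ifd.json") as f:              # hypothetical intermediate file
    cherry = select_cherry(json.load(f))
with open("alpaca_data_cherry.json", "w") as f:
    json.dump(cherry, f, indent=2, ensure_ascii=False)
```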
The following table provides a comparison between our cherry models and baseline models on the Huggingface Open LLM Leaderboard and AlpacaEval Leaderboard.
These results are based on cherry_data_v1. The prompt and training hyperparameters can be found in the Hyperparameters section.
These results verify the effectiveness of our method, which can be used to select the most valuable data samples for instruction tuning.
| | Avg | ARC | HellaSwag | MMLU | TruthfulQA | AlpacaEval | Data | Model |
|---|---|---|---|---|---|---|---|---|
| Alpaca | 50.21 | 42.65 | 76.91 | 41.73 | 39.55 | 26.46 | / | / |
| 5% Alpaca | 52.06 | 53.92 | 79.49 | 36.51 | 38.33 | 34.74 | [Link] | [hf-Link] |
| 10% Alpaca | / | / | / | / | / | / | [Link] | [hf-Link] |
| 15% Alpaca | / | / | / | / | / | / | [Link] | [hf-Link] |
| WizardLM | 54.18 | 51.60 | 77.70 | 42.70 | 44.70 | 67.64 | / | / |
| **WizardLM*** | 52.79 | 53.07 | 77.44 | 37.75 | 42.90 | 61.99 | [hf-Link] | [hf-Link] |
| 10% WizardLM | 51.59 | 52.90 | 78.95 | 33.08 | 41.41 | 61.44 | [Link] | [hf-Link] |
| 20% WizardLM | / | / | / | / | / | / | [Link] | [hf-Link] |
| 20% WizardLM | / | / | / | / | / | / | [Link] | [hf-Link] |
| 40% WizardLM | 52.83 | 53.07 | 77.79 | 35.29 | 45.17 | 65.09 | [Link] | [hf-Link] |
Also, the WizardLM filter script is provided here: [Link]
Thanks to FastChat and flash-attention, we are able to run our experiments with a longer context length.
The results below are obtained by directly using cherry_data_v1 to finetune the llama-2-7B model, with a max length of 2048 and the original Vicuna prompts.
| | Avg | ARC | HellaSwag | MMLU | TruthfulQA | AlpacaEval | Data | Model |
|---|---|---|---|---|---|---|---|---|
| WizardLM | 57.09 | 54.18 | 79.25 | 46.92 | 48.01 | 66.08 | / | [Link] |
| 10% WizardLM | 57.57 | 54.86 | 80.46 | 45.74 | 49.20 | 71.36 | [Link] | [Link] |
| 20% WizardLM | / | / | / | / | / | / | [Link] | [Link] |
| 20% WizardLM | 58.50 | 55.97 | 80.40 | 46.87 | 50.76 | 72.57 | [Link] | [Link] |
| 40% WizardLM | 58.00 | 56.23 | 80.22 | 46.15 | 49.37 | 70.52 | [Link] | [Link] |
Note: WizardLM in the above table is our implementation using FastChat code, prompt, and configuration.
Note: Due to hardware limits, all our models use the 7B model.
Note: For these llama2 models, we still use cherry_data_v1 to ensure the effectiveness of our data. We will soon make cherry_data_v2, which is based on llama2, available.
In this section, all the IFD scores are calculated on llama2-7b or llama2-13b models using Vicuna's prompt. The training of pre-experienced models is skipped for more efficient use. The performance is promising on llama2 models even without a pre-experienced model, indicating the effectiveness of our proposed IFD scores.
| | Avg | ARC | HellaSwag | MMLU | TruthfulQA | AlpacaEval | Data | Model |
|---|---|---|---|---|---|---|---|---|
| Alpaca-7b (llama2) | 55.25 | 54.35 | 78.65 | 47.02 | 40.98 | 27.75 | / | / |
| 5% Alpaca-7b (llama2) | 55.78 | 57.94 | 80.37 | 44.91 | 40.62 | 36.78 | / | / |
| 10% Alpaca-7b (llama2) | 56.31 | 58.02 | 80.42 | 46.64 | 40.18 | / | / | / |
| 15% Alpaca-7b (llama2) | 56.37 | 57.42 | 80.68 | 46.40 | 40.95 | / | / | / |
| Alpaca-13b (llama2) | 58.78 | 57.59 | 81.98 | 54.05 | 41.49 | 35.00 | / | / |
| 5% Alpaca-13b (llama2) | 61.21 | 62.37 | 84.00 | 55.65 | 42.82 | 46.82 | / | / |
| 10% Alpaca-13b (llama2) | 61.02 | 62.97 | 83.88 | 55.29 | 41.93 | / | / | / |
| 15% Alpaca-13b (llama2) | 61.23 | 62.37 | 83.48 | 55.56 | 43.42 | / | / | / |
All the above models are trained using FastChat code and prompt.
Data with IFD scores will be released soon.
We release the code and data for using GPT4 or chatGPT to evaluate and compare the performance of two LLMs. This method greatly reduces the potential position bias of GPT4 and chatGPT. For details, please see AlpaGasus or our paper. We thank @Lichang-Chen and the AlpaGasus repo for sharing the evaluation code.
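The key idea is to judge every pair of responses twice with the presentation order swapped and to aggregate the two verdicts, so neither model benefits from always appearing first. Below is a hedged sketch of that aggregation; judge is a placeholder for the GPT4/chatGPT call made by the scripts below, and the exact tie-breaking rule in the scripts may differ.

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]

def compare(prompt: str, ans_1: str, ans_2: str,
            judge: Callable[[str, str, str], Verdict]) -> str:
    first = judge(prompt, ans_1, ans_2)    # model 1 shown as answer A
    second = judge(prompt, ans_2, ans_1)   # order swapped: model 1 shown as answer B
    wins_1 = (first == "A") + (second == "B")
    wins_2 = (first == "B") + (second == "A")
    if wins_1 > wins_2:
        return "model_1"
    if wins_2 > wins_1:
        return "model_2"
    return "tie"
```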
To use this code, please follow the below scripts:
bash scripts/do_eval_generation.sh: The model automatically generates responses for the instructions in the test datasets.
bash scripts/do_eval_generation_wrap.sh: Wrap the response files of the LLMs being compared.
bash scripts/do_eval.sh: Use GPT4 or chatGPT for the evaluation.
bash scripts/do_review_eval_score.sh: Parse the results and draw the figure.
More detailed instructions will be added soon. Feel free to drop me an email if you need them urgently.
Comparing our models trained on selected data with models trained on full data. (a) Comparison between our model with 5% Alpaca data and the official Alpaca model. (b) Comparison between our model with 10% WizardLM data and the reimplemented WizardLM model. (c) Comparison between our model with 40% WizardLM data and the official WizardLM model. All these experiments use GPT4 as the judge. Each horizontal bar represents a comparison in a specific test set.
We used the following prompts for fine-tuning the cherry models with Alpaca data:
for examples with a non-empty input field:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
for examples with an empty input field:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
We used the following prompts for fine-tuning the cherry models with Wizard data:
{instruction}
### Response:
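For reference, a small illustrative helper (not part of the repo) showing how a dataset entry is rendered with the Alpaca-style templates above before tokenization:

```python
# Illustrative only: render an Alpaca-format entry with the prompts listed above.
PROMPT_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that appropriately "
    "completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:\n"
)

def render_prompt(example: dict) -> str:
    """Pick the template based on whether the example has a non-empty input field."""
    template = PROMPT_INPUT if example.get("input") else PROMPT_NO_INPUT
    return template.format_map(example)
```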
| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay | Warmup Rate |
|---|---:|---:|---:|---:|---:|---:|
| Cherry Models V1 (Alpaca) | 128 | 2e-5 | 3 | 512 | 0 | 0.03 |
| Cherry Models V1 (WizardLM) | 128 | 2e-5 | 3 | 1024 | 0 | 0.03 |
| Cherry Models V2 7B | 128 | 2e-5 | 3 | 2048 | 0 | 0.03 |
| Cherry Models V2 13B | 128 | 1e-5 | 5 | 2048 | 0 | 0.03 |
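As a rough guide, the hedged sketch below maps the Cherry Models V1 (Alpaca) row onto HuggingFace TrainingArguments; the actual runs use the Stanford Alpaca and FastChat training scripts, and the global batch size of 128 is reached via per-device batch size x gradient accumulation x number of GPUs (the split shown here is an assumption).

```python
from transformers import TrainingArguments

# Hedged example: 4 GPUs x per-device batch 4 x gradient accumulation 8 = 128 global batch.
training_args = TrainingArguments(
    output_dir="./cherry_alpaca_v1",        # illustrative output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.0,
    warmup_ratio=0.03,
    bf16=True,                              # assumption; depends on your hardware
)
# Max length (512 for the Alpaca V1 models) is handled by the tokenizer/data
# preprocessing, not by TrainingArguments.
```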
Please consider citing our paper if you think our codes, data, or models are useful. Thank you!
@inproceedings{li-etal-2024-quantity,
title = "From Quantity to Quality: Boosting {LLM} Performance with Self-Guided Data Selection for Instruction Tuning",
author = "Li, Ming and
Zhang, Yong and
Li, Zhitao and
Chen, Jiuhai and
Chen, Lichang and
Cheng, Ning and
Wang, Jianzong and
Zhou, Tianyi and
Xiao, Jing",
editor = "Duh, Kevin and
Gomez, Helena and
Bethard, Steven",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.naacl-long.421",
pages = "7595--7628",
}
@inproceedings{li-etal-2024-superfiltering,
title = "Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning",
author = "Li, Ming and
Zhang, Yong and
He, Shwai and
Li, Zhitao and
Zhao, Hongyu and
Wang, Jianzong and
Cheng, Ning and
Zhou, Tianyi",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.769",
pages = "14255--14273",
}
@inproceedings{li-etal-2024-selective,
title = "Selective Reflection-Tuning: Student-Selected Data Recycling for {LLM} Instruction-Tuning",
author = "Li, Ming and
Chen, Lichang and
Chen, Jiuhai and
He, Shwai and
Gu, Jiuxiang and
Zhou, Tianyi",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
month = aug,
year = "2024",
address = "Bangkok, Thailand and virtual meeting",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-acl.958",
pages = "16189--16211",
}
@inproceedings{li2023reflectiontuning,
title={Reflection-Tuning: Recycling Data for Better Instruction-Tuning},
author={Ming Li and Lichang Chen and Jiuhai Chen and Shwai He and Tianyi Zhou},
booktitle={NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following},
year={2023},
url={https://openreview.net/forum?id=xaqoZZqkPU}
}
If you are interested in Data Selection for Instruction Tuning, please see Cherry_LLM and Superfiltering.
If you are interested in human/LLM-free Data Augmentation for Instruction Tuning, please see Mosaic-IT and RuleR.
If you are interested in Data Improvement for Instruction Tuning, please see Reflection_Tuning.
If you are interested in Knowledge Distillation in the LLM era, please see this Survey.