tianyi-lab / Superfiltering

[ACL'24] Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
123 stars 10 forks source link

Superfiltering: Weak-to-Strong Data Filtering (ACL'24)

Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning (ACL'24)
Chinese Version: [知乎]

overview

This is the repo for the Superfiltering project, which introduces a method astonishingly utilizes a small GPT2 (124M) model to successfully filter out the high-quality subset from the existing GPT4-generated instruction tuning dataset.

The repo contains:

(This repo partially originated from Cherry_LLM and Reflection_Tuning.)
(Feel free to email Ming (Homepage, Email) for any questions or feedback.)

News

Contents

Overview

Superfiltering

Instruction tuning is critical to improve LLMs but usually suffers from low-quality and redundant data. Data filtering for instruction tuning has proved important in improving both the efficiency and performance of the tuning process. But it also leads to extra cost and computation due to the involvement of LLMs in this process. To reduce the filtering cost, we study Superfiltering: Can we use a smaller and weaker model to select data for finetuning a larger and stronger model? Despite the performance gap between weak and strong language models, we find their highly consistent capability to perceive instruction difficulty and data selection results. This enables us to use a much smaller and more efficient model to filter the instruction data used to train a larger language model. Not only does it largely speed up the data filtering, but the filtered-data-finetuned LLM achieves even better performance on standard benchmarks. Extensive experiments validate the efficacy and efficiency of our approach.

overview

Top: Comparison of data filtering for instruction tuning of a student model. (a) The filter model is a strong proprietary LLM, e.g. ChatGPT, which can be time-consuming and expensive but usually performs promisingly. (b) The filter model is the student model itself or a similar-sized open-source LLM, which is still time-consuming but free to use. (c) Weak-to-strong Superfiltering proposed by this paper, which utilizes a much smaller filter model, e.g. GPT-2, to train a stronger student LLM. We find it costs much less time but maintains the performance.
Bottom: Comparisons of two student models finetuned using 5% data selected by LLaMA2-7B and GPT-2 from the Alpaca dataset. (d) Both models trained on 5% data outperform the baseline model trained on 100% data. (e) GPT-2 as the superfilter speeds up data filtering by 20 times.

Superfiltering with Diversity

Motivated by recent work that further includes Diversity metrics in the data selection process, we introduce an extended version of Superfiltering, Superfiltering with Diversity (Superfiltering.D). We hypothesize that the diversity metrics work better when implemented on a high-quality data subset than the whole dataset with mixed quality. Thus we propose to first utilize Superfiltering to select a subset with relatively high quality, then further utilize Facility Location Function to further compress the selected data number. Compared with other diversity metrics, the Facility Location Function can strike a balance between capturing diversity and ensuring the representation of different clusters or regions within the data, it ensures a global view of the given high-quality subset. To further preserve the efficiency of our Superfiltering.D, we utilize sentence-transformers/all-MiniLM-L6-v2 as the encoder, which only has approximately 80M parameters. In our preliminary experiments on the Alpaca and Alpaca-GPT4 dataset, where we first select 20% of the data by Superfiltering, then utilize the Facility Location Function to further select 2% of the data. The models trained with 2% of the data have a comparable or better performance than full data models.

The benefits of our Superfiltering.D:

  1. We can compress the data selected to 2%, which further greatly improves the efficiency of Instruction Tuning.
  2. This 2-step method, considering diversity only on the high-quality subset, relaxes the strong reliance on fancy encoders, ensuring that small encoders can work effectively.
  3. This 2-step method greatly improves the efficiency of the diversity metrics, both the encoder and the diversity metric only need to compute on a subset rather than the whole great dataset.

Highlights

Install

Install the dependencies with pip install -r requirements.txt

Note: The calculation of IFD scores only needs the transformers package, thus if you are using a different code base with transformers installed, you can directly run the code and manually install the missing packages.

Run Code

Superfiltering

  1. Calculate IFD scores
bash scripts/step1_select_data_analysis_gpt2.sh

--data_path: The targeted dataset in the Alpaca format.
--save_path: The path to save the .jsonl file containing scores.
--model_name_or_path: The model used for calculating IFD scores, we found gpt2 is good enough as illustrated in our paper. Also, you can use the model that you need to finetune, which would be a self-guided manner or student-involved manner.

  1. Put scores into the original data
    bash scripts/step2_put_analysis_to_data.sh

pt_data_path: The .jsonl file generated in last step.
json_data_path: The targeted dataset in the Alpaca format.
json_save_path: The data path to save the data with IFD scores.

Note: Steps 1 and 2 can be merged directly for better convenience.

  1. Select the data you wish.
    bash scripts/step3_select_data.sh

json_data_path: The data path to save the data with IFD scores.
json_save_path: The data path to save the data with IFD scores filtered.
sample_rate: How much data do you need? Here we only provide the percentage version, you can slightly modify the code to select the exact number you want.

Note: The Step 1 code is the batch_size=1 version, it takes about 15 minutes to process the whole Alpaca dataset. We release this version and split the whole process into 3 steps for better controllability. You can directly run the above 3 scripts to get a better understanding of our codes. It takes about 15 minutes for the whole process.

Superfiltering.D

To run Superfiltering.D, please first install the submodlib package here.
The step 1 and 2 are the same as the previous ones.

  1. Select the data with diversity.
    scripts/optional_select_data_plus_diversity.sh

json_data_path: The data path to save the data with IFD scores.
json_save_path: The data path to save the data with IFD scores filtered.
ifd_num: The number of data you want for the high-quality subset, selected by the Superfiltering.
fla_num: The number of data you want after implementing FacilityLocationFunction.

Note: In our preliminary experiments, setting ifd_num as 20% of the full data and fla_num as 2% of the full data works fine for both Alpaca and Alpaca-GPT4 datasets.
Further experiments will be conducted.

Data

The Alpaca Data with GPT2-based IFD scores can be found in data/data_with_ifd/alpaca_data_gpt2_data.json.
The Alpaca-GPT4 Data with GPT2-based IFD scores can be found in data/data_with_ifd/alpaca_gpt4_data_gpt2_data.json.

To select the subset data from these datasets, you can directly run bash scripts/step3_select_data.sh in above Step 3.

Evaluation

The codes and data for pair-wise comparison by using GPT4 are released in the evaluation folder. This method greatly eliminates the potential position bias of GPT4 and chatGPT.

To use this code, please follow the below scripts:

bash evaluation/scripts/do_eval_generation.sh: The model automatically generates the responses for a given instruction in test datasets.
bash evaluation/scripts/do_eval_generation_wrap.sh: Wrap the response files of LLMs being compared.
bash evaluation/scripts/do_eval.sh: Use GPT4 or chatGPT for the evaluation.
bash evaluation/scripts/do_review_eval_score.sh: Parse the results and draw the figure.

For other evaluation metrics, please see their official repo.

ToDo

Citation

Please consider citing our papers if you think our codes, data, or models are useful. Thank you!

@inproceedings{li-etal-2024-superfiltering,
    title = "Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning",
    author = "Li, Ming  and
      Zhang, Yong  and
      He, Shwai  and
      Li, Zhitao  and
      Zhao, Hongyu  and
      Wang, Jianzong  and
      Cheng, Ning  and
      Zhou, Tianyi",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.769",
    pages = "14255--14273",
}

@inproceedings{li-etal-2024-selective,
    title = "Selective Reflection-Tuning: Student-Selected Data Recycling for {LLM} Instruction-Tuning",
    author = "Li, Ming  and
      Chen, Lichang  and
      Chen, Jiuhai  and
      He, Shwai  and
      Gu, Jiuxiang  and
      Zhou, Tianyi",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.958",
    pages = "16189--16211",
}

@inproceedings{li-etal-2024-quantity,
  title = "From Quantity to Quality: Boosting {LLM} Performance with Self-Guided Data Selection for Instruction Tuning",
  author = "Li, Ming  and
    Zhang, Yong  and
    Li, Zhitao  and
    Chen, Jiuhai  and
    Chen, Lichang  and
    Cheng, Ning  and
    Wang, Jianzong  and
    Zhou, Tianyi  and
    Xiao, Jing",
  editor = "Duh, Kevin  and
    Gomez, Helena  and
    Bethard, Steven",
  booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
  month = jun,
  year = "2024",
  address = "Mexico City, Mexico",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2024.naacl-long.421",
  pages = "7595--7628",
}

@inproceedings{li2023reflectiontuning,
  title={Reflection-Tuning: Recycling Data for Better Instruction-Tuning},
  author={Ming Li and Lichang Chen and Jiuhai Chen and Shwai He and Tianyi Zhou},
  booktitle={NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following},
  year={2023},
  url={https://openreview.net/forum?id=xaqoZZqkPU}
}

Our Related Works

If you are interested in Data Selection for Instruction Tuning, please see Cherry_LLM and Superfiltering.
If you are interested in human/LLM-free Data Augmentation for Instruction Tuning, please see Mosaic-IT and RuleR.
If you are interested in Data Improvement for Instruction Tuning, please see Reflection_Tuning.
If you are interested in Knowledge Distillation in the LLM era, please see this Survey.

Star History

Star History Chart