microsoft / TransformerCompression

Code accompanying our publications on compression methods for transformers.
MIT License

Transformer Compression with SliceGPT

This repository contains the code for the SliceGPT paper (ICLR'24), which is also discussed on Hugging Face.

SliceGPT is a post-training sparsification scheme that makes transformer networks (including LLMs) smaller. It first applies orthogonal transformations to each transformer layer that leave the model's output unchanged, and then slices off the least-significant rows and columns of the weight matrices, chosen according to the decay of the eigenvalues of the activation covariance. The model structure is left unchanged, but each weight matrix is replaced by a smaller, dense one, reducing the embedding dimension of the model. This yields speedups (without any additional code optimization) and a reduced memory footprint.
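The two ingredients, computational invariance under orthogonal transformations and slicing along the principal components, can be illustrated in a few lines of NumPy. This is a toy sketch of the principle for a single linear layer, not the library's implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 8, 100                 # embedding dimension, number of activations
    W = rng.normal(size=(d, d))   # weight matrix of one (linear) layer
    X = rng.normal(size=(n, d))   # activations entering the layer

    # Orthogonal Q from the eigendecomposition of the activation covariance
    # (PCA), with components sorted by decreasing eigenvalue.
    eigvals, Q = np.linalg.eigh(X.T @ X)
    Q = Q[:, np.argsort(eigvals)[::-1]]

    # Computational invariance: rotating activations and weights changes
    # nothing, because Q @ Q.T is the identity.
    assert np.allclose(X @ W, (X @ Q) @ (Q.T @ W))

    # Slicing: keep only the top-k principal components. The weight matrix
    # shrinks from (d, d) to (k, d); the error depends on how much activation
    # energy the dropped directions carry (the eigenvalue decay).
    k = 6
    approx = (X @ Q[:, :k]) @ (Q[:, :k].T @ W)
    print("slicing error:", np.linalg.norm(X @ W - approx))

In the full model, the same transformation must be threaded through residual connections and layernorms, which is what the layernorm fusion step described later takes care of.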

The code is arranged as a package slicegpt in /src, and scripts to replicate experiments from the paper are in /experiments. To install the slicegpt package, we recommend:

    pip install -e .[experiment]
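A quick way to check that the package imports cleanly after installation:

    python -c "import slicegpt; print(slicegpt.__name__)"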

Running SliceGPT

To run SliceGPT on microsoft/phi-2, from the experiments folder, run

    python run_slicegpt.py \
           --model microsoft/phi-2 \
           --save-dir dir/to/save/sliced_model/in \
           --sparsity 0.25 \
           --device cuda:0 \
           --eval-baseline \
           --no-wandb

This will compress the microsoft/phi-2 model and save the compressed model to the specified directory. Please consult the script for the full set of options.

Note: For models that require Hugging Face authentication, set the --hf-token argument manually or via a key vault. Alternatively, set the environment variable HF_TOKEN.
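For example, the token can be exported once in the shell before running any of the scripts (HF_TOKEN is the standard environment variable read by the Hugging Face libraries):

    export HF_TOKEN=<your-token>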

Recovery fine-tuning

To install additional dependencies required for post-slicing recovery fine-tuning (RFT):

    pip install -e .[experiment,finetune]

The following replicates the experiments in the paper (the LoRA hyperparameters are valid for all Llama-2 and Phi-2 models):

    python run_finetuning.py \
           --model microsoft/phi-2 \
           --sliced-model-path path/to/sliced \
           --save-dir dir/to/save/finetuned_model/in \
           --sparsity 0.25 \
           --device cuda:0 \
           --ppl-eval-dataset alpaca \
           --finetune-dataset alpaca \
           --finetune-train-nsamples 8000 \
           --finetune-train-seqlen 1024 \
           --finetune-train-batch-size 3 \
           --lora-alpha 10 \
           --lora-r 32 \
           --lora-dropout 0.05 \
           --lora-target-option attn_head_and_mlp \
           --eval-steps 16 \
           --save-steps 16 \
           --no-wandb
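For reference, the LoRA flags above correspond to a peft configuration along the following lines. This is a sketch: the concrete modules selected by --lora-target-option attn_head_and_mlp are resolved per architecture inside run_finetuning.py, and the module names listed here are an assumption for Phi-2-style models.

    from peft import LoraConfig

    # Sketch of the LoRA setup implied by the flags above; target_modules is
    # an assumption, run_finetuning.py resolves the real names per model.
    lora_config = LoraConfig(
        r=32,               # --lora-r
        lora_alpha=10,      # --lora-alpha
        lora_dropout=0.05,  # --lora-dropout
        target_modules=["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"],
    )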


Evaluation using the LM Eval Harness

    python run_lm_eval.py \
           --model microsoft/phi-2 \
           --sliced-model-path path/to/sliced \
           --sparsity 0.25 \
           --tasks piqa \
           --no-wandb


Supported models

Several models from the Hugging Face Hub are currently supported, including the Llama-2 and Phi-2 families referenced above; the authoritative list is the set of model types handled in hf_utils.get_model_and_tokenizer.

Extending support to a new model type

The model you wish to support must be in Hugging Face Hub format. The model files can be downloaded from the Hugging Face Hub by supplying the --model argument, or loaded from local storage by using the --model and --model-path arguments. To add SliceGPT support for a new model, implement a new model adapter and update hf_utils.get_model_and_tokenizer before slicing the new model.

Implementing a new model adapter

Example: llama_adapter.py
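A minimal sketch of what an adapter provides is below. The import path and method surface are assumptions; llama_adapter.py and src/slicegpt/model_adapter.py define the real interface.

    # Illustrative sketch only; consult src/slicegpt/model_adapter.py for the
    # authoritative abstract interface and llama_adapter.py for a full example.
    from slicegpt.model_adapter import LayerAdapter, ModelAdapter

    class MyLayerAdapter(LayerAdapter):
        # Maps one decoder layer onto the accessors SliceGPT needs: the input
        # layernorms and the attention/MLP projection matrices whose rows and
        # columns get rotated and sliced.
        ...

    class MyModelAdapter(ModelAdapter):
        # Maps model-level structure: config, embedding matrix, the list of
        # decoder layers (each wrapped in a LayerAdapter), the final layernorm
        # and the LM head.
        ...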

Using a new model adapter to slice a model

Once a model adapter is implemented, compressing the model involves three conceptual steps: replacing the model's modules with compressible equivalents; fusing the layernorm parameters into the adjacent weight matrices, which makes the network invariant to orthogonal transformations; and computing the orthogonal transformations on calibration data, then rotating and slicing the weight matrices.

Example: run_slicegpt.py
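Condensed, the flow looks roughly like the following. Treat the module and function names as assumptions; run_slicegpt.py remains the authoritative reference.

    # Rough sketch of the slicing pipeline; names are assumptions, see
    # run_slicegpt.py for the real code.
    from slicegpt import hf_utils, layernorm_fusion, rotate

    model_adapter, tokenizer = hf_utils.get_model_and_tokenizer("microsoft/phi-2")
    calibration_loader = ...  # a DataLoader over calibration text, built in the script

    # 1. Replace modules with compressible equivalents.
    layernorm_fusion.replace_layers(model_adapter)

    # 2. Fuse the layernorm parameters into the adjacent weight matrices, so
    #    the network is invariant to orthogonal transformations.
    layernorm_fusion.fuse_modules(model_adapter)

    # 3. Compute orthogonal transformations on calibration data, then rotate
    #    and slice the weight matrices.
    rotate.rotate_and_slice(model_adapter, calibration_loader, sparsity=0.25)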

Note: If the model you wish to support is not available on the Hugging Face Hub, you will also need to implement custom model loading and initialization functionality.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.

Any use of third-party trademarks or logos is subject to those third parties' policies.