mit-han-lab / qserve

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Apache License 2.0

[Paper] [LMQuant Quantization Algorithm Library] [Website]

QServe is an efficient and accurate LLM serving system on GPUs with W4A8KV4 quantization (4-bit weights, 8-bit activations, and a 4-bit KV cache). Compared with the leading industry solution, TensorRT-LLM, QServe achieves 1.2x-1.4x higher throughput when serving Llama-3-8B and 2.4x-3.5x higher throughput when serving Qwen1.5-72B on L40S and A100 GPUs. QServe also allows users to achieve A100-level throughput on 3x cheaper L40S GPUs.

Introduction

Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weights, 8-bit activations, and a 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented by the QServe inference library, which achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, in the QoQ algorithm we introduce progressive quantization, which allows low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2× on A100 and 1.4× on L40S, and of Qwen1.5-72B by 2.4× on A100 and 3.5× on L40S, compared to TensorRT-LLM. Remarkably, QServe on an L40S GPU can achieve even higher throughput than TensorRT-LLM on an A100. Thus, QServe effectively reduces the dollar cost of LLM serving by 3×.

The current release supports:

Installation

  1. Clone this repository and navigate to the corresponding folder:

    git clone https://github.com/mit-han-lab/qserve
    cd qserve
  2. Install QServe

    conda create -n QServe python=3.10 -y
    conda activate QServe
    pip install --upgrade pip  # enable PEP 660 support
    pip install -e .
    pip install flash-attn --no-build-isolation

    We recommend starting an interactive Python session and running

    import flash_attn

    to check whether FlashAttention-2 is installed successfully. If not, we recommend downloading pre-built wheels from here. Please notice:

  3. Compile the CUDA kernels.

    cd kernels
    python setup.py install
  4. If you want to clone our model zoo, please make sure that git-lfs is installed.
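
The FlashAttention check from step 2 can also be scripted. The sketch below is only an illustration: the helper name `can_import` is hypothetical and not part of QServe. It attempts the same `import flash_attn` and reports the error text instead of raising.

```python
import importlib
from typing import Tuple


def can_import(module_name: str) -> Tuple[bool, str]:
    """Try to import `module_name`; return (ok, error message).

    `can_import` is a hypothetical helper for this README, not a QServe API.
    """
    try:
        importlib.import_module(module_name)
    except ImportError as exc:  # also covers ModuleNotFoundError
        return False, str(exc)
    return True, ""


ok, error = can_import("flash_attn")
print("flash_attn import OK" if ok else f"flash_attn import failed: {error}")
```

If the import fails, installing a pre-built wheel as described in step 2 is the suggested fallback.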

QServe Model Zoo

We provide pre-quantized checkpoints for multiple model families. For example, to download the Llama-3-8B model, please run the following commands:

# git lfs install  # install git lfs if not already
mkdir -p qserve_checkpoints && cd qserve_checkpoints
git clone https://huggingface.co/mit-han-lab/Llama-3-8B-QServe 

For other models, please refer to the detailed support list for the links to download:

| Models | W4A8-per-channel | W4A8-g128 |
| --- | --- | --- |
| Llama3 | 8B/70B | 8B/70B |
| Llama3-Instruct | 8B/70B | 8B/70B |
| Llama2 | 7B/13B/70B | 7B/13B/70B |
| Vicuna | 7B/13B/30B | 7B/13B/30B |
| Mistral | 7B | 7B |
| Yi | 34B | 34B |
| Qwen | ✅ 72B | ✅ 72B |

We recommend QServe-per-channel on flagship datacenter GPUs such as the A100, and QServe-per-group on inference-oriented datacenter GPUs such as the L40S.

If you are interested in generating the quantized checkpoints on your own, please follow the instructions in LMQuant to run QoQ quantization and dump the fake-quantized models. We then provide a checkpoint converter to real-quantize and pack the model into QServe format:

python checkpoint_converter.py --model-path <hf-model-path> --quant-path <fake-quant-model-path> --group-size -1 --device cpu
# <fake-quant-model-path> is a directory generated by LMQuant, including model.pt and scale.pt

We also provide a script to run the checkpoint converter. The final model will be automatically stored under qserve_checkpoints.
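
As a rough illustration of the conversion step, the sketch below first checks that the LMQuant output directory contains the two files named in the comment above (`model.pt` and `scale.pt`) and then assembles the converter command. Both helper names are hypothetical, and passing `128` as the group size for g128 checkpoints is an assumption based on the model table, not something the converter documentation here confirms.

```python
from pathlib import Path
from typing import List

# File names taken from the comment above: LMQuant writes model.pt and scale.pt.
REQUIRED_FILES = ("model.pt", "scale.pt")


def missing_lmquant_files(quant_path: str) -> List[str]:
    """List the expected LMQuant outputs that are absent from `quant_path`."""
    root = Path(quant_path)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def converter_command(model_path: str, quant_path: str,
                      group_size: int = -1, device: str = "cpu") -> List[str]:
    """Build the argv for checkpoint_converter.py using the flags shown above.

    group_size=-1 is the per-channel setting from the command above.
    """
    return [
        "python", "checkpoint_converter.py",
        "--model-path", model_path,
        "--quant-path", quant_path,
        "--group-size", str(group_size),
        "--device", device,
    ]
```

A caller could pass the result to `subprocess.run(..., check=True)` once `missing_lmquant_files` returns an empty list.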

Usage and Examples

We support both offline benchmarking and online generation (in-flight-batching) in QServe.

  1. Offline speed benchmarking (Batched input sequences, fixed context length = 1024 and generation length = 512). We take Llama-3-8B (per-channel quant) as an example here. Please make sure that you have already downloaded the QoQ-quantized QServe model.
export MODEL_PATH=./qserve_checkpoints/Llama-3-8B-QServe # Please set the path accordingly

GLOBAL_BATCH_SIZE=128 \
python qserve_benchmark.py \
  --model $MODEL_PATH \
  --benchmarking \
  --precision w4a8kv4 \
  --group-size -1

To use a larger batch size such as 256, you may need to set NUM_GPU_PAGE_BLOCKS to a value larger than the one determined automatically on A100. For example:

export MODEL_PATH=./qserve_checkpoints/Llama-3-8B-QServe # Please set the path accordingly

GLOBAL_BATCH_SIZE=256 \
NUM_GPU_PAGE_BLOCKS=6400 \
python qserve_benchmark.py \
  --model $MODEL_PATH \
  --benchmarking \
  --precision w4a8kv4 \
  --group-size -1
  2. Online generation. This demonstration of batched generation showcases in-flight batching and paged attention for W4A8KV4 QoQ LLMs. We randomly sample a set of safety-moderated conversations from the WildChat dataset and process them efficiently through in-flight batching.
export MODEL_PATH=./qserve_checkpoints/Llama-3-8B-Instruct-QServe # Please set the path accordingly

python qserve_e2e_generation.py \
  --model $MODEL_PATH \
  --ifb-mode \
  --precision w4a8kv4 \
  --quant-path $MODEL_PATH \
  --group-size -1
  3. Argument list in QServe

    Below are some frequently used arguments in the QServe interface:

  4. One-line scripts:

We also provide sample scripts in QServe.

These scripts are expected to be executed in the QServe project folder (not in the scripts folder). Please note that git-lfs is needed for downloading QServe benchmark config files from huggingface before running the benchmark scripts.
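
The offline benchmarking invocation from step 1 can also be assembled from Python, which is convenient when sweeping batch sizes. This is a minimal sketch: the two helper functions are hypothetical, and only the environment variables and flags shown in the commands above are used.

```python
import os
from typing import Dict, List, Optional


def benchmark_env(batch_size: int,
                  num_gpu_page_blocks: Optional[int] = None) -> Dict[str, str]:
    """Copy the current environment and add the QServe benchmark variables."""
    env = dict(os.environ)
    env["GLOBAL_BATCH_SIZE"] = str(batch_size)
    if num_gpu_page_blocks is not None:
        # Only override the automatically determined value when asked to.
        env["NUM_GPU_PAGE_BLOCKS"] = str(num_gpu_page_blocks)
    return env


def benchmark_command(model_path: str, group_size: int = -1) -> List[str]:
    """Build the argv for qserve_benchmark.py with the flags shown above."""
    return [
        "python", "qserve_benchmark.py",
        "--model", model_path,
        "--benchmarking",
        "--precision", "w4a8kv4",
        "--group-size", str(group_size),
    ]
```

For example, `subprocess.run(benchmark_command(model_path), env=benchmark_env(256, 6400), check=True)`, run from the QServe project folder, would mirror the batch-size-256 command above.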

Results

We evaluate QServe W4A8KV4 quantization on a wide range of mainstream LLMs. QServe consistently outperforms existing W4A4 and W4A8 solutions in accuracy, while providing state-of-the-art LLM serving efficiency.

Efficiency Benchmarks

When serving Llama-3-8B and Qwen1.5-72B on L40S and A100 GPUs, QServe achieves 1.2x-1.4x higher throughput than the leading industry solution, TensorRT-LLM, for Llama-3-8B, and 2.4x-3.5x higher throughput for Qwen1.5-72B. For six of the eight models benchmarked, QServe on L40S also delivers higher throughput than TensorRT-LLM on A100 while accommodating the same batch size, effectively reducing the dollar cost of LLM serving by around 3x.

Benchmarking setting: the criterion is maximum achievable throughput on NVIDIA GPUs, with an input context length of 1024 tokens and an output generation length of 512 tokens. We enable paged attention for all systems that support it. In-flight batching is turned off in the efficiency benchmarks.

| L40S (48G) | Llama-3-8B | Llama-2-7B | Mistral-7B | Llama-2-13B | Llama-30B | Yi-34B | Llama-2-70B | Qwen-1.5-72B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TRT-LLM-FP16 | 1326 | 444 | 1566 | 92 | OOM | OOM | OOM | OOM |
| TRT-LLM-W4A16 | 1431 | 681 | 1457 | 368 | 148 | 313 | 119 | 17 |
| TRT-LLM-W8A8 | 2634 | 1271 | 2569 | 440 | 123 | 364 | OOM | OOM |
| Atom-W4A4 | -- | 2120 | -- | -- | -- | -- | -- | -- |
| QuaRot-W4A4 | -- | 805 | -- | 413 | 133 | -- | -- | 15 |
| QServe-W4A8KV4 | 3656 | 2394 | 3774 | 1327 | 504 | 869 | 286 | 59 |
| Throughput Increase* | 1.39x | 1.13x | 1.47x | 3.02x | 3.41x | 2.39x | 2.40x | 3.47x |

| A100 (80G) | Llama-3-8B | Llama-2-7B | Mistral-7B | Llama-2-13B | Llama-30B | Yi-34B | Llama-2-70B | Qwen-1.5-72B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TRT-LLM-FP16 | 2503 | 1549 | 2371 | 488 | 80 | 145 | OOM | OOM |
| TRT-LLM-W4A16 | 2370 | 1549 | 2403 | 871 | 352 | 569 | 358 | 143 |
| TRT-LLM-W8A8 | 2396 | 2334 | 2427 | 1277 | 361 | 649 | 235 | 53 |
| Atom-W4A4 | -- | 1160 | -- | -- | -- | -- | -- | -- |
| QuaRot-W4A4 | -- | 1370 | -- | 289 | 267 | -- | -- | 68 |
| QServe-W4A8KV4 | 3005 | 2908 | 2970 | 1741 | 749 | 803 | 419 | 340 |
| Throughput Increase* | 1.20x | 1.25x | 1.22x | 1.36x | 2.07x | 1.23x | 1.17x | 2.38x |

The absolute token generation throughput of QServe and baseline systems (unit: tokens/second; -- means unsupported). All experiments were conducted under the same device memory budget. The throughput increase of QServe is calculated relative to the best baseline in each column. We recommend QServe-per-channel on high-end datacenter GPUs like the A100 and QServe-per-group on inference GPUs like the L40S.
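
To make the "best baseline in each column" rule concrete, here is a small sketch that recomputes one entry of the throughput-increase row. The function name is hypothetical. The numbers are copied from the L40S table, and OOM or unsupported entries are represented as `None`.

```python
from typing import Dict, Optional


def throughput_increase(qserve: float,
                        baselines: Dict[str, Optional[float]]) -> float:
    """Ratio of QServe throughput to the best available baseline.

    Baselines given as None (OOM or unsupported) are ignored.
    """
    available = [value for value in baselines.values() if value is not None]
    if not available:
        raise ValueError("no baseline produced a throughput number")
    return qserve / max(available)


# Llama-3-8B on L40S, numbers copied from the table above (tokens/second).
l40s_llama3_8b = {
    "TRT-LLM-FP16": 1326,
    "TRT-LLM-W4A16": 1431,
    "TRT-LLM-W8A8": 2634,
    "Atom-W4A4": None,
    "QuaRot-W4A4": None,
}
ratio = throughput_increase(3656, l40s_llama3_8b)
print(f"{ratio:.2f}x")  # prints 1.39x
```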

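Max throughput batch sizes used by QServe:

| Device | Llama-3-8B | Llama-2-7B | Mistral-7B | Llama-2-13B | Llama-30B | Yi-34B | Llama-2-70B | Qwen-1.5-72B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| L40S | 128 | 128 | 128 | 75 | 32 | 64 | 24 | 4 |
| A100 | 256 | 190 | 256 | 128 | 64 | 196 | 96 | 32 |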

We recommend directly setting the NUM_GPU_PAGE_BLOCKS environment variable to 25 * batch size. In our benchmarking setting, the context length of 1024 plus the generation length of 512 gives 1536 tokens per sequence, which corresponds to 24 pages (each page contains 64 tokens). We leave some buffer by allocating one more page for each sequence.
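
The page-count arithmetic above can be written out as a short helper. This is a sketch only: the function name is hypothetical, and rounding up for lengths that are not a multiple of the page size is an assumption that the benchmark setting (1536 tokens, exactly 24 pages) does not exercise.

```python
PAGE_SIZE_TOKENS = 64  # tokens per KV-cache page, as stated above


def num_gpu_page_blocks(batch_size: int,
                        context_len: int = 1024,
                        generation_len: int = 512,
                        spare_pages_per_seq: int = 1) -> int:
    """Pages needed for `batch_size` sequences, plus a per-sequence buffer."""
    total_tokens = context_len + generation_len
    # Ceiling division: a partially filled page still occupies a whole page.
    pages_per_seq = -(-total_tokens // PAGE_SIZE_TOKENS)
    return batch_size * (pages_per_seq + spare_pages_per_seq)


print(num_gpu_page_blocks(256))  # prints 6400
```

For a batch size of 256 this gives 25 * 256 = 6400, the value used in the A100 benchmarking example.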

Accuracy Evaluation

QServe also maintains high accuracy thanks to the QoQ algorithm provided in our LMQuant quantization library.

Below is the WikiText2 perplexity evaluated with a sequence length of 2048. Lower is better.

| Models | Precision | Llama-3 8B | Llama-2 7B | Llama-2 13B | Llama-2 70B | Llama 7B | Llama 13B | Llama 30B | Mistral 7B | Yi 34B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FP16 | | 6.14 | 5.47 | 4.88 | 3.32 | 5.68 | 5.09 | 4.10 | 5.25 | 4.60 |
| SmoothQuant | W8A8 | 6.28 | 5.54 | 4.95 | 3.36 | 5.73 | 5.13 | 4.23 | 5.29 | 4.69 |
| GPTQ-R | W4A16 g128 | 6.56 | 5.63 | 4.99 | 3.43 | 5.83 | 5.20 | 4.22 | 5.39 | 4.68 |
| AWQ | W4A16 g128 | 6.54 | 5.60 | 4.97 | 3.41 | 5.78 | 5.19 | 4.21 | 5.37 | 4.67 |
| QuaRot | W4A4 | 8.33 | 6.19 | 5.45 | 3.83 | 6.34 | 5.58 | 4.64 | 5.77 | NaN |
| Atom | W4A4 g128 | 7.76 | 6.12 | 5.31 | 3.73 | 6.25 | 5.52 | 4.61 | 5.76 | 4.97 |
| QoQ | W4A8KV4 | 6.89 | 5.75 | 5.12 | 3.52 | 5.93 | 5.28 | 4.34 | 5.45 | 4.74 |
| QoQ | W4A8KV4 g128 | 6.76 | 5.70 | 5.08 | 3.47 | 5.89 | 5.25 | 4.28 | 5.42 | 4.76 |

* SmoothQuant is evaluated with per-tensor static KV cache quantization.
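
For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that definition together with one common way of cutting a token stream into fixed-length windows. Both function names are hypothetical, and dropping the trailing partial window is an assumption about the evaluation protocol, not something stated here.

```python
import math
from typing import Iterable, List, Sequence


def split_into_windows(token_ids: Sequence[int],
                       seq_len: int = 2048) -> List[Sequence[int]]:
    """Cut a token stream into non-overlapping windows of `seq_len` tokens.

    Assumption: the trailing partial window is dropped.
    """
    n_windows = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_windows)]


def perplexity(token_log_probs: Iterable[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) over all tokens."""
    log_probs = list(token_log_probs)
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(log_probs) / len(log_probs))
```

A model that assigns probability 0.25 to every token has a perplexity of 4, so the gap between rows in the table reflects how much probability mass a quantized model loses relative to FP16.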

Reference

If you find QServe useful or relevant to your research and work, please kindly cite our paper:

@article{lin2024qserve,
  title={QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving},
  author={Lin*, Yujun and Tang*, Haotian and Yang*, Shang and Zhang, Zhekai and Xiao, Guangxuan and Gan, Chuang and Han, Song},
  journal={arXiv preprint arXiv:2405.04532},
  year={2024}
}

Team

The QServe serving library is maintained by the following research team:

Related Projects

The following projects are highly related to QServe. Our group has developed full-stack application-algorithm-system-hardware support for efficient large models, receiving 9k+ GitHub stars and over 1M Huggingface community downloads.

You are also welcome to check out MIT HAN LAB for other exciting projects on Efficient Generative AI!

Acknowledgement

We thank Julien Demouth, Jun Yang, and Dongxu Yang from NVIDIA for the helpful discussions. QServe is also inspired by many open-source libraries, including (but not limited to) TensorRT-LLM, vLLM, vLLM-SmoothQuant, FlashAttention-2, LMDeploy, TorchSparse++, GPTQ, QuaRot and Atom.