huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

[RFC] Add Auto-Round Support #2130


yiliu30 commented 4 days ago

Hi, this is the INC team from Intel. Thank you for developing this amazing project.

Motivation

Our team has developed a new weight-only quantization algorithm called Auto-Round. It has achieved superior accuracy compared to GPTQ, AWQ, and OmniQuant across 11 tasks, particularly excelling in low-bit quantization (e.g., 2-bit and 3-bit). Auto-Round supports quantization from 2 to 8 bits, involves low tuning costs, and imposes no additional overhead during inference. Key results are summarized below, with detailed information available in our paper, GitHub repository, and the Hugging Face low-bit quantization leaderboard.

[Image: Auto-Round accuracy results across two models]

We would like to contribute this quantization algorithm to TGI and enable users to:

  1. Quantize a floating-point model using Auto-Round.
  2. Perform inference with an AutoRound-quantized model.

1. Quantize a Floating-Point Model Using Auto-Round

Extend the current quantize API and add a new method argument for selecting the quantization algorithm. Users can utilize it as follows:

text-generation-server quantize \
    --MODEL_ID /path/to/float/model \
    --OUTPUT_DIR /path/to/save/quantized/model \
    --method autoround # <--- select the method, such as `gptq` or `autoround`
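
To make the method argument concrete, here is a minimal dispatch sketch; it only illustrates the proposed flow, and the helpers _quantize_gptq and _quantize_autoround are hypothetical placeholders rather than existing TGI functions.

def _quantize_gptq(model_id: str, output_dir: str) -> None:
    ...  # existing GPTQ quantization path

def _quantize_autoround(model_id: str, output_dir: str) -> None:
    ...  # new path that calls Auto-Round's API (see the calling flow below)

def quantize(model_id: str, output_dir: str, method: str = "gptq") -> None:
    # Route to the selected quantization algorithm
    dispatch = {"gptq": _quantize_gptq, "autoround": _quantize_autoround}
    if method not in dispatch:
        raise ValueError(f"unsupported quantization method: {method!r}")
    dispatch[method](model_id, output_dir)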

We propose two options to implement it:

Option 1: Adding Auto-Round as a New Python Dependency (Recommended)

Auto-Round is currently released as a pure Python package. This option adds auto-round to TGI's requirements_xx.txt and calls Auto-Round's API to obtain the quantized model.

Advantages:

Option 2: Porting All Source Code of Auto-Round into TGI

We are also willing to integrate all source code of Auto-Round directly into TGI.

Advantages:

Here is the overall calling flow for these two options:

# tgi/server/text_generation_server/layers/autoround/quantize.py

def quantize(
    model_id: str,
    bits: int,
    groupsize: int,
    output_dir: str,
    ...
):
    # Load model...
    model = ...
    # Quantize the model using Auto-Round
    # Option 1: import autoround from the auto-round pip package
    # Option 2: import autoround from the source vendored in this folder
    import autoround
    rounder = autoround.AutoRound(model, ...)
    rounder.quantize(...)
    rounder.save_quantized(output_dir)
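
For Option 1, the flow above would reduce to roughly the following, assuming the auto_round pip package exposes the AutoRound class as described in its upstream documentation; the model id and quantization settings are only illustrative.

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound  # assumption: provided by the auto-round pip package

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative float model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit weight-only quantization with group size 128 (illustrative settings)
rounder = AutoRound(model, tokenizer, bits=4, group_size=128)
rounder.quantize()
rounder.save_quantized("/path/to/save/quantized/model")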

2. Perform Inference with an AutoRound-Quantized Model

We propose extending the current text-generation-launcher API to include autoround as a new option within --quantize. Users can utilize it as follows:

# INC/Llama-2-7b-Chat-Autoround is a model quantized with Auto-Round
text-generation-launcher \
    --model-id INC/Llama-2-7b-Chat-Autoround \
    --trust-remote-code --port 8080 \
    --max-input-length 3072 --max-total-tokens 4096 --max-batch-prefill-tokens 4096 \
    --quantize autoround   # <------ select autoround
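
Once the server is up, the quantized model is served through TGI's standard REST API; for example, a quick smoke test against the /generate endpoint on the port used above (prompt and parameters are arbitrary):

import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is weight-only quantization?",
        "parameters": {"max_new_tokens": 64},
    },
)
print(response.json()["generated_text"])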

Your feedback is important. Please feel free to comment on the options above or suggest additional approaches so we can find the most appropriate way to contribute :). Thank you in advance!

flozi00 commented 4 days ago

Hi, two questions from my side. I saw you have a function to export to the AutoGPTQ format, so it should already be possible to do inference, or am I wrong? Is there any quality loss to expect when doing this?

And the second one is about 8-bit quants: do you already have some benchmarks?

I would be interested in integrating this :)

danieldk commented 4 days ago

Awesome work!

Question about inference: I see that you also support exporting to GPTQ format. What is the benefit of adding an extra autoround option for inference?

(Some background: we are considering switching to GPTQ-Marlin for supported configurations, since we see much-improved throughput.)

danieldk commented 4 days ago

Question about inference: I see that you also support exporting to GPTQ format. What is the benefit of adding an extra autoround option for inference?

Ah, I see you fixed the asymmetric quantization negative zeros bug. There is work on fixing this with a new gptq format revision, which you have probably seen. Linking it just in case: https://github.com/AutoGPTQ/AutoGPTQ/pull/640

yiliu30 commented 3 days ago

I saw you have a function to export to the AutoGPTQ format, so it should already be possible to do inference, or am I wrong? Is there any quality loss to expect when doing this?

Question about inference: I see that you also support exporting to GPTQ format. What is the benefit of adding an extra autoround option for inference?

Hi @flozi00 and @danieldk ,

Yes, AutoRound supports exporting to the AutoGPTQ format without quality loss, but only for models quantized uniformly to 4 or 8 bits.

Currently, AutoGPTQ can only apply a single quantization configuration to all target layers. In contrast, we support quantizing models with mixed bit-widths to balance accuracy and inference speed.

Additionally, auto-round supports quantizing the lm_head, which can save memory with negligible accuracy loss. For example, it can reduce the model size by more than 10% for W4G128 LLAMA3-8B.
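
As a rough illustration of such a mixed-bit recipe (including lm_head), a sketch is below; the per-layer configuration argument, shown here as layer_config, and its structure are assumptions based on Auto-Round's upstream examples, and the model id and settings are illustrative.

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "meta-llama/Meta-Llama-3-8B"  # illustrative
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical per-layer overrides: keep one sensitive projection at 8 bits
# and quantize lm_head alongside the default W4G128 configuration.
layer_config = {
    "model.layers.0.mlp.down_proj": {"bits": 8},
    "lm_head": {"bits": 4, "group_size": 128},
}

rounder = AutoRound(model, tokenizer, bits=4, group_size=128, layer_config=layer_config)
rounder.quantize()
rounder.save_quantized("/path/to/save/quantized/model")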

We are also exploring other data types such as w4a4, w4a8, and mx format.

cc @wenhuach21

yiliu30 commented 3 days ago

And the second one is about 8-bit quants: do you already have some benchmarks?

While 8-bit quantization is supported, we have not conducted extensive benchmarks because our 4-bit results are already quite good :). If you are interested, we are happy to conduct some tests.

yiliu30 commented 3 days ago

Question about inference: I see that you also support exporting to GPTQ format. What is the benefit of adding an extra autoround option for inference?

Ah, I see you fixed the asymmetric quantization negative zeros bug. There is work on fixing this with a new gptq format revision, which you have probably seen. Linking it just in case: AutoGPTQ/AutoGPTQ#640

Thank you for sharing this information. AutoRound also supports exporting a format compatible with the GPTQ-Marlin kernel.
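
As a hedged sketch of what such an export could look like: the sym flag and the format string passed to save_quantized are assumptions about Auto-Round's API (Marlin requires symmetric quantization), and the model id is illustrative.

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "mistralai/Mistral-7B-v0.1"  # illustrative
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Symmetric 4-bit quantization, exported in a GPTQ-compatible layout that
# Marlin-based kernels can load (`sym` and `format` values are assumed).
rounder = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
rounder.quantize()
rounder.save_quantized("/path/to/save/quantized/model", format="auto_gptq")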

danieldk commented 3 days ago

Yes, AutoRound supports exporting to the AutoGPTQ format without quality loss, but only for models quantized uniformly to 4 or 8 bits.

Currently, AutoGPTQ can only apply a single quantization configuration to all target layers. In contrast, we support quantizing models with mixed bit-widths to balance accuracy and inference speed.

That's really nice! I wonder if 'mixed-bitness' could be considered for a GPTQ v2 format as well. I think ideally, every quantizer that uses quantized weights, scales, biases, and scale grouping would use the same GPTQ-based format. This would allow us to switch out kernels when new options become available. There has been a lot of development in this space and every improvement in GPTQ inference performance has benefitted all GPTQ format-based models.

With respect to training, text-generation-server quantize exists primarily for historical reasons, but we think it's best for users to use AutoGPTQ, Optimum, etc. to quantize models.
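
For reference, quantizing with Optimum's GPTQ integration could look roughly like the following; the dataset and settings are illustrative, not a TGI recommendation.

from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# 4-bit GPTQ quantization calibrated on the c4 dataset (illustrative choices)
quantizer = GPTQQuantizer(bits=4, group_size=128, dataset="c4")
quantized_model = quantizer.quantize_model(model, tokenizer)
quantizer.save(quantized_model, "/path/to/gptq/model")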

wenhuach21 commented 3 days ago

Yes, AutoRound supports exporting to the AutoGPTQ format without quality loss, but only for models quantized uniformly to 4 or 8 bits. Currently, AutoGPTQ can only apply a single quantization configuration to all target layers. In contrast, we support quantizing models with mixed bit-widths to balance accuracy and inference speed.

That's really nice! I wonder if 'mixed-bitness' could be considered for a GPTQ v2 format as well. I think ideally, every quantizer that uses quantized weights, scales, biases, and scale grouping would use the same GPTQ-based format. This would allow us to switch out kernels when new options become available. There has been a lot of development in this space and every improvement in GPTQ inference performance has benefitted all GPTQ format-based models.

With respect to training, text-generation-server quantize exists primarily for historical reasons, but we think it's best for users to use AutoGPTQ, Optimum, etc. to quantize models.

Yes, I agree. From the TGI side, this is the ideal scenario. However, unifying everything is challenging for various reasons, similar to why multiple LLM serving frameworks exist. For example, AutoGPTQ limits calibration to roughly three supported datasets and throws an error if you specify others. Additionally, the GPTQ v2 pull request has been open for 2-3 months with no indication of whether it will be merged.

For AutoRound, we currently specify the backend name in the checkpoint's config.json (e.g., gptq:exllamav2; see https://huggingface.co/Intel/Mistral-7B-v0.1-int4-inc-lmhead/blob/main/config.json) and will switch to the GPTQ backend as the default CUDA kernel once their issue is fixed. Therefore, I believe TGI should have little difficulty switching to a better option for all GPTQ-based models in the future.
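
For illustration, the quantization_config block of such a checkpoint might look roughly like the Python dict below; the backend field follows the convention described above, while the remaining fields follow the usual GPTQ convention and their exact names here are assumptions.

# Hypothetical shape of a quantization_config carrying an explicit backend hint
quantization_config = {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128,
    "sym": False,
    "backend": "gptq:exllamav2",  # kernel hint, switchable to a better default later
}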