huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

[RFC] Add Auto-Round Support #2130

Closed: yiliu30 closed this issue 3 months ago

yiliu30 commented 4 months ago

Hi, here is the INC team from Intel. Thank you for developing this amazing project.

Motivation

Our team has developed a new weight-only quantization algorithm called Auto-Round. It has achieved superior accuracy compared to GPTQ, AWQ, and OmniQuant across 11 tasks, particularly excelling in low-bit quantization (e.g., 2-bits and 3-bits). Auto-Round supports quantization from 2 to 8 bits, involves low tuning costs, and imposes no additional overhead during inference. Key results are summarized below, with detailed information available in our paper, GitHub repository, and Hugging Face low-bit quantization leaderboard.

[Figure: Autoround-res-2models (key accuracy results)]

We would like to contribute this quantization algorithm to TGI and enable users to:

  1. Quantize a floating-point model using Auto-Round.
  2. Perform inference with an AutoRound-quantized model.

1. Quantize a Floating-Point Model Using Auto-Round

Extend the current quantize API and add a method argument to select between different algorithms. Users can use it as follows:

text-generation-server quantize \
    --MODEL_ID path/to/float/model \
    --OUTPUT_DIR /path/to/save/quantized/model \
    --method autoround # <--- select the method, e.g. `gptq` or `autoround`

We propose two options to implement it:

Option 1: Adding Auto-Round as a New Python Dependency (Recommended)

Auto-Round is currently released as a pure Python package. This option adds auto-round to TGI's requirements_xx.txt and calls Auto-Round's API to obtain the quantized model.

Advantages:

Option 2: Porting All Source Code of Auto-Round into TGI

We are also willing to integrate all source code of Auto-Round directly into TGI.

Advantages:

Here is the overall calling flow for these two options:

# tgi/server/text_generation_server/layers/autoround/quantize.py

def quantize(
    model_id: str,
    bits: int,
    groupsize: int,
    output_dir: str,
    ...
):
    # Load the model...
    model = ...
    # Quantize the model using Auto-Round:
    # - Option 1: import autoround from the auto-round package
    # - Option 2: import autoround from the folder ported into TGI
    import autoround
    rounder = autoround.AutoRound(model, ...)
    rounder.quantize(...)
    rounder.save_quantized(output_dir)
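
For Option 1, the call into the auto-round package could look roughly like the sketch below. This is only an illustration: the argument names follow Auto-Round's Python API at the time of writing and may differ between versions, and quantize_with_autoround is a hypothetical helper name.

# Hypothetical glue code for Option 1 (illustrative only): TGI loads the model
# and delegates the quantization itself to the auto-round package.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

def quantize_with_autoround(
    model_id: str,
    output_dir: str,
    bits: int = 4,
    group_size: int = 128,
    sym: bool = True,
):
    model = AutoModelForCausalLM.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    rounder = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym)
    rounder.quantize()
    # Export in a GPTQ-compatible layout so existing kernels can serve the model.
    rounder.save_quantized(output_dir, format="auto_gptq")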

2. Perform Inference with an AutoRound-Quantized Model

We propose extending the current text-generation-launcher API to include autoround as a new option within --quantize. Users can use it as follows:

text-generation-launcher \
    --model-id INC/Llama-2-7b-Chat-Autoround \ # Quantized model using auto-round
    --trust-remote-code --port 8080 \
    --max-input-length 3072 --max-total-tokens 4096 --max-batch-prefill-tokens 4096 \
    --quantize autoround   # <------ select autoround
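
On the serving side, a checkpoint exported by Auto-Round in a GPTQ-compatible layout could in principle reuse the existing GPTQ/Marlin code paths, so the dedicated autoround option mainly matters for AutoRound-specific features such as mixed bit-widths. The sketch below is purely hypothetical dispatch logic to illustrate where the new option would plug in; Quantize and select_kernel_family are illustrative names, not actual TGI symbols.

# Purely hypothetical dispatch sketch (not TGI's real quantization plumbing).
from enum import Enum

class Quantize(str, Enum):
    GPTQ = "gptq"
    AWQ = "awq"
    AUTOROUND = "autoround"  # proposed new choice for --quantize

def select_kernel_family(quantize: Quantize) -> str:
    # Checkpoints in a GPTQ-compatible layout (including AutoRound exports)
    # can share the GPTQ / GPTQ-Marlin kernels; only AutoRound-specific
    # features (e.g. mixed bit-widths) would need a dedicated path.
    if quantize in (Quantize.GPTQ, Quantize.AUTOROUND):
        return "gptq-compatible"
    if quantize is Quantize.AWQ:
        return "awq"
    raise ValueError(f"unsupported quantization: {quantize}")

print(select_kernel_family(Quantize.AUTOROUND))  # prints: gptq-compatible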

Your feedback is important. Please feel free to comment on the options mentioned above or suggest additional approaches so that we can find the most appropriate way to contribute :). Thank you in advance!

flozi00 commented 4 months ago

Hi, two questions from my side. I saw you have a function to export to the AutoGPTQ format, so it should already be possible to do inference, or am I wrong? Is there any quality loss to expect when doing this?

And the second one is about 8-bit quants: do you already have some benchmarks?

I would be interested in integrating this :)

danieldk commented 4 months ago

Awesome work!

Question about inference: I see that you also support exporting to GPTQ format. What is the benefit of adding an extra autoround option for inference?

(Some background: we are considering switching to GPTQ-Marlin for supported configurations, since we see much-improved throughput.)

danieldk commented 4 months ago

Question about inference: I see that you also support exporting to GPTQ format. What is the benefit of adding an extra autoround option for inference?

Ah, I see you fixed the asymmetric quantization negative zeros bug. There is work on fixing this with a new gptq format revision, which you have probably seen. Linking it just in case: https://github.com/AutoGPTQ/AutoGPTQ/pull/640

yiliu30 commented 4 months ago

I saw you have a function to export to the AutoGPTQ format, so it should already be possible to do inference, or am I wrong? Is there any quality loss to expect when doing this?

Question about inference: I see that you also support exporting to GPTQ format. What is the benefit of adding an extra autoround option for inference?

Hi @flozi00 and @danieldk ,

Yes, AutoRound supports exporting to the AutoGPTQ format without quality loss, but only for pure 4-bit or 8-bit models.

Currently, AutoGPTQ can only apply a single quantization configuration to all target layers. In contrast, we support quantizing models with mixed bit-widths to balance accuracy and inference speed.

Additionally, auto-round supports quantizing the lm_head, which can save memory with negligible accuracy loss. For example, it can reduce the model size by more than 10% for W4G128 LLAMA3-8B.
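
As a rough back-of-the-envelope check of that figure (assuming the usual LLaMA-3-8B dimensions of a 128,256-token vocabulary and hidden size 4,096, with embeddings kept in FP16; exact numbers depend on the model config and export format):

# Illustrative estimate only: memory saved by also quantizing lm_head.
GB = 1024**3

vocab, hidden = 128_256, 4_096                # assumed LLaMA-3-8B dimensions
lm_head_params = vocab * hidden               # ~0.53B parameters
w4g128_bits = 4 + 16 / 128                    # ~4.125 effective bits incl. FP16 scales

fp16_lm_head = lm_head_params * 2 / GB                 # ~0.98 GiB in FP16
int4_lm_head = lm_head_params * w4g128_bits / 8 / GB   # ~0.25 GiB at W4G128

print(f"lm_head saving: {fp16_lm_head - int4_lm_head:.2f} GiB")
# Against a W4G128 LLaMA-3-8B checkpoint of roughly 5-6 GiB, this is
# on the order of 10-15% of the total model size.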

We are also exploring other data types such as w4a4, w4a8, and mx format.

cc @wenhuach21

yiliu30 commented 4 months ago

And the second one is about 8-bit quants: do you already have some benchmarks?

While 8-bit quantization is supported, we have not run extensive benchmarks because our 4-bit quantization results are already quite good :). If you are interested, we are happy to run some tests.

yiliu30 commented 4 months ago

Question about inference: I see that you also support exporting to GPTQ format. What is the benefit of adding an extra autoround option for inference?

Ah, I see you fixed the asymmetric quantization negative zeros bug. There is work on fixing this with a new gptq format revision, which you have probably seen. Linking it just in case: AutoGPTQ/AutoGPTQ#640

Thank you for sharing this information. AutoRound supports exporting the format compatible with GPTQ-Marlin kernel as well.

danieldk commented 4 months ago

Yes, AutoRound supports exporting to AutoGPTQ format without quality loss, but it is for pure 4-bit or 8-bit models.

Currently, AutoGPTQ can only apply a single quantization configuration to all target layers. In contrast, we support quantizing models with mixed bit-widths to balance accuracy and inference speed.

That's really nice! I wonder if 'mixed-bitness' could be considered for a GPTQ v2 format as well. I think ideally, every quantizer that uses quantized weights, scales, biases, and scale grouping would use the same GPTQ-based format. This would allow us to switch out kernels when new options become available. There has been a lot of development in this space and every improvement in GPTQ inference performance has benefitted all GPTQ format-based models.

With respect to training, text-generation-server quantize exists primarily for historical reasons, but we think it's best for users to use AutoGPTQ, Optimum, etc. to quantize models.

wenhuach21 commented 4 months ago

Yes, AutoRound supports exporting to the AutoGPTQ format without quality loss, but only for pure 4-bit or 8-bit models. Currently, AutoGPTQ can only apply a single quantization configuration to all target layers. In contrast, we support quantizing models with mixed bit-widths to balance accuracy and inference speed.

That's really nice! I wonder if 'mixed-bitness' could be considered for a GPTQ v2 format as well. I think ideally, every quantizer that uses quantized weights, scales, biases, and scale grouping would use the same GPTQ-based format. This would allow us to switch out kernels when new options become available. There has been a lot of development in this space and every improvement in GPTQ inference performance has benefitted all GPTQ format-based models.

With respect to training, text-generation-server quantize exists primarily for historical reasons, but we think it's best for users to use AutoGPTQ, Optimum, etc. to quantize models.

Yes, I agree. From the TGI side, this is the ideal scenario. However, unifying everything is challenging for various reasons, similar to why multiple LLM serving frameworks exist. For example, AutoGPTQ limits the calibration dataset to about three choices and throws an error if you specify others. Additionally, GPTQ v2's pull request has been open for 2-3 months with no indication of whether it will be merged.

For AutoRound, we currently specify the backend name in the model config (e.g., gptq:exllamav2; see https://huggingface.co/Intel/Mistral-7B-v0.1-int4-inc-lmhead/blob/main/config.json) and will switch to the GPTQ backend as the default CUDA kernel once their issue is fixed. Therefore, I believe TGI should have little difficulty switching to a better option in the future for all GPTQ-based models.

flozi00 commented 4 months ago

I tried to export a model using the GPTQ format, but it does not seem to be Marlin compatible. Could you specify how to export Marlin-compatible weights?

# Assumes `model` and `tokenizer` have already been loaded, e.g. with transformers.
from auto_round import AutoRound

bits, group_size, sym = 8, 128, True
autoround = AutoRound(
    model,
    tokenizer,
    bits=bits,
    group_size=group_size,
    sym=sym,
    batch_size=8,
    device="auto",
    iters=10,
    n_samples=10,
    dataset="pL-Community/EduScorerGerman",
    backend="gptq:marlin",
)
autoround.quantize()
output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir, format="auto_gptq")

wenhuach21 commented 4 months ago

I tried to export a model using the gptq format but it seems to be not marlin compatible. Could you specify how to export the marlin compatible weights ?

Yes, sorry for the inconvenience. We will provide support within 1-2 days and keep you updated.
Since this backend is not yet supported in Transformers and only supports symmetric quantization, we haven't provided the API to users.

flozi00 commented 4 months ago

I just started a quantization with 1000 samples and iterations. Hopefully it can be loaded with TGI's Marlin GPTQ support (https://github.com/huggingface/text-generation-inference/pull/2111). I will report back here.

wenhuach21 commented 4 months ago

For debugging, use 32 samples with 2 iterations and disable_low_gpu_mem_usage for much faster performance. By default, we use 512 samples and 200 iterations and will support a fast config soon. Additionally, I think --sym needs to be added.
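
For reference, here is a minimal sketch of such a debugging run. The argument names mirror the snippet above and may differ between auto-round versions, and the model id is only an example:

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # example model id
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Debug-speed settings only; accuracy will be worse than with the default
# 512 samples / 200 iterations.
autoround = AutoRound(
    model,
    tokenizer,
    bits=8,
    group_size=128,
    sym=True,      # --sym, needed for the Marlin-compatible export path
    iters=2,       # quick smoke test
    n_samples=32,  # small calibration set
)
autoround.quantize()
autoround.save_quantized("./tmp_autoround", format="auto_gptq")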

Currently, the packing format in AutoRound is the Triton one, the same as ExLlamaV2. I'm not sure whether TGI supports converting between Marlin and ExLlamaV2, as they are different formats to my knowledge.

wenhuach21 commented 4 months ago

With sym (https://github.com/intel/auto-round/pull/168), I have verified Marlin on opt-125m and will verify LLaMA3 later.

wenhuach21 commented 4 months ago

@flozi00 Hi, I have fixed the issue; please double-check.

auto-gptq==0.7.1

text = "There is a girl who likes adventure,"

opt125m Transformers API: There is a girl who likes adventure, and she is a girl who likes adventure. I'm not sure if you're being sarcastic or not, but I'm pretty sure you're being sarcastic. I'm not sure if you're being sarcastic or not, but I'm pretty sure

opt125m AutoGPTQ marlin API: "There is a girl who likes adventure, and she is a girl who likes adventure. I'm not sure if you're being sarcastic or not, but I'm pretty sure you're being sarcastic. I'm not sure if you're being sarcastic or not, but I'm pretty sure"

LLAMA3-8B-Instruct Transformers API: There is a girl who likes adventure, and she is always ready to take on new challenges. She is a true adventurer at heart, and she loves to explore new places and try new things. She is also very brave and never backs down from a challenge, even if it seems scary or

LLAMA3-8B-Instruct AutoGPTQ marlin API: There is a girl who likes adventure, and she is always ready to take on new challenges. She is a true adventurer at heart, and she loves to explore new places and try new things. She is also very brave and never backs down from a challenge, even if it seems scary or

reference cmd

CUDA_VISIBLE_DEVICES=0 \
python3 main.py \
    --model_name $model_name \
    --nsamples 128 \
    --seqlen 512 \
    --sym \
    --disable_low_gpu_mem_usage \
    --disable_eval \
    --deployment_device 'gpu'

If needed, we can also support exporting directly to the Marlin format, since loading the GPTQ format with the Marlin kernel currently requires a repacking step.

flozi00 commented 4 months ago

That would make it a lot easier. At the moment I still need to repack it after the export to make it Marlin/TGI compatible.

wenhuach21 commented 4 months ago

That would make it a lot easier. At the moment I still need to repack it after the export to make it Marlin/TGI compatible.

Sure, we will support it tomorrow.

flozi00 commented 4 months ago

I can confirm that the Marlin kernels for GPTQ in TGI are working with the models exported from the Auto-Round main branch.

wenhuach21 commented 4 months ago

@flozi00

We have added support for packing directly to the AutoRound format in https://github.com/intel/auto-round/pull/172 by setting --deployment_device 'auto_round:marlin' in our latest update. This feature will be merged after extensive testing.

Regarding exporting to the AutoGPTQ Marlin format, we found that the current AutoGPTQ API still performs repacking even when the Marlin format is used. Therefore, we do not plan to support that export, as exporting to ExLlamaV2 is more compatible.

Test result (LLAMA3-8B-Instruct): There is a girl who likes adventure, and she is always ready to take on new challenges. She is a free spirit, and she loves to explore new places and try new things. She is also very curious, and she loves to learn new things. She is a bit of a thrill

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.