Hi, two questions from my side. I saw you have a function to export to the AutoGPTQ format, so it should already be possible to do inference, or am I wrong? Is there any quality loss to expect when doing this?
And the second one is about 8-bit quants: do you already have some benchmarks?
I would be interested in integrating this :)
Awesome work!
Question about inference: I see that you also support exporting to GPTQ format. What is the benefit of adding an extra autoround option for inference?
(Some background: we are considering switching to GPTQ-Marlin for supported configurations, since we see much-improved throughput.)
Ah, I see you fixed the asymmetric quantization negative zeros bug. There is work on fixing this with a new gptq format revision, which you have probably seen. Linking it just in case: https://github.com/AutoGPTQ/AutoGPTQ/pull/640
I saw you have a function to export to the AutoGPTQ format, so it should already be possible to do inference, or am I wrong? Is there any quality loss to expect when doing this?
Question about inference: I see that you also support exporting to GPTQ format. What is the benefit of adding an extra autoround option for inference?
Hi @flozi00 and @danieldk ,
Yes, AutoRound supports exporting to the AutoGPTQ format without quality loss, but only for pure 4-bit or 8-bit models.
Currently, AutoGPTQ can only apply a single quantization configuration to all target layers. In contrast, we support quantizing models with mixed bit widths to balance accuracy and inference speed (see the sketch below).
Additionally, auto-round supports quantizing the lm_head, which can save memory with negligible accuracy loss. For example, it can reduce the model size by more than 10% for W4G128 LLAMA3-8B.
We are also exploring other data types such as w4a4, w4a8, and the mx format.
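For illustration only, here is a minimal sketch of a mixed-bit AutoRound run; the per-layer configuration argument name (layer_config) and its accepted keys are assumptions that may differ between auto-round releases.

from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # small model, purely for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Assumed per-layer overrides: most layers stay at 4 bits, one layer gets 8 bits.
layer_config = {
    "model.decoder.layers.0.self_attn.q_proj": {"bits": 8, "group_size": 128},
}

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    layer_config=layer_config,  # assumption: per-layer override support
)
autoround.quantize()
autoround.save_quantized("./opt-125m-mixed-bit", format="auto_round")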
cc @wenhuach21
And the second one is about 8-bit quants: do you already have some benchmarks?
While 8-bit quantization is supported, extensive benchmarks have not been conducted because our 4-bit quantization results are already quite good :). If you are interested, we are happy to conduct some tests.
Question about inference: I see that you also support exporting to GPTQ format. What is the benefit of adding an extra autoround option for inference?
Ah, I see you fixed the asymmetric quantization negative zeros bug. There is work on fixing this with a new gptq format revision, which you have probably seen. Linking it just in case: AutoGPTQ/AutoGPTQ#640
Thank you for sharing this information. AutoRound supports exporting a format compatible with the GPTQ-Marlin kernel as well.
Yes, AutoRound supports exporting to the AutoGPTQ format without quality loss, but only for pure 4-bit or 8-bit models.
Currently, AutoGPTQ can only apply a single quantization configuration to all target layers. In contrast, we support quantizing models with mixed bit widths to balance accuracy and inference speed.
That's really nice! I wonder if 'mixed-bitness' could be considered for a GPTQ v2 format as well. I think ideally, every quantizer that uses quantized weights, scales, biases, and scale grouping would use the same GPTQ-based format. This would allow us to switch out kernels when new options become available. There has been a lot of development in this space and every improvement in GPTQ inference performance has benefitted all GPTQ format-based models.
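As a point of reference for this format discussion, here is a hedged sketch of the per-layer tensors a GPTQ-style checkpoint typically stores (packed integer weights, zero points, per-group scales, and an optional group index); the file name is a placeholder.

from safetensors import safe_open

# List the quantization-related tensors in a GPTQ-format checkpoint.
with safe_open("model.safetensors", framework="pt") as f:
    for name in f.keys():
        if name.endswith((".qweight", ".qzeros", ".scales", ".g_idx")):
            print(name, f.get_slice(name).get_shape())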
With respect to training, text-generation-server quantize exists primarily for historical reasons, but we think it's best for users to use AutoGPTQ, Optimum, etc. to quantize models.
Yes, I agree. From the TGI side, this is the ideal scenario. However, unifying everything is challenging for various reasons, similar to the existence of multiple LLM serving frameworks. For example, AutoGPTQ limits its calibration data to about three datasets and throws an error if you specify others. Additionally, GPTQv2's pull request has been open for 2-3 months with no indication of whether it will be merged.
For AutoRound, we currently specify the backend name (e.g., gptq:exllamav2 or others, see https://huggingface.co/Intel/Mistral-7B-v0.1-int4-inc-lmhead/blob/main/config.json) and will switch to the GPTQ backend as the default CUDA kernel once their issue is fixed. Therefore, I believe TGI should have little difficulty switching to a better option in the future for all GPTQ-based models.
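For illustration, a hedged sketch of how a serving framework could pick a kernel from the backend string stored in the exported config.json; the key names are assumed from the linked file and may differ between releases.

import json

# Read the exported model's config and inspect its quantization settings.
with open("config.json") as f:
    cfg = json.load(f)

quant_cfg = cfg.get("quantization_config", {})
backend = quant_cfg.get("backend", "gptq:exllamav2")  # e.g. "gptq:exllamav2" or "gptq:marlin"
kernel = backend.split(":")[-1]
print(f"bits={quant_cfg.get('bits')} group_size={quant_cfg.get('group_size')} kernel={kernel}")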
I tried to export a model using the GPTQ format, but it seems to not be Marlin-compatible. Could you specify how to export Marlin-compatible weights?
from auto_round import AutoRound

# model and tokenizer are assumed to be loaded beforehand via transformers
bits, group_size, sym = 8, 128, True
autoround = AutoRound(
    model,
    tokenizer,
    bits=bits,
    group_size=group_size,
    sym=sym,
    batch_size=8,
    device="auto",
    iters=10,
    n_samples=10,
    dataset="pL-Community/EduScorerGerman",
    backend="gptq:marlin",
)
autoround.quantize()

output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir, format="auto_gptq")
I tried to export a model using the GPTQ format, but it seems to not be Marlin-compatible. Could you specify how to export Marlin-compatible weights?
Yes, sorry for the inconvenience. We will provide support within 1-2 days and keep you updated.
Since this backend is not yet supported in Transformers and only supports symmetric quantization, we haven't provided the API to users.
I just started a quantization with 1000 samples and iters. Hopefully it can be loaded with the TGI Marlin GPTQ support: https://github.com/huggingface/text-generation-inference/pull/2111. Will report here.
For debugging, use 32 samples with 2 iterations and disable_low_gpu_mem_usage for much faster performance. By default, we use 512 samples and 200 iterations and will support a fast config soon. Additionally, I think --sym needs to be added.
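For illustration, a hedged sketch of these debug settings with the Python API used earlier in this thread; the low_gpu_mem_usage argument name is an assumption and may differ between releases.

from auto_round import AutoRound

# Quick debug run: few samples and iterations, symmetric quantization.
autoround = AutoRound(
    model,                    # model and tokenizer assumed to be loaded as above
    tokenizer,
    bits=8,
    group_size=128,
    sym=True,                 # symmetric quantization, required for the Marlin kernel
    iters=2,                  # 2 tuning iterations
    n_samples=32,             # 32 calibration samples
    low_gpu_mem_usage=False,  # assumption: disables the low-GPU-memory mode for speed
)
autoround.quantize()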
Currently, the packing format in AutoRound is Triton, the same as ExLlamaV2. I'm not sure whether TGI supports conversion between Marlin and ExLlamaV2, as they are different formats to my knowledge.
https://github.com/intel/auto-round/pull/168 adds sym support. I have verified Marlin on opt-125m and will verify LLAMA3 later.
@flozi00 Hi, I have fixed the issue, please double-check.
auto-gptq==0.7.1
text = "There is a girl who likes adventure,"
opt125m Transformers API: There is a girl who likes adventure, and she is a girl who likes adventure. I'm not sure if you're being sarcastic or not, but I'm pretty sure you're being sarcastic. I'm not sure if you're being sarcastic or not, but I'm pretty sure
opt125m AutoGPTQ marlin API: "There is a girl who likes adventure, and she is a girl who likes adventure. I'm not sure if you're being sarcastic or not, but I'm pretty sure you're being sarcastic. I'm not sure if you're being sarcastic or not, but I'm pretty sure"
LLAMA3-8B-Instruct Transformers API: There is a girl who likes adventure, and she is always ready to take on new challenges. She is a true adventurer at heart, and she loves to explore new places and try new things. She is also very brave and never backs down from a challenge, even if it seems scary or
LLAMA3-8B-Instruct AutoGPTQ marlin API: There is a girl who likes adventure, and she is always ready to take on new challenges. She is a true adventurer at heart, and she loves to explore new places and try new things. She is also very brave and never backs down from a challenge, even if it seems scary or
reference cmd
CUDA_VISIBLE_DEVICES=0 \
python3 main.py \
--model_name $model_name \
--nsamples 128 \
--seqlen 512 \
--sym \
--disable_low_gpu_mem_usage \
--disable_eval \
--deployment_device 'gpu'
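For completeness, a hedged sketch of how the AutoGPTQ Marlin generations above could be reproduced with auto-gptq==0.7.1; the checkpoint path is a placeholder and the exact loading arguments are an assumption.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_dir = "./tmp_autoround"  # placeholder path to the exported checkpoint

tokenizer = AutoTokenizer.from_pretrained(quantized_dir)
# use_marlin is assumed to select the Marlin kernel in auto-gptq 0.7.x
model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0", use_marlin=True)

inputs = tokenizer("There is a girl who likes adventure,", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))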
If needed, we can support exporting to the Marlin format directly, which would avoid the repacking process.
That would make it a lot easier. At the moment I still need to repack it after the export to be Marlin/TGI compatible.
Sure, we will support it tomorrow.
I can confirm that the Marlin kernels for GPTQ in TGI are working with the exported models from the AutoRound main branch.
@flozi00
We have added support for packing directly to the AutoRound format in https://github.com/intel/auto-round/pull/172 by setting --deployment_device 'auto_round:marlin' in our latest update. This feature will be merged after extensive testing.
Regarding exporting to the AutoGPTQ format, we found that the current AutoGPTQ API still performs repacking even with the Marlin format. Therefore, we do not plan to support this, as exporting to ExLlamaV2 is more compatible.
Test result for LLAMA3-8B-Instruct: There is a girl who likes adventure, and she is always ready to take on new challenges. She is a free spirit, and she loves to explore new places and try new things. She is also very curious, and she loves to learn new things. She is a bit of a thrill
Hi, this is the INC team from Intel. Thank you for developing this amazing project.
Motivation
Our team has developed a new weight-only quantization algorithm called Auto-Round. It achieves superior accuracy compared to GPTQ, AWQ, and OmniQuant across 11 tasks, particularly excelling in low-bit quantization (e.g., 2-bit and 3-bit). Auto-Round supports quantization from 2 to 8 bits, involves low tuning costs, and imposes no additional overhead during inference. Key results and detailed information are available in our paper, GitHub repository, and the Hugging Face low-bit quantization leaderboard.
We would like to contribute this quantization algorithm to TGI and enable users to:
1. Quantize a Floating-Point Model Using Auto-Round
Extend the current quantize API and add method as a new argument for selecting different quantization algorithms; a hedged sketch of the resulting calling flow is given at the end of this section. We propose two options to implement it:
Option 1: Adding Auto-Round as a New Python Dependency (Recommended)
Auto-Round is currently released as a pure Python package. This option adds auto-round to TGI's requirements_xx.txt and calls Auto-Round's API to obtain the quantized model.
Advantages:
Option 2: Porting All Source Code of Auto-Round into TGI
We are also willing to integrate all source code of Auto-Round directly into TGI.
Advantages:
Here is the overall calling flow for these two options:
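A hedged Python sketch of what Option 1 could look like inside TGI's quantize entry point; the function name, its signature, and the auto-round arguments are illustrative assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer


def quantize(model_id: str, output_dir: str, method: str = "gptq"):
    # Hypothetical TGI quantize entry point extended with a `method` argument.
    model = AutoModelForCausalLM.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    if method == "autoround":
        # Option 1: call the auto-round pip package directly.
        from auto_round import AutoRound

        autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
        autoround.quantize()
        autoround.save_quantized(output_dir, format="auto_gptq")
    else:
        # Existing GPTQ-based path (unchanged).
        ...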
2. Perform Inference with an AutoRound-quantized Model.
We propose extending the current text-generation-launcher API to include autoround as a new option within --quantize, so users can select it when launching the server.
Your feedback is important. Please feel free to comment on the options mentioned above or suggest additional approaches so we can find the most appropriate way to contribute :). Thank you in advance!