So the vLLM docs probably need to be updated to an example that actually works:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
quant_path = 'mistral-instruct-v0.2-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# Load model
model = AutoAWQForCausalLM.from_pretrained(
model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')
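For reference, here is a minimal sketch (not part of the original example) of loading the resulting checkpoint with vLLM; the prompt and sampling settings are placeholders:
# Minimal sketch: serve the quantized checkpoint produced above with vLLM.
from vllm import LLM, SamplingParams
# Point vLLM at the directory written by save_quantized(); AWQ is also
# auto-detected from the checkpoint config, so quantization="awq" is optional.
llm = LLM(model="mistral-instruct-v0.2-awq", quantization="awq")
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What does AWQ quantization do?"], sampling_params)
print(outputs[0].outputs[0].text)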
I have filed a PR to fix the datasets version there: https://github.com/casper-hansen/AutoAWQ/pull/593, and one to fix the example: https://github.com/casper-hansen/AutoAWQ/pull/595
Can you post a PR with the change?
@stas00 AWQ is great BTW. However, if you have high-QPS workloads or offline workloads, I would suggest using activation quantization to get the best performance. With activation quantization, we can use the lower-bit tensor cores, which have 2x the FLOPs. This means we can accelerate the compute-bound regime (which becomes the bottleneck). AWQ 4-bit will still get the best possible latency in very low-QPS regimes (e.g. QPS = 1), but outside of this, activation quantization will dominate.
Some benchmarks analyzing this are in this blog post:
Here are some examples of how to make activation-quantized models for vLLM:
I figured this might be useful for you.
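As a rough illustration of the above (my own sketch, not the examples linked here), running an activation-quantized (FP8 W8A8) model in vLLM looks roughly like this; the checkpoint name below is just one example of a pre-quantized model:
# Rough sketch of vLLM's FP8 support, not taken from the linked examples.
from vllm import LLM, SamplingParams
# Option 1: load a checkpoint that was already activation-quantized offline
# (example repo name - substitute whichever FP8/W8A8 checkpoint you use).
llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8")
# Option 2: quantize weights to FP8 on the fly from an unquantized checkpoint,
# with activation scales computed dynamically at runtime:
# llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)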
Can you post a PR with the change?
done: https://github.com/vllm-project/vllm/pull/7937
I'd love to experiment with your suggestions, Robert. Do I need to use your fork for that?
But first I need to figure out how to reliably measure performance so that I can measure the impact. Currently, as I reported in https://github.com/vllm-project/vllm/issues/7935, it doesn't scale when using the OpenAI client. What benchmarks do you use to compare the performance of various quantization techniques?
Thank you!
Nope, you do not need the fork. These methods are all supported in vLLM.
Re: OpenAI client performance - Nick and I are working on it.
📚 The doc issue
The quantization example at https://docs.vllm.ai/en/latest/quantization/auto_awq.html can't be run - it looks like AWQ is looking for safetensors files and https://huggingface.co/lmsys/vicuna-7b-v1.5/tree/main doesn't have them.
autoawq==0.2.6
Suggest a potential alternative/fix
I tried another model that has .safetensors files but then it fails with:
I see that this example was copied from https://github.com/casper-hansen/AutoAWQ?tab=readme-ov-file#examples - it's identical, and it's broken at the source as well.
edit: I think the issue is the datasets version - I'm able to run this version https://github.com/casper-hansen/AutoAWQ/blob/6f14fc7436d9a3fb5fc69299e4eb37db4ee9c891/examples/quantize.py with datasets==2.21.0.
The version from https://docs.vllm.ai/en/latest/quantization/auto_awq.html still fails as explained above.
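Until the datasets pin lands, one possible workaround (my own sketch, assuming AutoAWQ's quantize() accepts a list of calibration texts via calib_data; the calibration text below is a placeholder repeated just so the sketch runs) is to supply your own calibration data so the default calibration dataset never has to be downloaded:
# Sketch of a possible workaround: pass calibration texts directly so AutoAWQ
# does not need to download its default calibration dataset.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
quant_path = 'mistral-instruct-v0.2-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Placeholder calibration text, repeated only to give the quantizer enough tokens;
# in practice use a few hundred representative samples from your own data.
calib_data = [
    "vLLM is a high-throughput and memory-efficient inference engine for large "
    "language models. It uses PagedAttention to manage the KV cache efficiently."
] * 128
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)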