vllm-project / vllm

[Doc]: AutoAWQ quantization example fails #7717

Closed: stas00 closed this issue 2 weeks ago

stas00 commented 3 weeks ago

📚 The doc issue

The quantization example at https://docs.vllm.ai/en/latest/quantization/auto_awq.html can't be run: AutoAWQ looks for safetensors files, and https://huggingface.co/lmsys/vicuna-7b-v1.5/tree/main doesn't have any.
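For reference, the snippet on that docs page is essentially the AutoAWQ quantize example pointed at lmsys/vicuna-7b-v1.5 - reconstructed roughly below from memory, so it may not match the docs verbatim:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# NOTE: rough reconstruction of the docs snippet, not a verbatim copy
model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model - this is the step that fails, since the repo ships no safetensors files
model = AutoAWQForCausalLM.from_pretrained(model_path, **{"low_cpu_mem_usage": True})
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize and save
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Running it fails with: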

    return model_class.from_pretrained(
  File "/env/lib/conda/stas-inference/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3477, in from_pretrained
    raise EnvironmentError(
OSError: Error no file named model.safetensors found in directory /data/huggingface/hub/models--lmsys--vicuna-7b-v1.5/snapshots/3321f76e3f527bd14065daf69dad9344000a201d.

autoawq==0.2.6

Suggest a potential alternative/fix

I tried another model that has .safetensors files but then it fails with:

  File "/env/lib/conda/stas-inference/lib/python3.10/site-packages/datasets/data_files.py", line 332, in resolve_pattern
    fs, _, _ = get_fs_token_paths(pattern, storage_options=storage_options)
  File "/env/lib/conda/stas-inference/lib/python3.10/site-packages/fsspec/core.py", line 681, in get_fs_token_paths
    paths = [f for f in sorted(fs.glob(paths)) if not fs.isdir(f)]
  File "/env/lib/conda/stas-inference/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 417, in glob
    return super().glob(path, **kwargs)
  File "/env/lib/conda/stas-inference/lib/python3.10/site-packages/fsspec/spec.py", line 613, in glob
    pattern = glob_translate(path + ("/" if ends_with_sep else ""))
  File "/env/lib/conda/stas-inference/lib/python3.10/site-packages/fsspec/utils.py", line 732, in glob_translate
    raise ValueError(
ValueError: Invalid pattern: '**' can only be an entire path component

I see that this example was copied from https://github.com/casper-hansen/AutoAWQ?tab=readme-ov-file#examples - it is identical there and broken at the source as well.

edit: I think the issue is the datasets version - I'm able to run this version of the example, https://github.com/casper-hansen/AutoAWQ/blob/6f14fc7436d9a3fb5fc69299e4eb37db4ee9c891/examples/quantize.py, with datasets==2.21.0.

The version from https://docs.vllm.ai/en/latest/quantization/auto_awq.html still fails as explained above.

stas00 commented 3 weeks ago

So the vLLM docs probably need to be updated to an example that actually works, e.g.:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
quant_path = 'mistral-instruct-v0.2-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
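For completeness, the quantized checkpoint can then be loaded back into vLLM - a minimal sketch, using the quant_path from above:

from vllm import LLM, SamplingParams

# Load the AWQ checkpoint saved by the script above
llm = LLM(model="mistral-instruct-v0.2-awq", quantization="awq")

outputs = llm.generate(["What does AWQ quantization do?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)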

I have filed PRs upstream to fix the datasets version (https://github.com/casper-hansen/AutoAWQ/pull/593) and the example (https://github.com/casper-hansen/AutoAWQ/pull/595).

robertgshaw2-neuralmagic commented 3 weeks ago

Can you post a PR with the change?

robertgshaw2-neuralmagic commented 3 weeks ago

@stas00 AWQ is great BTW. However, if you have high-QPS or offline workloads, I would suggest using activation quantization to get the best performance. With activation quantization we can use the lower-bit tensor cores, which have 2x the FLOPs, so we can accelerate the compute-bound regime (which becomes the bottleneck). 4-bit AWQ will still get the best possible latency in very low QPS regimes (e.g. QPS = 1), but outside of this, activation quantization will dominate.

Some benchmarks analyzing this result are in this blog:

Here are some examples of how to make activation-quantized models for vLLM:

I figured this might be useful for you.
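For concreteness, enabling weight-and-activation quantization in vLLM can look roughly like this - a minimal sketch using the dynamic FP8 path, assuming an FP8-capable GPU such as H100:

from vllm import LLM

# Minimal sketch (assumed setup): dynamic FP8 quantization of weights and
# activations at load time, so the lower-precision tensor cores are used in
# the compute-bound (high-QPS / offline batch) regime.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", quantization="fp8")
print(llm.generate(["Hello, my name is"])[0].outputs[0].text)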

stas00 commented 2 weeks ago

> Can you post a PR with the change?

done: https://github.com/vllm-project/vllm/pull/7937

I'd love to experiment with your suggestions, Robert. Do I need to use your fork for that?

But first I need to figure out how to reliably measure performance so that I can measure the impact; currently, as I reported in https://github.com/vllm-project/vllm/issues/7935, it doesn't scale when using the OpenAI client. What benchmarks do you use to compare the performance of various quantization techniques?
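For now, a rough offline timing sketch like the one below can at least compare configurations (an approximation with made-up settings, not a proper benchmark):

import time
from vllm import LLM, SamplingParams

# Rough offline throughput check: time a fixed batch of generations.
# (sketch only - swap in whichever checkpoint/quantization you are comparing)
llm = LLM(model="mistral-instruct-v0.2-awq", quantization="awq")
prompts = ["Summarize the history of GPUs."] * 64
params = SamplingParams(max_tokens=128, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s")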

Thank you!

robertgshaw2-neuralmagic commented 2 weeks ago

> I'd love to experiment with your suggestions, Robert. Do I need to use your fork for that?
>
> But first I need to figure out how to reliably measure performance so that I can measure the impact; currently, as I reported in #7935, it doesn't scale when using the OpenAI client. What benchmarks do you use to compare the performance of various quantization techniques?

Nope, you do not need the fork. These methods are all supported in vLLM.

Re: OpenAI performance - Nick and I are working on it.