vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM

Using Quantized Models with vLLM CPU Backend #134

Closed: miracatici closed this issue 3 days ago

miracatici commented 2 weeks ago

Describe the bug
When I start the vLLM OpenAI server with Meta-Llama-3.1-8B-Instruct, it works.

python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct

However, when I try the same script with neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8, it raises an error. The only difference I see in the vLLM server logs is the model config reporting quantization=compressed-tensors.

python3 -m vllm.entrypoints.openai.api_server --model neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8
Exception in worker VllmWorkerProcess while processing method load_model: , Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_worker.py", line 217, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 125, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 341, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 174, in _initialize_model
    quant_config=_get_quantization_config(model_config, load_config),
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 98, in _get_quantization_config
    capability = current_platform.get_device_capability()
  File "/usr/local/lib/python3.10/dist-packages/vllm/platforms/interface.py", line 28, in get_device_capability
    raise NotImplementedError
NotImplementedError
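
For context, here is a minimal sketch of the failure path implied by the traceback. The names are simplified stand-ins (CpuPlatformStub and the one-argument _get_quantization_config are mine, not vLLM's actual classes or signatures); the point is that the quantized path queries a GPU compute capability that the CPU platform does not implement, while the unquantized path never makes that call.

# Simplified stand-ins, not vLLM's actual implementation.

class CpuPlatformStub:
    """Hypothetical stand-in for the platform object on a CPU-only build."""

    def get_device_capability(self):
        # No CUDA compute capability to report on CPU.
        raise NotImplementedError


current_platform = CpuPlatformStub()


def _get_quantization_config(quantization):
    """Stand-in for the loader's quantization-config lookup (signature simplified)."""
    if quantization is None:
        # Unquantized models skip the capability check entirely,
        # which is why the original Llama 3.1 8B Instruct loads fine.
        return None
    # Quantized models (e.g. "compressed-tensors") query the GPU compute
    # capability to validate the scheme; on CPU this is what raises.
    capability = current_platform.get_device_capability()
    return {"quantization": quantization, "capability": capability}


if __name__ == "__main__":
    print(_get_quantization_config(None))  # unquantized: returns None
    try:
        _get_quantization_config("compressed-tensors")
    except NotImplementedError:
        print("NotImplementedError, matching the traceback above")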

Expected behavior
The vLLM OpenAI-compatible server runs, the same as with the uncompressed original Llama 3.1 8B model.

Environment:

  1. OS: Ubuntu 22.04.1
  2. Python version: 3.10.12
  3. ML framework version(s): torch 2.4.0+cpu
  4. Other Python package versions: vLLM 0.5.5+cpu

To Reproduce

  1. Start an Ubuntu 22.04 Docker container from the official image
  2. Install the vLLM CPU backend from source, as described in the docs
  3. Run the server command above with the quantized model
robertgshaw2-neuralmagic commented 1 week ago

There is an active PR from Intel to support this: https://github.com/vllm-project/vllm/pull/7257

robertgshaw2-neuralmagic commented 3 days ago

PR is merged!
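
For anyone landing here later, a quick way to confirm the fix, assuming a vLLM build new enough to include the merged PR (the 0.5.5+cpu build above predates it); the prompt and sampling settings are arbitrary:

from vllm import LLM, SamplingParams

# Load the quantized checkpoint on the CPU backend; this is the step that
# previously failed with NotImplementedError.
llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8")

# Generate a short completion to confirm the model actually runs.
params = SamplingParams(max_tokens=32)
outputs = llm.generate(["Say hello in one sentence."], params)
print(outputs[0].outputs[0].text)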