Describe the bug
When I start the vLLM OpenAI-compatible server with Meta Llama 3.1 8B Instruct, it works. However, when I try the same script with neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8, it raises an error. The only difference I see in the LLM server logs is the model config showing quantization=compressed-tensors.
Exception in worker VllmWorkerProcess while processing method load_model: , Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
output = executor(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_worker.py", line 217, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 125, in load_model
self.model = get_model(model_config=self.model_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
return loader.load_model(model_config=model_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 341, in load_model
model = _initialize_model(model_config, self.load_config,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 174, in _initialize_model
quant_config=_get_quantization_config(model_config, load_config),
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 98, in _get_quantization_config
capability = current_platform.get_device_capability()
File "/usr/local/lib/python3.10/dist-packages/vllm/platforms/interface.py", line 28, in get_device_capability
raise NotImplementedError
NotImplementedError
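The traceback suggests that when a quantization config is present, the loader unconditionally queries the device capability, which the CPU platform does not implement. A minimal sketch of that pattern (class and function names are hypothetical, not vLLM's actual classes):

```python
class Platform:
    """Base platform: GPU backends override get_device_capability."""

    def get_device_capability(self):
        # Mirrors vllm/platforms/interface.py: the base method is a stub.
        raise NotImplementedError


class CpuPlatform(Platform):
    # The CPU platform has no device capability, so the base stub is hit.
    pass


def get_quantization_config(platform):
    # Querying the capability before checking whether the backend
    # actually needs it reproduces the failure seen in the traceback.
    capability = platform.get_device_capability()
    return capability


try:
    get_quantization_config(CpuPlatform())
except NotImplementedError:
    print("NotImplementedError raised, as in the traceback")
```

This is only an illustration of the failing call pattern, not vLLM's source; the fix would presumably be to skip or guard the capability query on platforms that do not implement it.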
Expected behavior
The vLLM OpenAI-compatible server runs, the same as with the uncompressed original Llama 3.1 8B model.
Environment
Include all relevant environment information:
OS: Ubuntu 22.04.1
Python version: 3.10.12
ML framework version(s): torch 2.4.0+cpu
Other Python package versions: vLLM 0.5.5+cpu
To Reproduce
Start an Ubuntu 22.04 Docker container from the official image.
Install the vLLM CPU backend by building from source, as described in the docs.
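For reference, the server was launched along these lines. The report does not include the exact script, so the flags here are an assumption based on vLLM's standard OpenAI-compatible entrypoint; only the model name is taken from the report:

```shell
# Assumed launch command (exact script not given in the report).
python -m vllm.entrypoints.openai.api_server \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8
```

Swapping the model for meta-llama/Meta-Llama-3.1-8B-Instruct in the same invocation works, per the description above.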