vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

KeyError: 'quant_method' #1912

Closed. jim1997 closed this issue 6 months ago.

jim1997 commented 10 months ago

I was running the following command to start an API server with a pre-downloaded Baichuan model:

```
python3 -m vllm.entrypoints.openai.api_server --model ./baichuan-inc/Baichuan2-13B-Chat-4bits --trust-remote-code
```

However, this error occurred:

```
INFO 12-04 01:53:01 api_server.py:638] args: Namespace(allow_credentials=False, allowed_headers=[''], allowed_methods=[''], allowed_origins=['*'], block_size=16, disable_log_requests=False, disable_log_stats=False, download_dir=None, dtype='auto', engine_use_ray=False, gpu_memory_utilization=0.9, host=None, load_format='auto', max_log_len=None, max_model_len=None, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, model='./baichuan-inc/Baichuan2-13B-Chat-4bits', pipeline_parallel_size=1, port=8000, quantization=None, revision=None, seed=0, served_model_name=None, swap_space=4, tensor_parallel_size=1, tokenizer=None, tokenizer_mode='auto', tokenizer_revision=None, trust_remote_code=True, worker_use_ray=False)
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/openai/api_server.py", line 646, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 480, in from_engine_args
    engine_configs = engine_args.create_engine_configs()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/arg_utils.py", line 187, in create_engine_configs
    model_config = ModelConfig(self.model, self.tokenizer,
  File "/usr/local/lib/python3.8/dist-packages/vllm/config.py", line 97, in __init__
    self._verify_quantization()
  File "/usr/local/lib/python3.8/dist-packages/vllm/config.py", line 125, in _verify_quantization
    hf_quant_method = str(hf_quant_config["quant_method"]).lower()
KeyError: 'quant_method'
```
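The last two frames show the failing step: vLLM finds a quantization_config section in the model's config.json and assumes it contains a quant_method key. Paraphrased as a minimal sketch (reconstructed from the traceback, not the exact vLLM source):

```python
import json

# Read the checkpoint's HF config, as vLLM does during engine setup.
with open("./baichuan-inc/Baichuan2-13B-Chat-4bits/config.json") as f:
    hf_config = json.load(f)

hf_quant_config = hf_config.get("quantization_config")
if hf_quant_config is not None:
    # vllm/config.py:_verify_quantization assumes this key exists.
    # The Baichuan2 4-bit checkpoint's quantization_config has no
    # "quant_method" entry, so this lookup raises KeyError.
    hf_quant_method = str(hf_quant_config["quant_method"]).lower()
```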

Could anyone tell me how to fix this?

liudaotan commented 10 months ago

Same question here. Can anybody help?

David-Lee-1990 commented 9 months ago

Me too.

dltraveler commented 9 months ago

From the docs for vLLM's quantization parameter:

> The method used to quantize the model weights. Currently, we support "awq". If None, we assume the model weights are not quantized and use dtype to determine the data type of the weights.

In other words, vLLM can only infer a supported method when the checkpoint's quantization_config names one. The Baichuan2 4-bit checkpoint uses Baichuan's own quantization scheme and declares no quant_method, hence the KeyError.
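If it helps, here is a quick pre-flight check you can run before launching the server. It is a minimal sketch: the config.json path is taken from the command in the original report, and the supported set is just "awq" per the docs quoted above.

```python
import json

# Supported quantization methods, per the docs quoted above.
SUPPORTED = {"awq"}

# Path taken from the original report's --model argument.
with open("./baichuan-inc/Baichuan2-13B-Chat-4bits/config.json") as f:
    quant_cfg = json.load(f).get("quantization_config", {})

method = str(quant_cfg.get("quant_method", "")).lower()
if method in SUPPORTED:
    print(f"OK: launch with --quantization {method}")
else:
    # Baichuan's in-house 4-bit scheme is not supported, so switch to
    # an unquantized (or AWQ-quantized) checkpoint instead.
    print("No supported quant_method declared in quantization_config.")
```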

jim1997 commented 9 months ago

> The method used to quantize the model weights. Currently, we support "awq". If None, we assume the model weights are not quantized and use dtype to determine the data type of the weights.

Got it! Thanks for the reply!