Same question here, can anybody help?
me too
LLM model parameter quantization: The method used to quantize the model weights. Currently, we support "awq". If None, we assume the model weights are not quantized and use dtype to determine the data type of the weights.
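If the checkpoint's weights are AWQ-quantized, the method can also be passed explicitly on the command line. A minimal sketch, assuming the same local checkpoint path as in the report below:

python3 -m vllm.entrypoints.openai.api_server --model ./baichuan-inc/Baichuan2-13B-Chat-4bits --quantization awq --trust-remote-code

Depending on the vLLM version, the checkpoint's config.json may additionally need a quant_method entry inside its quantization_config for the check in vllm/config.py to pass.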
Got it! Thanks for the reply!
I was running the following command to start an API server with a pre-downloaded Baichuan model: python3 -m vllm.entrypoints.openai.api_server --model ./baichuan-inc/Baichuan2-13B-Chat-4bits --trust-remote-code
However, this error occurred:
INFO 12-04 01:53:01 api_server.py:638] args: Namespace(allow_credentials=False, allowed_headers=['*'], allowed_methods=['*'], allowed_origins=['*'], block_size=16, disable_log_requests=False, disable_log_stats=False, download_dir=None, dtype='auto', engine_use_ray=False, gpu_memory_utilization=0.9, host=None, load_format='auto', max_log_len=None, max_model_len=None, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, model='./baichuan-inc/Baichuan2-13B-Chat-4bits', pipeline_parallel_size=1, port=8000, quantization=None, revision=None, seed=0, served_model_name=None, swap_space=4, tensor_parallel_size=1, tokenizer=None, tokenizer_mode='auto', tokenizer_revision=None, trust_remote_code=True, worker_use_ray=False)
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/openai/api_server.py", line 646, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 480, in from_engine_args
    engine_configs = engine_args.create_engine_configs()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/arg_utils.py", line 187, in create_engine_configs
    model_config = ModelConfig(self.model, self.tokenizer,
  File "/usr/local/lib/python3.8/dist-packages/vllm/config.py", line 97, in __init__
    self._verify_quantization()
  File "/usr/local/lib/python3.8/dist-packages/vllm/config.py", line 125, in _verify_quantization
    hf_quant_method = str(hf_quant_config["quant_method"]).lower()
KeyError: 'quant_method'
Could anyone tell me how to fix this?
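For anyone hitting the same KeyError: the last traceback frame shows vLLM reading quant_method out of the checkpoint's quantization_config, so it is worth checking what the local config.json actually contains. A minimal diagnostic sketch (not part of vLLM; the path is the one from the command above):

import json

# Load the checkpoint's HF config; vllm/config.py reads "quant_method"
# from its "quantization_config" section.
with open("./baichuan-inc/Baichuan2-13B-Chat-4bits/config.json") as f:
    hf_config = json.load(f)

print(hf_config.get("quantization_config"))
# If this prints a dict with no "quant_method" key, the KeyError above is
# expected; adding the key (presumably "awq" for this 4-bit checkpoint) is
# one way to satisfy the check.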