Try `--tokenizer-mode=slow`?
same
You'd have to specify what model you are trying to load.
Maybe the repo doesn't contain the `quant_config.json` file?
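For reference, the AWQ loader in this vLLM version looks for a standalone quantization config file in the model directory. Here is a quick sketch for checking whether a checkpoint actually ships one (the directory path is taken from the log below; the candidate file names are assumptions based on common AWQ/GPTQ conventions):

```python
import json
from pathlib import Path

# Directory passed to --model in the command below.
model_dir = Path("/data5/llama/models_hf/13B")

# Older AWQ checkpoints ship a standalone quant_config.json; GPTQ checkpoints
# use quantize_config.json; newer exports embed a "quantization_config" block
# in config.json. Report which, if any, are present.
for name in ("quant_config.json", "quantize_config.json"):
    print(name, "found" if (model_dir / name).exists() else "missing")

config = json.loads((model_dir / "config.json").read_text())
print("quantization_config in config.json:", "quantization_config" in config)
```

If none of these are present, the checkpoint is not an AWQ model and `--quantization awq` will fail exactly as shown below.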
`--max-model-len 2048`
vLLM does not quantize models for you. If the model you are trying to load isn't already quantized, it won't work, which appears to be what is happening here.
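If you do want to serve this checkpoint with `--quantization awq`, it has to be quantized ahead of time, for example with AutoAWQ. A rough sketch, assuming AutoAWQ is installed; the output path and quantization settings are illustrative, not taken from this issue:

```python
# pip install autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/data5/llama/models_hf/13B"      # plain fp16 Llama checkpoint
quant_path = "/data5/llama/models_hf/13B-awq"  # illustrative output directory

# Typical 4-bit AWQ settings; adjust group size / kernel version as needed.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate and quantize, then save the quantized weights together with the
# quantization config that vLLM's AWQ loader expects to find.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

After that, point `--model` at the quantized directory instead of the original fp16 one.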
```
(songdh) [root@localhost server_llm]# python -m vllm.entrypoints.api_server --model $model_path --tokenizer $model_path --tensor-parallel-size $GPUS --dtype auto --port $port --host 0.0.0.0 --gpu-memory-utilization 0.9 --quantization awq --dtype float16 --load-format auto &
[1] 86402
(songdh) [root@localhost server_llm]# WARNING 11-25 10:47:23 config.py:398] Casting torch.bfloat16 to torch.float16.
WARNING 11-25 10:47:23 config.py:140] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 11-25 10:47:23 llm_engine.py:72] Initializing an LLM engine with config: model='/data5/llama/models_hf/13B', tokenizer='/data5/llama/models_hf/13B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 11-25 10:47:23 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Traceback (most recent call last):
  File "/root/anaconda3/envs/songdh/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/songdh/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/entrypoints/api_server.py", line 80, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 486, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 305, in _init_engine
    return engine_class(*args, **kwargs)
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self._init_workers(distributed_init_method)
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 142, in _init_workers
    self._run_workers(
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 700, in _run_workers
    output = executor(*args, **kwargs)
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/worker/worker.py", line 70, in init_model
    self.model = get_model(self.model_config)
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/model_executor/model_loader.py", line 67, in get_model
    quant_config = get_quant_config(model_config.quantization,
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/model_executor/weight_utils.py", line 114, in get_quant_config
    raise ValueError(f"Cannot find the config file for {quantization}")
ValueError: Cannot find the config file for awq
```
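The traceback above ends in `get_quant_config`, which cannot find an AWQ config in `/data5/llama/models_hf/13B` because that directory holds a plain fp16 model. Either drop `--quantization awq` for this checkpoint, or point vLLM at an AWQ-quantized one. A minimal offline sanity check with vLLM's Python API, assuming the quantized path from the sketch above:

```python
from vllm import LLM, SamplingParams

# Illustrative path to an AWQ-quantized checkpoint (see the AutoAWQ sketch above).
llm = LLM(model="/data5/llama/models_hf/13B-awq",
          quantization="awq",
          dtype="float16")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```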