Try `--tokenizer-mode=slow`?
same
You'd have to specify what model you are trying to load.
Maybe the repo doesn't contain the `quant_config.json` file?
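For reference, the AWQ loader in this vLLM version looks for a standalone quantization config file in the model directory. Here is a quick sketch for checking whether a checkpoint actually ships one (the directory path is taken from the log below; the candidate file names are assumptions based on common AWQ/GPTQ conventions):

```python
import json
from pathlib import Path

# Directory passed to --model in the command below.
model_dir = Path("/data5/llama/models_hf/13B")

# Older AWQ checkpoints ship a standalone quant_config.json; GPTQ checkpoints
# use quantize_config.json; newer exports embed a "quantization_config" block
# in config.json. Report which, if any, are present.
for name in ("quant_config.json", "quantize_config.json"):
    print(name, "found" if (model_dir / name).exists() else "missing")

config = json.loads((model_dir / "config.json").read_text())
print("quantization_config in config.json:", "quantization_config" in config)
```

If none of these are present, the checkpoint is not an AWQ model and `--quantization awq` will fail exactly as shown below.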
`--max-model-len 2048`
vLLM does not quantize models for you. If the model you are trying to load isn't already quantized, it won't work, which appears to be what is happening here.
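If you do want to serve this checkpoint with `--quantization awq`, it has to be quantized ahead of time, for example with AutoAWQ. A rough sketch, assuming AutoAWQ is installed; the output path and quantization settings are illustrative, not taken from this issue:

```python
# pip install autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/data5/llama/models_hf/13B"      # plain fp16 Llama checkpoint
quant_path = "/data5/llama/models_hf/13B-awq"  # illustrative output directory

# Typical 4-bit AWQ settings; adjust group size / kernel version as needed.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate and quantize, then save the quantized weights together with the
# quantization config that vLLM's AWQ loader expects to find.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

After that, point `--model` at the quantized directory instead of the original fp16 one.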
```
(songdh) [root@localhost server_llm]# python -m vllm.entrypoints.api_server --model $model_path --tokenizer $model_path --tensor-parallel-size $GPUS --dtype auto --port $port --host 0.0.0.0 --gpu-memory-utilization 0.9 --quantization awq --dtype float16 --load-format auto &
[1] 86402
(songdh) [root@localhost server_llm]# WARNING 11-25 10:47:23 config.py:398] Casting torch.bfloat16 to torch.float16.
WARNING 11-25 10:47:23 config.py:140] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 11-25 10:47:23 llm_engine.py:72] Initializing an LLM engine with config: model='/data5/llama/models_hf/13B', tokenizer='/data5/llama/models_hf/13B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 11-25 10:47:23 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Traceback (most recent call last):
  File "/root/anaconda3/envs/songdh/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/songdh/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/entrypoints/api_server.py", line 80, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 486, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 305, in _init_engine
    return engine_class(*args, **kwargs)
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self._init_workers(distributed_init_method)
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 142, in _init_workers
    self._run_workers(
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 700, in _run_workers
    output = executor(*args, **kwargs)
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/worker/worker.py", line 70, in init_model
    self.model = get_model(self.model_config)
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/model_executor/model_loader.py", line 67, in get_model
    quant_config = get_quant_config(model_config.quantization,
  File "/root/anaconda3/envs/songdh/lib/python3.10/site-packages/vllm/model_executor/weight_utils.py", line 114, in get_quant_config
    raise ValueError(f"Cannot find the config file for {quantization}")
ValueError: Cannot find the config file for awq
```
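The traceback above ends in `get_quant_config`, which cannot find an AWQ config in `/data5/llama/models_hf/13B` because that directory holds a plain fp16 model. Either drop `--quantization awq` for this checkpoint, or point vLLM at an AWQ-quantized one. A minimal offline sanity check with vLLM's Python API, assuming the quantized path from the sketch above:

```python
from vllm import LLM, SamplingParams

# Illustrative path to an AWQ-quantized checkpoint (see the AutoAWQ sketch above).
llm = LLM(model="/data5/llama/models_hf/13B-awq",
          quantization="awq",
          dtype="float16")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```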