CodeLlama-13B does not look like it is supported yet; please check the vLLM README to confirm which architectures are supported.
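For reference, newer vLLM releases expose a model registry you can query to see which architectures your install supports. The import path and API have moved between versions, so treat this as a sketch rather than an exact recipe:

```python
# Sketch: list the model architectures a given vLLM install supports.
# Note: ModelRegistry and its import path differ across vLLM versions;
# adjust for the release you actually have installed.
from vllm.model_executor.models import ModelRegistry

for arch in sorted(ModelRegistry.get_supported_archs()):
    print(arch)
```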
You can try our PR branch for running SmoothQuant LLaMA in vLLM: https://github.com/vllm-project/vllm/pull/1508
Quantisation is supported via GPTQ, AWQ, and SqueezeLLM.
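As a rough illustration, serving a pre-quantized checkpoint with the offline API looks like this (the model id below is a placeholder, not a recommendation):

```python
# Sketch: loading a pre-quantized checkpoint with vLLM's offline LLM API.
# "TheBloke/CodeLlama-13B-AWQ" is an illustrative model id only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/CodeLlama-13B-AWQ",  # placeholder quantized checkpoint
    quantization="awq",                  # or "gptq" / "squeezellm"
)
outputs = llm.generate(["def fib(n):"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```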
Hi:
I have completed the model conversion and export with SmoothQuant, but when I use vLLM to load the model and run inference, I get the following error:
```
INFO 12-05 09:00:58 tokenizer.py:32] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Traceback (most recent call last):
  File "./vllm/entrypoints/api_server.py", line 80, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 495, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 109, in __init__
    self._init_workers(distributed_init_method)
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 145, in _init_workers
    self._run_workers(
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 750, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 724, in _run_workers_in_batch
    output = executor(*args, **kwargs)
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/worker/worker.py", line 72, in load_model
    self.model_runner.load_model()
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 36, in load_model
    self.model = get_model(self.model_config)
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/model_executor/model_loader.py", line 98, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 328, in load_weights
    param = params_dict[name.replace(weight_name, param_name)]
KeyError: 'model.layers.0.self_attn.qkv_proj.bias'
```
What is the reason?
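For anyone hitting the same KeyError: the traceback shows that the exported checkpoint contains a tensor name (here a fused qkv_proj bias) that vLLM's stock LLaMA `load_weights` has no matching parameter for, which is why the PR branch above is needed. A quick way to spot such mismatches is to list the checkpoint's tensor names; the filename below is a placeholder for the SmoothQuant export:

```python
# Diagnostic sketch: inspect an exported checkpoint for tensor names the
# stock vLLM LLaMA definition may not register (e.g. attention biases,
# which vanilla LLaMA weights do not have).
# "pytorch_model.bin" is a placeholder for the SmoothQuant export file.
import torch

state_dict = torch.load("pytorch_model.bin", map_location="cpu")
for key in sorted(state_dict.keys()):
    if key.endswith(".bias"):
        print("extra bias tensor:", key)
```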