vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Support loading Int8-quantized (SmoothQuant) CodeLlama-13B? #1931

Closed shatealaboxiaowang closed 6 months ago

shatealaboxiaowang commented 10 months ago

Hi,

I have converted and exported the model with SmoothQuant, but when I load it with vLLM and run inference, I get the following error:

INFO 12-05 09:00:58 tokenizer.py:32] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Traceback (most recent call last):
  File "./vllm/entrypoints/api_server.py", line 80, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 495, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 109, in __init__
    self._init_workers(distributed_init_method)
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 145, in _init_workers
    self._run_workers(
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 750, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 724, in _run_workers_in_batch
    output = executor(*args, **kwargs)
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/worker/worker.py", line 72, in load_model
    self.model_runner.load_model()
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 36, in load_model
    self.model = get_model(self.model_config)
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/model_executor/model_loader.py", line 98, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/home/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 328, in load_weights
    param = params_dict[name.replace(weight_name, param_name)]
KeyError: 'model.layers.0.self_attn.qkv_proj.bias'

What is the reason?
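The KeyError appears to come from the SmoothQuant export containing bias tensors that the stock vLLM LLaMA weight loader has no matching parameter for, so the lookup on qkv_proj.bias fails. A minimal diagnostic sketch, not from the thread, assuming the export is saved as standard HF-style PyTorch *.bin shards (the path is a placeholder), to list the bias entries in the checkpoint:

# Diagnostic sketch: list every ".bias" tensor in the exported checkpoint to
# confirm it carries bias terms that the stock LLaMA loader in
# vllm/model_executor/models/llama.py does not expect.
# Assumes HF-style PyTorch *.bin shards; the path below is a placeholder.
import glob
import torch

ckpt_dir = "/path/to/smoothquant-codellama-13b"  # placeholder path
for shard in sorted(glob.glob(f"{ckpt_dir}/*.bin")):
    state_dict = torch.load(shard, map_location="cpu")
    for name in state_dict:
        if name.endswith(".bias"):
            print(shard, name)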

777ki commented 10 months ago

CodeLlama-13B does not look like it is supported yet. Please check the vLLM README to see which architectures are supported.
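As a quick check (a sketch, not from this thread), the architecture class declared by the exported checkpoint can be read from its config and compared against the supported-models list in the vLLM README; the model path is a placeholder:

# Read the architecture declared by the checkpoint's config.json and compare it
# against the "Supported Models" table in the vLLM README.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("/path/to/smoothquant-codellama-13b")  # placeholder path
print(config.architectures)  # CodeLlama checkpoints typically report ['LlamaForCausalLM']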

AniZpZ commented 10 months ago

You can try our PR branch for running SmoothQuant LLaMA in vLLM: https://github.com/vllm-project/vllm/pull/1508

hmellor commented 6 months ago

Quantisation is supported via GPTQ, AWQ, and SqueezeLLM.
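For reference, a minimal sketch of loading an already-quantized checkpoint through vLLM's offline API, assuming an AWQ export is available; the model repo and prompt below are illustrative placeholders, not from this thread:

# Offline-inference sketch with a quantized checkpoint; model repo and prompt
# are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/CodeLlama-13B-AWQ", quantization="awq")
outputs = llm.generate(["def fibonacci(n):"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)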