Open zuosong-peng opened 4 months ago
@chooper1 Can you check this? Thanks. @SoleMY shows that the SqueezeLLM-modified transformers code can load and run a SqueezeLLM-quantized model, but vLLM fails at the load stage.
I see the same error in another model: KeyError: 'model.layers.0.self_attn.qkv_proj.rows'
It's actually located here: 'model.layers.0.self_attn.qkv_proj.qweight'
@catid Did you get any model to run with SqueezeLLM in vLLM? I believe this feature was never tested after the merge into vLLM and should be removed.
If you use that key, model.layers.0.self_attn.qkv_proj.qweight, it definitely reports this bug:
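Before blaming the loader, it is worth confirming which keys the checkpoint actually contains. Below is a minimal sketch; the tensor shapes and the lookup_table key name are illustrative assumptions, not taken from the real SqueezeLLM format. On a real checkpoint you would build the dict with torch.load("pytorch_model.bin", map_location="cpu") instead.

```python
import torch

# Stand-in state dict mirroring the key layout described in this thread
# (assumption: the checkpoint stores packed weights under
# '...qkv_proj.qweight' plus a dequantization lookup table, not '...rows').
state_dict = {
    "model.layers.0.self_attn.qkv_proj.qweight":
        torch.zeros(128, 4096, dtype=torch.int32),
    "model.layers.0.self_attn.qkv_proj.lookup_table":
        torch.zeros(4096, 16),
}

# List every qkv_proj key and its shape to see what the loader will find.
qkv_keys = [k for k in state_dict if "qkv_proj" in k]
for k in qkv_keys:
    print(k, tuple(state_dict[k].shape))
```

If the printout shows qweight-style keys while the loader asks for a rows key (or vice versa), the checkpoint format and the vLLM loader disagree.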
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ryan/vllm/benchmarks/benchmark_latency.py", line 195, in <module>
[rank0]:   File "/home/ryan/vllm/benchmarks/benchmark_latency.py", line 20, in main
[rank0]:     llm = LLM(model=args.model,
[rank0]:   File "/home/ryan/vllm/vllm/entrypoints/llm.py", line 123, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/home/ryan/vllm/vllm/engine/llm_engine.py", line 292, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/home/ryan/vllm/vllm/engine/llm_engine.py", line 160, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/home/ryan/vllm/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:   File "/home/ryan/vllm/vllm/executor/gpu_executor.py", line 23, in _init_executor
[rank0]:   File "/home/ryan/vllm/vllm/executor/gpu_executor.py", line 69, in _init_non_spec_worker
[rank0]:   File "/home/ryan/vllm/vllm/worker/worker.py", line 118, in load_model
[rank0]:   File "/home/ryan/vllm/vllm/worker/model_runner.py", line 164, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/home/ryan/vllm/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/home/ryan/vllm/vllm/model_executor/model_loader/loader.py", line 224, in load_model
[rank0]:   File "/home/ryan/vllm/vllm/model_executor/models/llama.py", line 412, in load_weights
[rank0]:     weight_loader(param, loaded_weight, shard_id)
[rank0]:   File "/home/ryan/vllm/vllm/model_executor/layers/linear.py", line 561, in weight_loader
[rank0]:     loaded_weight = loaded_weight.narrow(output_dim, start_idx,
[rank0]: IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

How do you fix it?
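The IndexError at the end of the trace can be reproduced in isolation. This sketch only illustrates the mismatch (a 1-D tensor being sharded along a dimension it does not have), not a fix; the shape is a made-up stand-in for one of the quantization tensors.

```python
import torch

# The failing call in vllm/model_executor/layers/linear.py is roughly
#   loaded_weight = loaded_weight.narrow(output_dim, start_idx, shard_size)
# "expected to be in range of [-1, 0]" means loaded_weight is 1-D, so
# narrowing along dimension 1 is out of range. Minimal reproduction:
vec = torch.zeros(4096)        # stand-in for a 1-D quantization tensor
msg = ""
try:
    vec.narrow(1, 0, 2048)     # dim=1 does not exist on a 1-D tensor
except IndexError as err:
    msg = str(err)
print(msg)
```

So the loader is treating a 1-D quantization tensor as if it were a 2-D weight matrix, which is consistent with the SqueezeLLM path never having been exercised after the merge.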
This is my environment:
I quantized my fine-tuned llama-7B model with SqueezeLLM and want to load it with vLLM; below are my code and traceback.
AutoModelForCausalLM can load the SqueezeLLM model successfully, but vLLM fails to load it with an error.
Stacktrace