zuosong-peng opened this issue 8 months ago
@chooper1 Can you check this? Thanks. @SoleMY shows that the SqueezeLLM-modified transformers code can load and run the SqueezeLLM-quantized model, but vLLM fails at the load stage.
I see the same error in another model: KeyError: 'model.layers.0.self_attn.qkv_proj.rows'
It's actually located here: 'model.layers.0.self_attn.qkv_proj.qweight'
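If it helps to double-check, here is a minimal sketch for listing what the checkpoint actually stores (the path is hypothetical, and this assumes the quantized weights were saved as a single pytorch_model.bin — adjust to your checkpoint layout):

```python
import torch

# Hypothetical checkpoint path; point this at your SqueezeLLM output dir.
state_dict = torch.load("llama-7b-squeezellm/pytorch_model.bin",
                        map_location="cpu")

# List the tensors stored for the first attention block, to see whether
# the checkpoint has `...qkv_proj.rows` or `...qkv_proj.qweight`.
for name, tensor in state_dict.items():
    if name.startswith("model.layers.0.self_attn"):
        print(name, tuple(tensor.shape))
```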
@catid Did you get any model to run with SqueezeLLM/vLLM? I believe this feature was never tested after the merge into vLLM and should be removed.
If you use the key model.layers.0.self_attn.qkv_proj.qweight, it reliably reproduces this bug:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ryan/vllm/benchmarks/benchmark_latency.py", line 195, in <module>
[rank0]:   File "/home/ryan/vllm/benchmarks/benchmark_latency.py", line 20, in main
[rank0]:     llm = LLM(model=args.model,
[rank0]:   File "/home/ryan/vllm/vllm/entrypoints/llm.py", line 123, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/home/ryan/vllm/vllm/engine/llm_engine.py", line 292, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/home/ryan/vllm/vllm/engine/llm_engine.py", line 160, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/home/ryan/vllm/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:   File "/home/ryan/vllm/vllm/executor/gpu_executor.py", line 23, in _init_executor
[rank0]:   File "/home/ryan/vllm/vllm/executor/gpu_executor.py", line 69, in _init_non_spec_worker
[rank0]:   File "/home/ryan/vllm/vllm/worker/worker.py", line 118, in load_model
[rank0]:   File "/home/ryan/vllm/vllm/worker/model_runner.py", line 164, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/home/ryan/vllm/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/home/ryan/vllm/vllm/model_executor/model_loader/loader.py", line 224, in load_model
[rank0]:   File "/home/ryan/vllm/vllm/model_executor/models/llama.py", line 412, in load_weights
[rank0]:     weight_loader(param, loaded_weight, shard_id)
[rank0]:   File "/home/ryan/vllm/vllm/model_executor/layers/linear.py", line 561, in weight_loader
[rank0]:     loaded_weight = loaded_weight.narrow(output_dim, start_idx,
[rank0]: IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

How do you fix it?
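For context on what the IndexError itself means: `Tensor.narrow(dim, start, length)` with `dim=1` fails on a tensor that only has one dimension, which is presumably what happens here when a 1-D SqueezeLLM quantization tensor gets routed through vLLM's dense-weight sharding path. A minimal illustration:

```python
import torch

w = torch.randn(4096)   # a 1-D tensor, e.g. a quantization lookup table
w.narrow(0, 0, 2048)    # OK: dimension 0 exists
w.narrow(1, 0, 2048)    # IndexError: Dimension out of range
                        #   (expected to be in range of [-1, 0], but got 1)
```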
This is my env version:
I used SqueezeLLM to quantize my trained Llama-7B model and want to load it with vLLM; my code and the traceback are below.
AutoModelForCausalLM can load the SqueezeLLM model successfully.
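A minimal sketch of the transformers-side load that works (the checkpoint path is hypothetical, and this assumes the SqueezeLLM-modified transformers code is installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "llama-7b-squeezellm"  # hypothetical local checkpoint path

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)  # loads without error
```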
But vLLM fails to load it with an IndexError.
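And a sketch of the vLLM side that fails (same hypothetical checkpoint path; `quantization="squeezellm"` is the flag vLLM exposes for this format):

```python
from vllm import LLM

# This raises the IndexError shown in the traceback above.
llm = LLM(model="llama-7b-squeezellm", quantization="squeezellm")
```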
Stacktrace: see the full [rank0] traceback pasted above.