vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

[Bug]: RuntimeError in gptq_marlin_24_gemm #8654

Open leoyuppieqnew opened 1 month ago

leoyuppieqnew commented 1 month ago

Your current environment

Python 3.8, 4x NVIDIA L20 GPUs, vLLM 0.5.4

Model Input Dumps

No response

🐛 Describe the bug

$ python -m vllm.entrypoints.api_server --model='/mntfn/yanyi/Qwen2-7B-Instruct_24_w4a16/stage_quantization' --max-model-len=16000 --tensor-parallel-size=4 --use-v2-block-manager --enable-prefix-caching

rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
rank0:   engine = cls(
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 631, in __init__
rank0:   self.engine = self._init_engine(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 830, in _init_engine
rank0:   return engine_class(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
rank0:   super().__init__(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 283, in __init__
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 389, in _initialize_kv_caches
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
rank0:   num_blocks = self._run_workers("determine_num_available_blocks", )
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
rank0:   driver_worker_output = driver_worker_method(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank0:   return func(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/worker/worker.py", line 195, in determine_num_available_blocks
rank0: File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank0:   return func(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 1110, in profile_run
rank0:   self.execute_model(model_input, kv_caches, intermediate_tensors)
rank0: File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank0:   return func(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 1539, in execute_model
rank0:   hidden_or_intermediate_states = model_executable(
rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
rank0:   return self._call_impl(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
rank0:   return forward_call(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 361, in forward
rank0:   hidden_states = self.model(input_ids, positions, kv_caches,
rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
rank0:   return self._call_impl(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
rank0:   return forward_call(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 277, in forward
rank0:   hidden_states, residual = layer(
rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
rank0:   return self._call_impl(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
rank0:   return forward_call(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 210, in forward
rank0:   hidden_states = self.self_attn(
rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
rank0:   return self._call_impl(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
rank0:   return forward_call(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 154, in forward
rank0:   qkv, _ = self.qkv_proj(hidden_states)
rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
rank0:   return self._call_impl(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
rank0:   return forward_call(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/model_executor/layers/linear.py", line 359, in forward
rank0:   output_parallel = self.quant_method.apply(self, input_, bias)
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 358, in apply
rank0:   return scheme.apply_weights(layer, x, bias=bias)
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a16_24.py", line 144, in apply_weights
rank0:   output_2d = ops.gptq_marlin_24_gemm(x_2d, qweight, meta, scales,
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/_custom_ops.py", line 28, in wrapper
rank0:   return fn(*args, **kwargs)
rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/_custom_ops.py", line 222, in gptq_marlin_24_gemm
rank0:   return torch.ops._C.gptq_marlin_24_gemm(a, b_q_weight, b_meta, b_scales,
rank0: File "/opt/conda/lib/python3.8/site-packages/torch/ops.py", line 1061, in __call__
rank0:   return self._op(*args, **(kwargs or {}))
rank0: RuntimeError: prob_m = 1152 is not divisible by thread_m = 512


youkaichao commented 1 month ago

cc @mgoin

mgoin commented 1 month ago

I'm sorry to say, but I believe this is an intentional limitation of the sparse Marlin kernels, for performance reasons. Please try a smaller TP configuration so the model's layers are split less: for instance, 1152 * 4 / 512 = 9, so it should work for TP=1.
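
As a quick check on that arithmetic, here is a small sketch. It assumes the 1152 is the per-rank fused QKV output dimension of Qwen2-7B (28 query heads plus 2 x 4 KV heads at head_dim 128, sharded across TP ranks), which matches the qkv_proj frame in the traceback above:

```python
# Sketch of the shape arithmetic behind the error, assuming Qwen2-7B-Instruct's
# standard attention shapes and that prob_m is the per-rank fused QKV output dim.
head_dim = 128
num_q_heads = 28
num_kv_heads = 4

qkv_out = (num_q_heads + 2 * num_kv_heads) * head_dim  # 4608 for the full model
thread_m = 512  # tile size reported by the marlin 2:4 kernel error

for tp in (1, 2, 4):
    per_rank = qkv_out // tp
    print(f"TP={tp}: per-rank qkv dim = {per_rank}, "
          f"divisible by {thread_m}: {per_rank % thread_m == 0}")

# TP=4 -> 1152, not a multiple of 512, hence the RuntimeError.
# TP=1 -> 4608 = 9 * 512, which satisfies the constraint.
```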

leoyuppieqnew commented 1 month ago

> I'm sorry to say, but I believe this is an intentional limitation of the sparse Marlin kernels, for performance reasons. Please try a smaller TP configuration so the model's layers are split less: for instance, 1152 * 4 / 512 = 9, so it should work for TP=1.

Thank you for your reply. I tried TP=1 and it no longer reports an error, but the output is very strange: it is all "!". What's going on?

mgoin commented 1 month ago

I assume you've made the model yourself since we haven't produced a 2:4 version of Qwen2, which means you probably pruned the model in one shot. It is entirely possible that the model has lost too much accuracy in that case. When we (Neural Magic) produce 2:4 models we usually perform a type of training-aware pruning+retraining. If you haven't evaluated the model before deploying with vLLM, I would recommend doing so.
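
A minimal offline smoke test with vLLM's Python API is one way to separate checkpoint problems from serving problems. This is only a sketch: the model path is the one from this issue, and the prompt and max_model_len are arbitrary choices.

```python
# Quick offline sanity check of the checkpoint with vLLM, before serving it.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/mntfn/yanyi/Qwen2-7B-Instruct_24_w4a16/stage_quantization",
    max_model_len=4096,  # a small context window is enough for a smoke test
)
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["What is the capital of France?"], params)
print(outputs[0].outputs[0].text)
# Degenerate output here (e.g. a stream of "!") points at the checkpoint itself
# rather than at the API server configuration.
```

For an actual accuracy number, running an evaluation harness such as lm-evaluation-harness against the checkpoint before deploying is the more rigorous option.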

leoyuppieqnew commented 1 month ago

> I assume you've made the model yourself since we haven't produced a 2:4 version of Qwen2, which means you probably pruned the model in one shot. It is entirely possible that the model has lost too much accuracy in that case. When we (Neural Magic) produce 2:4 models we usually perform a type of training-aware pruning+retraining. If you haven't evaluated the model before deploying with vLLM, I would recommend doing so.

Yes, I built the sparse model myself with llm-compressor. Maybe I should try retraining it. Thanks~
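
For reference, a one-shot 2:4 pass with llm-compressor looks roughly like the following. This is a sketch based on the project's SparseGPT examples: the modifier arguments, the calibration dataset, and the oneshot keywords are assumptions that may differ between llm-compressor versions, and the later w4a16 quantization stage is omitted.

```python
# Rough sketch of a one-shot 2:4 pruning pass with llm-compressor.
# Argument names follow the project's published SparseGPT examples and should be
# checked against the docs for the installed llm-compressor version.
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.transformers import oneshot

recipe = SparseGPTModifier(
    sparsity=0.5,
    mask_structure="2:4",  # the semi-structured pattern the Marlin 2:4 kernels expect
    sequential_update=True,
)

oneshot(
    model="Qwen/Qwen2-7B-Instruct",
    dataset="open_platypus",            # example calibration dataset (assumption)
    recipe=recipe,
    output_dir="Qwen2-7B-Instruct-2of4",
    num_calibration_samples=512,
    max_seq_length=2048,
)
```

As noted above, one-shot pruning alone often costs too much accuracy; a sparse finetuning stage after pruning, before the w4a16 quantization step, is usually what recovers it.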