vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: RuntimeError in gptq_marlin_24_gemm #8654

Open leoyuppieqnew opened 1 day ago

leoyuppieqnew commented 1 day ago

Your current environment

Python 3.8, 4 × NVIDIA L20 GPUs, vLLM 0.5.4

Model Input Dumps

No response

🐛 Describe the bug

$ python -m vllm.entrypoints.api_server --model='/mntfn/yanyi/Qwen2-7B-Instruct_24_w4a16/stage_quantization' --max-model-len=16000 --tensor-parallel-size=4 --use-v2-block-manager --enable-prefix-caching

rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args rank0: engine = cls( rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 631, in init rank0: self.engine = self._init_engine(*args, kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 830, in _init_engine rank0: return engine_class(*args, *kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 267, in init rank0: super().init(args, kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 283, in init

rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 389, in _initialize_kv_caches

rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks rank0: num_blocks = self._run_workers("determine_num_available_blocks", ) rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers rank0: driver_worker_output = driver_worker_method(*args, *kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context rank0: return func(args, **kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/worker/worker.py", line 195, in determine_num_available_blocks

rank0: File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context rank0: return func(*args, kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 1110, in profile_run rank0: self.execute_model(model_input, kv_caches, intermediate_tensors) rank0: File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context rank0: return func(*args, *kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 1539, in execute_model rank0: hidden_or_intermediate_states = model_executable( rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl rank0: return self._call_impl(args, kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl rank0: return forward_call(*args, kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 361, in forward rank0: hidden_states = self.model(input_ids, positions, kv_caches, rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl rank0: return self._call_impl(*args, *kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl rank0: return forward_call(args, kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 277, in forward rank0: hidden_states, residual = layer( rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl rank0: return self._call_impl(*args, kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl rank0: return forward_call(*args, *kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 210, in forward rank0: hidden_states = self.self_attn( rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl rank0: return self._call_impl(args, kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl rank0: return forward_call(*args, kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/modelexecutor/models/qwen2.py", line 154, in forward rank0: qkv, = self.qkv_proj(hidden_states) rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl rank0: return self._call_impl(*args, *kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl rank0: return forward_call(args, kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/model_executor/layers/linear.py", line 359, in forward rank0: output_parallel = self.quantmethod.apply(self, input, bias) rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 358, in apply rank0: return scheme.apply_weights(layer, x, bias=bias) rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a16_24.py", line 144, in apply_weights rank0: output_2d = ops.gptq_marlin_24_gemm(x_2d, qweight, meta, scales, rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/_custom_ops.py", line 28, in wrapper rank0: return 
fn(*args, *kwargs) rank0: File "/opt/conda/lib/python3.8/site-packages/vllm/_custom_ops.py", line 222, in gptq_marlin_24_gemm rank0: return torch.ops._C.gptq_marlin_24_gemm(a, b_q_weight, b_meta, b_scales, rank0: File "/opt/conda/lib/python3.8/site-packages/torch/ops.py", line 1061, in call rank0: return self._op(args, **(kwargs or {})) rank0: RuntimeError: prob_m = 1152 is not divisible by thread_m = 512
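
For reference, the 1152 in the error matches the per-rank output width of qkv_proj after tensor-parallel sharding. A minimal sketch of the arithmetic, assuming Qwen2-7B's published config (hidden_size 3584, 28 query heads, 4 KV heads, head_dim 128) and the thread_m = 512 constraint quoted in the error message (this is back-of-envelope reasoning, not vLLM API):

# Values below come from the Qwen2-7B config and the error message above.
hidden_size = 3584
num_query_heads = 28
num_kv_heads = 4
head_dim = hidden_size // num_query_heads            # 128

# Fused QKV projection width before sharding: Q + K + V
qkv_out = hidden_size + 2 * num_kv_heads * head_dim  # 3584 + 1024 = 4608

tp_size = 4
per_rank = qkv_out // tp_size                        # 4608 / 4 = 1152
thread_m = 512                                       # tile size from the error

print(per_rank % thread_m)                           # 128 -> not divisible, kernel raises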


youkaichao commented 15 hours ago

cc @mgoin

mgoin commented 15 hours ago

I'm sorry to say, but I believe this is an intentional limitation of the sparse Marlin kernels for performance. Please try a smaller TP configuration so the model's layers are split up less - for instance, 1152 * 4 / 512 = 9, so it should work with TP=1.
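
To see which TP sizes clear the constraint, here is a quick sketch under the same assumptions as above (a 4608-wide fused QKV projection and a multiple-of-512 requirement on the per-rank width; every other sharded linear layer would need to pass the same check):

qkv_out = 4608   # fused QKV width for Qwen2-7B, per the sketch above
thread_m = 512

for tp in (1, 2, 4, 8):
    per_rank = qkv_out // tp
    status = "ok" if per_rank % thread_m == 0 else "fails divisibility"
    print(f"TP={tp}: per-rank width {per_rank} -> {status}")

# Only TP=1 (4608 = 9 * 512) passes, matching the suggestion above.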