vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Apache License 2.0

QWen2-72b error while starting vllm server, Weight input_size_per_partition = 7392 is not divisible by min_thread_k = 128 #57

Closed hayleyhu closed 1 month ago

hayleyhu commented 1 month ago

Describe the bug: Cannot serve the Qwen2-72B W8A16 compressed model with the vLLM server.

Expected behavior: The model can be served with the vLLM server.

Environment:

  1. OS: Linux
  2. Python version: 3.12
  3. LLM Compressor version or commit hash: I used https://huggingface.co/neuralmagic/Qwen2-72B-Instruct-quantized.w8a16
  4. ML framework version(s): vllm 0.5.1post
  5. Other Python package versions:
  6. Other relevant environment information: A100 80G

To Reproduce: Clone the vLLM repo, then build and run:

docker build -t vllm:tests -f Dockerfile .
docker run --rm --gpus '"device=1,2,3,4"' --shm-size 2g \
    -v /inworld:/inworld \
    -p 8000:8000 \
    --ipc=host \
    vllm:tests \
    --model neuralmagic/Qwen2-72B-Instruct-quantized.w8a16 \
    --max-model-len 8192 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.98 \
    --served-model-name /model/llm \
    --enable-prefix-caching \
    --max-num-seqs 4 \
    --max-num-batched-tokens 32768

Errors:

(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: Weight input_size_per_partition = 7392 is not divisible by min_thread_k = 128., Traceback (most recent call last):
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 122, in load_model
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]     self.model_runner.load_model()
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 148, in load_model
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]     self.model = get_model(
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]     return loader.load_model(model_config=model_config,
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 279, in load_model
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]     model = _initialize_model(model_config, self.load_config,
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 116, in _initialize_model
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]     return model_class(config=model_config.hf_config,
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 311, in __init__
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]     self.model = Qwen2Model(config, cache_config, quant_config)
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 237, in __init__
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]     self.layers = nn.ModuleList([
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 238, in <listcomp>
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]     Qwen2DecoderLayer(config, cache_config, quant_config)
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 180, in __init__
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]     self.mlp = Qwen2MLP(
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 65, in __init__
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]     self.down_proj = RowParallelLinear(intermediate_size,
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 744, in __init__
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]     self.quant_method.create_weights(
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 226, in create_weights
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]     raise ValueError(
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226] ValueError: Weight input_size_per_partition = 7392 is not divisible by min_thread_k = 128.
(VllmWorkerProcess pid=14678) ERROR 08-05 23:29:38 multiproc_worker_utils.py:226]
(The other worker processes, e.g. pid=14679, raised the same traceback.)
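For context on where 7392 comes from: the error's MKN = [32768, 7392, 8192] implies Qwen2-72B's intermediate_size is 29568, and at --tensor-parallel-size 4 the down_proj input dimension is sharded to 29568 / 4 = 7392, which is not a multiple of 128. A minimal sketch of the failing check (the constant name matches vllm's gptq_marlin.py; the rest is illustrative, not vLLM code):

# Illustrative arithmetic behind the ValueError above.
intermediate_size = 29568          # Qwen2-72B down_proj input dim (K)
tensor_parallel_size = 4
GPTQ_MARLIN_MIN_THREAD_K = 128     # constant in vllm's gptq_marlin.py

input_size_per_partition = intermediate_size // tensor_parallel_size
print(input_size_per_partition)    # 7392
print(input_size_per_partition % GPTQ_MARLIN_MIN_THREAD_K)  # 96, i.e. not divisible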

After changing GPTQ_MARLIN_MIN_THREAD_K from 128 to 32 in gptq_marlin.py, I got a different error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 214, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 236, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 313, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
[rank0]:     num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 124, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 844, in profile_run
[rank0]:     self.execute_model(seqs, kv_caches)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749, in execute_model
[rank0]:     hidden_states = model_executable(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 330, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 254, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 216, in forward
[rank0]:     hidden_states = self.mlp(hidden_states)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 77, in forward
[rank0]:     x, _ = self.down_proj(x)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 804, in forward
[rank0]:     output_parallel = self.quant_method.apply(self, input_parallel)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 440, in apply
[rank0]:     output = ops.gptq_marlin_gemm(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 34, in wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 257, in gptq_marlin_gemm
[rank0]:     return torch.ops._C.gptq_marlin_gemm(a, b_q_weight, b_scales, g_idx, perm,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 854, in __call__
[rank0]:     return self_._op(*args, **(kwargs or {}))
[rank0]: RuntimeError: Invalid thread config: max_m_blocks = 0, thread_k = -1, thread_n = -1, num_threads = -1 for MKN = [32768, 7392, 8192] and num_bits = 8, group_size = -1, has_act_order = 0, is_k_full = 1, max_shared_mem = 166912
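Relaxing the create_weights check only moves the failure into the kernel itself: thread_k = -1 in the RuntimeError means the GEMM's config search found no tile shape that divides K = 7392. A rough sketch of that arithmetic, assuming Marlin's candidate thread_k values are 64 and 128 (the real search lives in the CUDA kernel, not in Python):

# Why no valid thread config exists for K = 7392 (assumed tile sizes).
K = 7392
for thread_k in (64, 128):         # assumed Marlin thread_k candidates
    print(thread_k, K % thread_k)  # 64 -> 32, 128 -> 96: neither divides K
# No candidate divides K evenly, so the kernel reports thread_k = -1.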


yzlnew commented 1 month ago

Related issue: https://github.com/vllm-project/llm-compressor/issues/54

hayleyhu commented 1 month ago

I have tried changing every thread_k = 128 to 32, but the server still fails to start.

robertgshaw2-neuralmagic commented 1 month ago

The Marlin kernel has some limitations on the weight shapes it can support.

Unfortunately, the Qwen2 matrix dimensions are not powers of two, so the per-GPU shards lose tile alignment as the tensor-parallel degree grows. For this model, --tensor-parallel-size 2 is the maximum TP size that can run with Marlin.
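A quick way to see which TP sizes keep the down_proj shard tile-aligned (a sketch, assuming 64 is the smallest K-tile Marlin accepts):

# Check which tensor-parallel sizes keep Qwen2-72B's down_proj K dim aligned.
intermediate_size = 29568
for tp in (1, 2, 4, 8):
    shard_k = intermediate_size // tp
    ok = shard_k % 64 == 0         # 64 assumed to be Marlin's smallest thread_k
    print(f"tp={tp}: shard_k={shard_k}, marlin_compatible={ok}")
# tp=1 and tp=2 are aligned; tp=4 (7392) and tp=8 (3696) are not.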

We will:

robertgshaw2-neuralmagic commented 1 month ago

Closing the issue - for kernel support requests, please open an issue in vllm-project/vllm