vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size. #5675

Closed: QuanhuiGuan closed this issue 3 months ago

QuanhuiGuan commented 4 months ago

@youkaichao

Your current environment

My environment: Name: vllm Version: 0.4.2+cu117

🐛 Describe the bug

I quantized the model (Qwen2-72B) with AWQ myself. When I try to start the API service on two GPUs it doesn't work, although one GPU is fine. In some situations I have to use two GPUs to run this service.

Could anyone give me some suggestions?

[screenshot of the ValueError traceback attached]

mgoin commented 4 months ago

@QuanhuiGuan Can you share the config.json with the quantization section here? Maybe there is a group-size issue that is causing uneven sharding.
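
For context on the uneven-sharding point: AWQ quantizes weights in groups along the input dimension, and tensor parallelism splits that same dimension across ranks, so each rank's slice must contain a whole number of groups. Below is a minimal sketch of that constraint, assuming it mirrors the check in vllm/model_executor/layers/quantization/awq.py (the function name here is illustrative, not vLLM source):

# A minimal sketch of the alignment rule behind this error, assuming it
# mirrors the check in vllm/model_executor/layers/quantization/awq.py
# create_weights; this function is illustrative, not the vLLM source.

def assert_awq_shardable(input_size: int, group_size: int, tp_size: int) -> None:
    # AWQ packs the weight's input dimension into groups of `group_size`
    # elements. A row-parallel layer gives each tensor-parallel rank
    # input_size / tp_size of that dimension, so each rank's slice must
    # hold a whole number of groups.
    per_rank = input_size // tp_size
    if per_rank % group_size != 0:
        raise ValueError(
            "The input size is not aligned with the quantized weight shape. "
            "This can be caused by too large tensor parallel size.")

assert_awq_shardable(8192, 128, 8)   # OK: 1024 per rank = 8 whole groups
assert_awq_shardable(29568, 128, 8)  # raises: 3696 per rank = 28.875 groups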

QuanhuiGuan commented 4 months ago

> @QuanhuiGuan Can you share the config.json with the quantization section here? Maybe there is a group-size issue that is causing uneven sharding.

Here is the config.json file content. I used the official AWQ code to quantize my model. Could you give me some suggestions? @mgoin

[screenshot of config.json attached]

AU-CPU commented 4 months ago

Could you tell me whether the official AWQ code you mention is https://github.com/vllm-project/vllm/blob/v0.5.0/examples/fp8/quantizer/quantize.py? When I tried AWQ quantization I hit an obvious error: only one file was generated, and config.json could not be produced correctly. Please walk me through your AWQ quantization process.

mgoin commented 4 months ago

This is the vLLM docs page for making AWQ models: https://docs.vllm.ai/en/latest/quantization/auto_awq.html. What you linked to seems to be code within the fp8 quantizer, which, at least for AWQ models, vLLM doesn't use.

chenchun0629 commented 4 months ago

# quant code
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AwqConfig

model_path = "/path to/Qwen2-7B-Instruct"
quant_path = "/path to/Qwen2-7B-Instruct-v1-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)

# quantization_config = AwqConfig(
#     bits=quant_config["w_bit"],
#     group_size=quant_config["q_group_size"],
#     zero_point=quant_config["zero_point"],
#     version=quant_config["version"].lower(),
# ).to_dict()
# 
# model.model.config.quantization_config = quantization_config

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

{
  "_name_or_path": "/workspace/llm_models/Qwen2-72B-Instruct/",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 29568,
  "max_position_embeddings": 32768,
  "max_window_layers": 70,
  "model_type": "qwen2",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "quantization_config": {
    "bits": 4,
    "group_size": 128,
    "modules_to_not_convert": null,
    "quant_method": "awq",
    "version": "gemm",
    "zero_point": true
  },
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.42.3",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}
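
Working through the numbers in this config: down_proj is a row-parallel layer whose input dimension is intermediate_size = 29568 = 128 × 231, i.e. 231 AWQ groups at group_size 128. With tensor_parallel_size 8, each rank gets 29568 / 8 = 3696 elements, and 3696 / 128 = 28.875 groups, so the quantized weight cannot be sharded evenly and create_weights raises the ValueError seen in the logs below. Since 231 is odd, no power-of-two tensor_parallel_size greater than 1 divides it evenly at group_size 128; hidden_size = 8192 shards cleanly, which is why the traceback points at down_proj specifically.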
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 21112  \
  --model /workspace/llm_models/Qwen2-72B-Instruct-awq-4-v1/     \
  --served-model-name qwen-72b  \
  --trust-remote-code     \
  --max-model-len 8192  \
  --chat-template /workspace/template/qwen.tpl   \
  --disable-custom-all-reduce   \
  --enforce-eager   \
  --tensor-parallel-size 8 \
  --quantization awq
INFO 07-08 12:32:17 api_server.py:206] vLLM API server version 0.5.1
INFO 07-08 12:32:17 api_server.py:207] args: Namespace(host='0.0.0.0', port=21112, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template='/workspace/template/qwen.tpl', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/workspace/llm_models/Qwen2-72B-Instruct-awq-4-v1/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization='awq', rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=['sl-qwen-72b'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 07-08 12:32:17 config.py:244] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 07-08 12:32:17 config.py:698] Defaulting to use mp for distributed inference
INFO 07-08 12:32:17 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/workspace/llm_models/Qwen2-72B-Instruct-awq-4-v1/', speculative_config=None, tokenizer='/workspace/llm_models/Qwen2-72B-Instruct-awq-4-v1/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=sl-qwen-72b, use_v2_block_manager=False, enable_prefix_caching=False)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=278168) INFO 07-08 12:32:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=278167) INFO 07-08 12:32:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=278164) INFO 07-08 12:32:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=278166) INFO 07-08 12:32:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=278165) INFO 07-08 12:32:28 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=278169) INFO 07-08 12:32:28 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=278170) INFO 07-08 12:32:28 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=278168) INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=278168) INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=278164) INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=278164) INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=278165) INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=278165) INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=278166) INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=278166) INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=278167) INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=278167) INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=278169) INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=278169) INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=278170) INFO 07-08 12:32:30 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=278170) INFO 07-08 12:32:30 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size., Traceback (most recent call last):
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process     
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]              ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/worker/worker.py", line 133, in load_model
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]     self.model_runner.load_model()
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 243, in load_model
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]     self.model = get_model(
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]                  ^^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]     return loader.load_model(model_config=model_config,
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 267, in load_model
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]     model = _initialize_model(model_config, self.load_config,
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 104, in _initialize_model    
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]     return model_class(config=model_config.hf_config,
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 316, in __init__
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]     self.model = Qwen2Model(config, cache_config, quant_config)
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 240, in __init__
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]     self.layers = nn.ModuleList([
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]                                 ^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 241, in <listcomp>
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]     Qwen2DecoderLayer(config, cache_config, quant_config)
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 183, in __init__
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]     self.mlp = Qwen2MLP(
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]                ^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 68, in __init__
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]     self.down_proj = RowParallelLinear(intermediate_size,
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 705, in __init__
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]     self.quant_method.create_weights(
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/awq.py", line 92, in create_weights    
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]     raise ValueError(
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226] ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
(VllmWorkerProcess pid=278164) ERROR 07-08 12:32:42 multiproc_worker_utils.py:226]
[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 216, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 360, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 507, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 243, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:                           ^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 153, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 128, in __init__
[rank0]:     super().__init__(model_config, cache_config, parallel_config,
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 42, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 79, in _init_executor
[rank0]:     self._run_workers("load_model",
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 130, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/worker/worker.py", line 133, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 243, in load_model
[rank0]:     self.model = get_model(
[rank0]:                  ^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 267, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 104, in _initialize_model
[rank0]:     return model_class(config=model_config.hf_config,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 316, in __init__
[rank0]:     self.model = Qwen2Model(config, cache_config, quant_config)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 240, in __init__
[rank0]:     self.layers = nn.ModuleList([
[rank0]:                                 ^
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 241, in <listcomp>
[rank0]:     Qwen2DecoderLayer(config, cache_config, quant_config)
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 183, in __init__
[rank0]:     self.mlp = Qwen2MLP(
[rank0]:                ^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 68, in __init__
[rank0]:     self.down_proj = RowParallelLinear(intermediate_size,
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 705, in __init__
[rank0]:     self.quant_method.create_weights(
[rank0]:   File "/workspace/miniconda3/envs/lf/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/awq.py", line 92, in create_weights
[rank0]:     raise ValueError(
[rank0]: ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
ERROR 07-08 12:32:43 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 278169 died, exit code: -15
INFO 07-08 12:32:43 multiproc_worker_utils.py:123] Killing local vLLM worker processes
/workspace/miniconda3/envs/lf/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
vllm                              0.5.1
vllm-flash-attn                   2.5.9

Same problem here.

huq2 commented 3 months ago

@chenchun0629 I am facing the same issue. Could you please let me know if you were able to solve it?

chenchun0629 commented 3 months ago

> @chenchun0629 I am facing the same issue. Could you please let me know if you were able to solve it?

You can try setting q_group_size to 64 (for 2 GPUs) or 32 (for 4 GPUs).
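
To spell that out: 29568 / 2 = 14784 = 64 × 231 and 29568 / 4 = 7392 = 32 × 231, so those group sizes give each rank a whole number of groups. Below is a minimal re-quantization sketch reusing the AutoAWQ calls from the script earlier in this thread; the paths and the assert are illustrative, not a verified recipe:

# Re-quantize with a group size that divides the per-rank input size.
# Reuses the AutoAWQ API shown in the quant script above.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/path to/Qwen2-72B-Instruct"
quant_path = "/path to/Qwen2-72B-Instruct-awq-g64"

tp_size = 2
intermediate_size = 29568  # from the config.json above
group_size = 64            # 29568 / 2 = 14784 = 64 * 231, so TP=2 shards evenly

# Sanity check before spending hours quantizing: each rank's slice of the
# down_proj input dimension must hold a whole number of AWQ groups.
assert (intermediate_size // tp_size) % group_size == 0, \
    "per-rank input size is not a multiple of the group size"

quant_config = {"zero_point": True, "q_group_size": group_size,
                "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)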