Seek help, `Qwen-14B-Chat-Int4` ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.

huangyunxin commented 9 months ago

Docker image `vllm/vllm-openai:v0.2.7`, startup command:

docker run -it -d -p 5003:8000 \
--name vllm-api \
-v $(pwd)/huggingface:/root/.cache/huggingface \
-v $(pwd)/Qwen-14B-Chat-Int4:/workspace/qwen \
--runtime nvidia --gpus all \
--ipc=host \
vllm/vllm-openai:v0.2.7 \
--trust-remote-code \
--dtype auto \
--tensor-parallel-size 2 \
--quantization gptq \
--model qwen

The GPUs are two NVIDIA GeForce RTX 4090 cards (24 GB each).

Using Qwen-7B-Chat-Int4 works properly; using Qwen-14B-Chat-Int4 gives this error:

INFO 02-01 01:41:59 api_server.py:727] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, model='qwen', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='gptq', enforce_eager=False, max_context_len_to_capture=8192, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 02-01 01:41:59 config.py:175] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-02-01 01:42:02,376 INFO worker.py:1724 -- Started a local Ray instance.
INFO 02-01 01:42:04 llm_engine.py:70] Initializing an LLM engine with config: model='qwen', tokenizer='qwen', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=gptq, enforce_eager=False, seed=0)
WARNING 02-01 01:42:05 tokenizer.py:62] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/workspace/vllm/entrypoints/openai/api_server.py", line 737, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/workspace/vllm/engine/async_llm_engine.py", line 500, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
File "/workspace/vllm/engine/async_llm_engine.py", line 273, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/workspace/vllm/engine/async_llm_engine.py", line 318, in _init_engine
return engine_class(*args, **kwargs)
File "/workspace/vllm/engine/llm_engine.py", line 109, in __init__
self._init_workers_ray(placement_group)
File "/workspace/vllm/engine/llm_engine.py", line 249, in _init_workers_ray
self._run_workers(
File "/workspace/vllm/engine/llm_engine.py", line 795, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/workspace/vllm/worker/worker.py", line 81, in load_model
self.model_runner.load_model()
File "/workspace/vllm/worker/model_runner.py", line 64, in load_model
self.model = get_model(self.model_config)
File "/workspace/vllm/model_executor/model_loader.py", line 65, in get_model
model = model_class(model_config.hf_config, linear_method)
File "/workspace/vllm/model_executor/models/qwen.py", line 231, in __init__
self.transformer = QWenModel(config, linear_method)
File "/workspace/vllm/model_executor/models/qwen.py", line 193, in __init__
self.h = nn.ModuleList([
File "/workspace/vllm/model_executor/models/qwen.py", line 194, in <listcomp>
QWenBlock(config, linear_method)
File "/workspace/vllm/model_executor/models/qwen.py", line 147, in __init__
self.mlp = QWenMLP(config.hidden_size,
File "/workspace/vllm/model_executor/models/qwen.py", line 49, in __init__
self.c_proj = RowParallelLinear(intermediate_size,
File "/workspace/vllm/model_executor/layers/linear.py", line 495, in __init__
self.linear_weights = self.linear_method.create_weights(
File "/workspace/vllm/model_executor/layers/quantization/gptq.py", line 100, in create_weights
raise ValueError(
ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.

The value of input_size_per_partition is 6848, and the value of self.quant_config.group_size is 128.
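
For reference, the two values above can be checked by hand. This is a minimal sketch, assuming the check in create_weights is a plain divisibility test of the per-partition input size by the group size (the vLLM source itself is not quoted here). The name `full_input_size` is hypothetical and assumes the layer input is split evenly across the 2 GPUs.

```python
# Values reported above for Qwen-14B-Chat-Int4 with --tensor-parallel-size 2.
input_size_per_partition = 6848
group_size = 128

# Assumed check: each partition must hold a whole number of quantization groups.
groups, remainder = divmod(input_size_per_partition, group_size)
print(groups, remainder)  # 53 whole groups, with 64 left over

# Assumption: the input is split evenly over 2 GPUs, so the unsplit size is 6848 * 2.
full_input_size = input_size_per_partition * 2
print(full_input_size, full_input_size % group_size)  # remainder 0 on a single GPU
```

If that assumption holds, the non-zero remainder with two partitions would explain why the same model loads with a tensor parallel size of 1.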

Using a single GPU with --tensor-parallel-size 1 starts properly, but the answer is wrong:


request:
{
"max_tokens": 200,
"model": "qwen",
"stream": false,
"stop": ["<|endoftext|>","<|im_end|>","<|im_start|>"],
"messages": [
    {
        "role": "user",
        "content": "hello"
    }
]
}

response:
{
"id": "cmpl-5d4e4ffb826c4ec6972e72453d9698f9",
"object": "chat.completion",
"created": 599739,
"model": "qwen",
"choices": [
    {
        "index": 0,
        "message": {
            "role": "assistant",
            "content": "The answer is 1210."
        },
        "finish_reason": "stop"
    }
],
"usage": {
    "prompt_tokens": 9,
    "total_tokens": 19,
    "completion_tokens": 10
}
}


Is Qwen-14B-Chat-Int4 not supported, or am I using it the wrong way? Thank you for all your assistance.

whyiug commented 9 months ago

1

lzhfe commented 9 months ago

Same problem.

skyantao commented 9 months ago

Same problem. One GPU works fine; two GPUs report this error.

hediyuan commented 8 months ago

+1

maxin9966 commented 8 months ago

This happened to me, too. Same problem with Qwen1.5-14B-Chat-AWQ.

lichtud commented 8 months ago

Fixed it! Set CUDA_VISIBLE_DEVICES as in the following example:

import os

# Set the visible GPUs before vLLM is imported and the engine is created.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

from vllm import LLM

llm = LLM(
    model="/root/autodl-tmp/qwen/Qwen1.5-7B-Chat-GPTQ-Int4",
    tensor_parallel_size=2,
    trust_remote_code=True,
    dtype='float16',
    enforce_eager=False,
    gpu_memory_utilization=0.9,
    max_model_len=9000,
)

maxin9966 commented 8 months ago

@lichtud

I used the following command, but it didn't work (vllm 0.3.1):

CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen1.5-14B-Chat-AWQ \
--gpu-memory-utilization 0.9 \
--quantization awq \
--host 0.0.0.0 \
--port 1234 \
--tensor-parallel-size 2 \
--max-model-len 25000 \
--served-model-name gpt \
--trust-remote-code \
--enforce-eager

maxin9966 commented 8 months ago

Using --tensor-parallel-size 1, everything works fine.

Using --tensor-parallel-size 2, I get:

File "/home/ma/ENTER/envs/myenv/lib/python3.9/site-packages/vllm/model_executor/layers/quantization/awq.py", line 85, in create_weights
raise ValueError(
ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.

lichtud commented 8 months ago

If you are using the API server, it indeed does not work, unlike the LLM class. Setting os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' only works with the LLM class. I am also waiting for a solution for the API server.

maxin9966 commented 8 months ago

@lichtud This is the explanation from GPT. There shouldn't be much difference, right? If it still doesn't work, do we need to wait for the next version for a fix?

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' sets the environment variable CUDA_VISIBLE_DEVICES to '0,1' within a Python script, meaning that the Python program will use GPU devices 0 and 1 when running.

On the other hand, CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server sets the environment variable CUDA_VISIBLE_DEVICES to '0,1' directly in the command line and then launches a Python module vllm.entrypoints.openai.api_server with the specified environment variable.

In terms of functionality, both methods set the environment variable CUDA_VISIBLE_DEVICES to '0,1', but they differ in the way it's done: one is within a Python script and the other is in the command line.
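
The equivalence described above can be checked with a small script. This is only a sketch of what a Python process sees in its environment under each method; the helper names are hypothetical, and it says nothing about how vLLM itself reads the variable.

```python
import os
import subprocess
import sys

# Child command that only reports what it sees in its own environment.
CHILD = [
    sys.executable,
    "-c",
    "import os; print(os.environ.get('CUDA_VISIBLE_DEVICES'))",
]


def seen_when_set_on_command_line(value):
    """Mimic `CUDA_VISIBLE_DEVICES=value python ...` by passing env to a child."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=value)
    result = subprocess.run(
        CHILD, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def seen_when_set_in_script(value):
    """Mimic `os.environ['CUDA_VISIBLE_DEVICES'] = value` inside a script."""
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return os.environ["CUDA_VISIBLE_DEVICES"]


print(seen_when_set_on_command_line("0,1"))
print(seen_when_set_in_script("0,1"))
```

Both calls report the same value, which is consistent with the error having a different cause than the way the variable is set.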

lichtud commented 8 months ago

A member of the Qwen team told me that the vLLM kernel requires intermediate_size (13696) to be an integer multiple of the quantization group_size (128) * TP (tensor-parallel-size), so currently TP can only be 1. In addition, the intermediate_size of Qwen1.5-72B-Chat-GPTQ-Int4 is 24576 and that of the 7B model is 11008; you can find these values in config.json.
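
The rule above can be turned into a quick check. This is a minimal sketch that tests only the divisibility condition stated in this comment; the function name `valid_tp_sizes` and the upper bound of 8 GPUs are my own assumptions, and a real deployment may have further constraints that this does not cover.

```python
def valid_tp_sizes(intermediate_size, group_size, max_tp=8):
    """Return the TP sizes for which intermediate_size is a multiple of group_size * TP."""
    return [
        tp
        for tp in range(1, max_tp + 1)
        if intermediate_size % (group_size * tp) == 0
    ]


# intermediate_size values quoted in this thread, with group_size 128.
for name, size in [("14B", 13696), ("72B", 24576), ("7B", 11008)]:
    print(name, size, valid_tp_sizes(size, 128))
```

For 13696 the only value that passes is 1, because 13696 / 128 = 107 and 107 is prime.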

hahazei commented 8 months ago

The same error. I use this:

python -m vllm.entrypoints.openai.api_server \
--model /data/model/baichuan2-13b-chat-awq/ \
--tensor-parallel-size \
--trust-remote-code \
--quantization awq

Error message:

File "/home/vllm/vllm/vllm/model_executor/layers/linear.py", line 494, in __init__
self.linear_weights = self.linear_method.create_weights(
File "/home/vllm/vllm/vllm/model_executor/layers/quantization/awq.py", line 85, in create_weights
raise ValueError(
ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.

zollty commented 7 months ago

So vLLM doesn't support multiple GPUs for this model.

ccp123456789 commented 6 months ago

I quantized the model with AWQ myself; the intermediate_size is 13696, and there is still an error.

Ximingwang-09 commented 6 months ago

I have the same problem. Do you have any solutions?

puppetm4st3r commented 4 months ago

What worked for me, on another serving engine but with the same problem, was changing the group_size in the quantization process to satisfy the requirement of being a multiple of

fersebas commented 4 months ago

How did you solve it for 4 GPUs? Which group_size do I have to use?

wangye360 commented 4 months ago

Qwen2-72B has the same issue when using GPTQ with tensor parallelism, but I solved it with this method:

Set group_size to 64, so that intermediate_size (29568 = 128 * 3 * 7 * 11) is an integer multiple of group_size * TP (tensor-parallel-size). Setting group_size to 2 * 7 * 11 = 154 does not work.
Change "GPTQ_MARLIN_MIN_THREAD_K = 128" to 64 in the file "python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py".
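
To see which group_size and TP combinations pass the divisibility rule for intermediate_size 29568, here is a minimal sketch. The function name `tp_sizes_that_divide` and the candidate TP sizes (1, 2, 4, 8) are my own assumptions; the code tests only the arithmetic condition discussed in this thread.

```python
def tp_sizes_that_divide(intermediate_size, group_size, candidates=(1, 2, 4, 8)):
    """Return the candidate TP sizes where intermediate_size % (group_size * TP) == 0."""
    return [
        tp for tp in candidates if intermediate_size % (group_size * tp) == 0
    ]


intermediate_size = 29568  # 128 * 3 * 7 * 11, as stated above

for group_size in (128, 64, 32, 154):
    print(group_size, tp_sizes_that_divide(intermediate_size, group_size))
```

With group_size 64 the rule passes for TP 1 and 2 but not 4, so by this arithmetic 4 GPUs would need group_size 32. Note that 154 also passes the arithmetic, yet it is reported above as not working, so divisibility is evidently not the only constraint; the thread does not say what the other one is.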

vllm-project / vllm

Seek help, `Qwen-14B-Chat-Int4` ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size. #2699