vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0
29.64k stars 4.47k forks

Seek help, `Qwen-14B-Chat-Int4` ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size. #2699

Closed huangyunxin closed 8 months ago

huangyunxin commented 9 months ago

response:

```json
{
  "id": "cmpl-5d4e4ffb826c4ec6972e72453d9698f9",
  "object": "chat.completion",
  "created": 599739,
  "model": "qwen",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The answer is 1210."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 19,
    "completion_tokens": 10
  }
}
```


* Is Qwen-14B-Chat-Int4 not supported, or am I using it the wrong way? Thank you for all your assistance.
whyiug commented 9 months ago
lzhfe commented 9 months ago

Same problem.

skyantao commented 9 months ago

Same problem. One GPU works fine! Two GPUs report this error!

hediyuan commented 8 months ago

+1

maxin9966 commented 8 months ago

This happened to me, too. Same problem with Qwen1.5-14B-Chat-AWQ.

lichtud commented 8 months ago

Fixed it! Add the CUDA_VISIBLE_DEVICES parameter as in the following example:

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

llm = LLM(model="/root/autodl-tmp/qwen/Qwen1.5-7B-Chat-GPTQ-Int4",
          tensor_parallel_size=2,
          trust_remote_code=True,
          dtype='float16',
          enforce_eager=False,
          gpu_memory_utilization=0.9,
          max_model_len=9000)
```

maxin9966 commented 8 months ago

> Fixed it! Add the CUDA_VISIBLE_DEVICES parameter as in the following example:
>
> ```python
> import os
> os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
>
> llm = LLM(model="/root/autodl-tmp/qwen/Qwen1.5-7B-Chat-GPTQ-Int4",
>           tensor_parallel_size=2,
>           trust_remote_code=True,
>           dtype='float16',
>           enforce_eager=False,
>           gpu_memory_utilization=0.9,
>           max_model_len=9000)
> ```

@lichtud

I used the following command, but it didn't work (vllm 0.3.1):

```shell
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen1.5-14B-Chat-AWQ \
    --gpu-memory-utilization 0.9 \
    --quantization awq \
    --host 0.0.0.0 --port 1234 \
    --tensor-parallel-size 2 \
    --max-model-len 25000 \
    --served-model-name gpt \
    --trust-remote-code \
    --enforce-eager
```

maxin9966 commented 8 months ago

Using `--tensor-parallel-size 1`, everything works fine.

Using `--tensor-parallel-size 2`:

```
  File "/home/ma/ENTER/envs/myenv/lib/python3.9/site-packages/vllm/model_executor/layers/quantization/awq.py", line 85, in create_weights
    raise ValueError(
ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
```

lichtud commented 8 months ago

> Using `--tensor-parallel-size 1`, everything works fine.
>
> Using `--tensor-parallel-size 2`:
>
> ```
>   File "/home/ma/ENTER/envs/myenv/lib/python3.9/site-packages/vllm/model_executor/layers/quantization/awq.py", line 85, in create_weights
>     raise ValueError(
> ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
> ```

If you use the API server method, it actually doesn't work; setting os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' only works with the LLM method. I am also waiting for a solution for the API server method.

maxin9966 commented 8 months ago

@lichtud This is the explanation from GPT. There shouldn't be much difference, right? If it still doesn't work, do we need to wait for the next version for a fix?

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' sets the environment variable CUDA_VISIBLE_DEVICES to '0,1' within a Python script, meaning that the Python program will use GPU devices 0 and 1 when running.

On the other hand, CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server sets the environment variable CUDA_VISIBLE_DEVICES to '0,1' directly in the command line and then launches a Python module vllm.entrypoints.openai.api_server with the specified environment variable.

In terms of functionality, both methods set the environment variable CUDA_VISIBLE_DEVICES to '0,1', but they differ in the way it's done: one is within a Python script and the other is in the command line.
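To double-check the equivalence, here is a minimal sketch (not vLLM-specific) showing that an in-script assignment is inherited by child processes exactly like the shell prefix:

```python
import os
import subprocess
import sys

# Setting the variable inside the script...
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# ...is inherited by any child process, just like the shell prefix
# `CUDA_VISIBLE_DEVICES=0,1 python -m ...` would be.
child_value = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    capture_output=True,
    text=True,
).stdout.strip()
print(child_value)  # prints 0,1
```

One caveat: inside a script, the assignment must happen before CUDA is initialized (i.e. before importing torch or vllm), otherwise the visible-device list is already fixed.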

lichtud commented 8 months ago

> @lichtud This is the explanation from GPT. There shouldn't be much difference, right? If it still doesn't work, do we need to wait for the next version for a fix?
>
> os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' sets the environment variable CUDA_VISIBLE_DEVICES to '0,1' within a Python script, meaning that the Python program will use GPU devices 0 and 1 when running.
>
> On the other hand, CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server sets the environment variable CUDA_VISIBLE_DEVICES to '0,1' directly in the command line and then launches the Python module vllm.entrypoints.openai.api_server with the specified environment variable.
>
> In terms of functionality, both methods set the environment variable CUDA_VISIBLE_DEVICES to '0,1', but they differ in the way it's done: one is within a Python script and the other is in the command line.

A member of the Qwen team told me that the vLLM kernel requires intermediate_size (13696) to be an integer multiple of quantized group_size (128) * TP (tensor-parallel-size), so currently TP can only be 1. In addition, the intermediate_size of Qwen1.5-72B-Chat-GPTQ-Int4 is 24576 and that of the 7B model is 11008; you can find this in config.json.
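That alignment rule can be checked against a model's config.json before launching. A minimal sketch, assuming the Hugging Face config layout for `intermediate_size` and a hypothetical candidate set of TP sizes:

```python
import json

def allowed_tp_sizes(config_path, group_size=128, candidates=(1, 2, 4, 8)):
    """TP sizes for which intermediate_size is an integer multiple of
    group_size * TP, per the kernel requirement described above."""
    with open(config_path) as f:
        intermediate_size = json.load(f)["intermediate_size"]
    return [tp for tp in candidates if intermediate_size % (group_size * tp) == 0]

# Qwen1.5-14B: 13696 / 128 = 107 (odd), so only TP=1 passes.
# Qwen1.5-72B: 24576 / 128 = 192, so TP=1/2/4/8 all divide evenly.
```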

hahazei commented 8 months ago

The same error. I use this:

```shell
python -m vllm.entrypoints.openai.api_server \
    --model /data/model/baichuan2-13b-chat-awq/ \
    --tensor-parallel-size \
    --trust-remote-code \
    --quantization awq
```

Error message:

```
  File "/home/vllm/vllm/vllm/model_executor/layers/linear.py", line 494, in __init__
    self.linear_weights = self.linear_method.create_weights(
  File "/home/vllm/vllm/vllm/model_executor/layers/quantization/awq.py", line 85, in create_weights
    raise ValueError(
ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
```

zollty commented 7 months ago

So vLLM doesn't support multi-GPU for this model.

ccp123456789 commented 6 months ago

> @lichtud This is the explanation from GPT. There shouldn't be much difference, right? If it still doesn't work, do we need to wait for the next version for a fix?
>
> os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' sets the environment variable CUDA_VISIBLE_DEVICES to '0,1' within a Python script, meaning that the Python program will use GPU devices 0 and 1 when running. On the other hand, CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server sets the environment variable CUDA_VISIBLE_DEVICES to '0,1' directly in the command line and then launches the Python module vllm.entrypoints.openai.api_server with the specified environment variable. In terms of functionality, both methods set the environment variable CUDA_VISIBLE_DEVICES to '0,1', but they differ in the way it's done: one is within a Python script and the other is in the command line.
>
> A member of the Qwen team told me that the vLLM kernel requires intermediate_size (13696) to be an integer multiple of quantized group_size (128) * TP (tensor-parallel-size), so currently TP can only be 1. In addition, the intermediate_size of Qwen1.5-72B-Chat-GPTQ-Int4 is 24576 and that of the 7B model is 11008; you can find this in config.json.

> intermediate_size (13696)

I quantized the model using AWQ myself; the intermediate size is 13696, and there is still an error.

Ximingwang-09 commented 6 months ago

> This happened to me, too. Same problem with Qwen1.5-14B-Chat-AWQ.

I met the same problem. Do you have any solutions?

puppetm4st3r commented 4 months ago

What worked for me (on another serving engine, but with the same problem) was changing the group_size in the quantization process to satisfy the requirement that intermediate_size be a multiple of group_size × TP.

fersebas commented 4 months ago

> What worked for me (on another serving engine, but with the same problem) was changing the group_size in the quantization process to satisfy the requirement that intermediate_size be a multiple of group_size × TP.

How did you solve it for 4 GPUs? Which group_size do I have to use?

wangye360 commented 4 months ago

qwen2-72b has the same issue using GPTQ and parallelism, but I solved it with this method:

  1. Set group_size to 64, which fits intermediate_size (29568 = 128 × 3 × 7 × 11) being an integer multiple of quantized group_size × TP (tensor-parallel-size); but setting group_size to 2 × 7 × 11 = 154 is not ok.

  2. Change `GPTQ_MARLIN_MIN_THREAD_K = 128` to 64 in the file "python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py".

> @lichtud This is the explanation from GPT. There shouldn't be much difference, right? If it still doesn't work, do we need to wait for the next version for a fix?
>
> In terms of functionality, both methods set the environment variable CUDA_VISIBLE_DEVICES to '0,1', but they differ in the way it's done: one is within a Python script and the other is in the command line.
>
> A member of the Qwen team told me that the vLLM kernel requires intermediate_size (13696) to be an integer multiple of quantized group_size (128) * TP (tensor-parallel-size), so currently TP can only be 1. In addition, the intermediate_size of Qwen1.5-72B-Chat-GPTQ-Int4 is 24576 and that of the 7B model is 11008; you can find this in config.json.
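Following the factorization above (29568 = 128 × 3 × 7 × 11), one can enumerate which group sizes line up with a given TP. A small sketch; the candidate set 32/64/128 is an assumption about what the quantization kernels accept:

```python
def valid_group_sizes(intermediate_size, tp, candidates=(32, 64, 128)):
    """Candidate group sizes for which intermediate_size is an integer
    multiple of group_size * tp (the alignment rule from this thread).
    The candidate list is an assumption, not taken from vLLM."""
    return [g for g in candidates if intermediate_size % (g * tp) == 0]

# Qwen2-72B: intermediate_size = 29568 = 128 * 3 * 7 * 11
print(valid_group_sizes(29568, tp=2))  # [32, 64]
print(valid_group_sizes(29568, tp=4))  # [32]
```

This matches the observation in the thread: with group_size 128 the 14B models only run at TP=1, while re-quantizing with a smaller group_size opens up larger TP values.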