Closed: huangyunxin closed this issue 8 months ago
Same problem.
Same problem. With one GPU it works fine; with two GPUs it reports this error!
+1
This happened to me too, same problem with Qwen1.5-14B-Chat-AWQ.
Fixed it! Add the CUDA_VISIBLE_DEVICES parameter as in the following example:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

llm = LLM(
    model="/root/autodl-tmp/qwen/Qwen1.5-7B-Chat-GPTQ-Int4",
    tensor_parallel_size=2,
    trust_remote_code=True,
    dtype='float16',
    enforce_eager=False,
    gpu_memory_utilization=0.9,
    max_model_len=9000,
)
@lichtud
I used the following command, but it didn't work (vllm 0.3.1):

CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen1.5-14B-Chat-AWQ \
    --gpu-memory-utilization 0.9 \
    --quantization awq \
    --host 0.0.0.0 --port 1234 \
    --tensor-parallel-size 2 \
    --max-model-len 25000 \
    --served-model-name gpt \
    --trust-remote-code \
    --enforce-eager

Using --tensor-parallel-size 1, everything works fine.
Using --tensor-parallel-size 2:

File "/home/ma/ENTER/envs/myenv/lib/python3.9/site-packages/vllm/model_executor/layers/quantization/awq.py", line 85, in create_weights
    raise ValueError(
ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
If you are using the API server method, it actually doesn't work, unlike the LLM method: os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' only works with the LLM method. I'm also waiting for a solution for the API server method.
@lichtud This is the explanation from GPT. There shouldn't be much difference, right? If it still doesn't work, do we need to wait for the next version for a fix?
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' sets the environment variable CUDA_VISIBLE_DEVICES to '0,1' within a Python script, meaning that the Python program will use GPU devices 0 and 1 when running.
On the other hand, CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server sets the environment variable CUDA_VISIBLE_DEVICES to '0,1' directly in the command line and then launches a Python module vllm.entrypoints.openai.api_server with the specified environment variable.
In terms of functionality, both methods set the environment variable CUDA_VISIBLE_DEVICES to '0,1', but they differ in the way it's done: one is within a Python script and the other is in the command line.
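As an aside on the two forms above: the in-script assignment only takes effect if it runs before anything initializes CUDA in this process, so its position relative to the vLLM import and LLM construction matters. A minimal sketch of the in-script form (reusing the model path from the earlier comment; this only illustrates the ordering requirement, it is not a fix for the alignment error):

import os

# Must be set before CUDA is initialized in this process (i.e. before the
# first CUDA call made via torch/vllm); otherwise it has no effect.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

from vllm import LLM  # import after setting the variable, to be safe

llm = LLM(
    model="/root/autodl-tmp/qwen/Qwen1.5-7B-Chat-GPTQ-Int4",
    tensor_parallel_size=2,
    trust_remote_code=True,
)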
A member of the Qwen team told me that the vLLM kernel requires intermediate_size (13696) to be an integer multiple of the quantization group_size (128) * TP (tensor-parallel-size), so currently TP can only be 1. In addition, the intermediate_size of Qwen1.5-72B-Chat-GPTQ-Int4 is 24576 and of the 7B model it is 11008; you can find this in config.json.
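To make that constraint concrete, here is a small check of intermediate_size % (group_size * TP) == 0 using the numbers quoted above (group_size 128 for these Int4 checkpoints):

GROUP_SIZE = 128

# intermediate_size values quoted in this thread (from each model's config.json)
models = {
    "Qwen1.5-14B-Chat (AWQ/GPTQ-Int4)": 13696,
    "Qwen1.5-72B-Chat-GPTQ-Int4": 24576,
    "Qwen1.5-7B-Chat-GPTQ-Int4": 11008,
}

for name, intermediate_size in models.items():
    for tp in (1, 2, 4):
        aligned = intermediate_size % (GROUP_SIZE * tp) == 0
        print(f"{name}: TP={tp} -> {'aligned' if aligned else 'NOT aligned'}")

For 13696 this fails for any TP > 1 (13696 / 128 = 107, which is odd), matching the error above, while 24576 and 11008 divide cleanly for TP=2.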
the same error. I use this:
python -m vllm.entrypoints.openai.api_server \
    --model /data/model/baichuan2-13b-chat-awq/ \
    --tensor-parallel-size \
    --trust-remote-code \
    --quantization awq
error message
File "/home/vllm/vllm/vllm/model_executor/layers/linear.py", line 494, in __init__ self.linear_weights = self.linear_method.create_weights( File "/home/vllm/vllm/vllm/model_executor/layers/quantization/awq.py", line 85, in create_weights raise ValueError( ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
So vLLM doesn't support multi-GPU for this model.
Regarding "intermediate_size (13696)": I quantized the model with AWQ myself, the intermediate_size is 13696, and there is still an error.
I'm hitting the same problem; do you have any solutions?
What worked for me on another serving engine with the same problem was to change the group_size in the quantization process so that it satisfies the multiple-of requirement described above.
How did you solve it for 4 GPUs? Which group_size do I have to use?
qwen2-72b has the same issue using GPTQ and parallelism, but I solved it with this method:
Setting group_size to 64 makes intermediate_size (29568 = 128 * 3 * 7 * 11) an integer multiple of the quantized group_size * TP (tensor-parallel-size); setting group_size to 2 * 7 * 11 = 154 is not ok.
Also change "GPTQ_MARLIN_MIN_THREAD_K = 128" to 64 in the file "python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py".
vllm/vllm-openai:v0.2.7, NVIDIA GeForce RTX 4090 24G * 2.
Qwen-7B-Chat-Int4 works properly. Using Qwen-14B-Chat-Int4, the error is: input_size_per_partition value is 6848, self.quant_config.group_size value is 128 (6848 is not a multiple of 128; it is presumably the 13696 intermediate size split across the 2 GPUs).
With --tensor-parallel-size 1 it works properly, but the answer is wrong. Response:

{
  "id": "cmpl-5d4e4ffb826c4ec6972e72453d9698f9",
  "object": "chat.completion",
  "created": 599739,
  "model": "qwen",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The answer is 1210."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 19,
    "completion_tokens": 10
  }
}