vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Qwen1.5-14B-Chat deployed with vllm==0.3.3 on a Tesla V100-PCIE-32GB outputs only exclamation marks, no results #3998

Open li995495592 opened 5 months ago

li995495592 commented 5 months ago

Your current environment

Qwen1.5-14B-Chat deployed with vllm==0.3.3 on a Tesla V100-PCIE-32GB outputs only exclamation marks, no results.

🐛 Describe the bug

Qwen1.5-14B-Chat deployed with vllm==0.3.3 on a Tesla V100-PCIE-32GB outputs only exclamation marks, no results; no errors were reported during deployment. (screenshot attached)


fevolq commented 5 months ago

Try adding a system prompt at the very start of the conversation: {"role": "system", "content": "..."}
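For reference, a minimal sketch of such a request against a locally running vLLM OpenAI-compatible server (the base URL, model name, and prompts are placeholders, not the reporter's exact setup):

```python
# Hypothetical example: chat request with an explicit system message,
# sent to a vLLM OpenAI-compatible server serving Qwen1.5-14B-Chat.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen1.5-14B-Chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, please introduce yourself."},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```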

li995495592 commented 5 months ago

Already added it (see screenshot).

xin-li-67 commented 4 months ago

According to the Qwen team, please try v0.3.2.

Huarong commented 4 months ago

> According to the Qwen team, please try v0.3.2.

vLLM v0.3.2 has the same problem.

rbgo404 commented 4 months ago

I am facing the same issue with this model.

ailinbest commented 4 months ago

I also have this problem with a qwen1.5-0.5B-chat model that was supervised fine-tuned with transformers 4.38.2.

esmeetu commented 4 months ago

@li995495592 Could you try vllm==0.4.0.post1?

Huarong commented 4 months ago

> @li995495592 Could you try vllm==0.4.0.post1?

@esmeetu vllm 0.3.2, 0.3.3, 0.4.0, and 0.4.0.post1 all have this problem when serving qwen-1.5-14b-gptq-int4.

xin-li-67 commented 4 months ago

> @li995495592 Could you try vllm==0.4.0.post1?
>
> @esmeetu vllm 0.3.2, 0.3.3, 0.4.0, and 0.4.0.post1 all have this problem when serving qwen-1.5-14b-gptq-int4.

Hi @Huarong, I compiled the latest version (0.4.0.post1) of vLLM locally and successfully ran both the offline inference demo and the OpenAI-style API server. Screenshot attached.
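For reference, a minimal sketch along the lines of the offline inference demo mentioned above (model name, prompt, and sampling settings are assumptions, not the exact values used in the screenshot):

```python
# Hypothetical sketch of offline inference with vLLM for Qwen1.5-14B-Chat.
from vllm import LLM, SamplingParams

# V100 has no bfloat16 support, so float16 is the usual choice on that GPU.
llm = LLM(model="Qwen/Qwen1.5-14B-Chat", dtype="float16")

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

# Qwen1.5 chat models expect ChatML-formatted prompts when using generate() directly.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nHello, please introduce yourself.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```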

Huarong commented 4 months ago

Thanks for trying, @xin-li-67. The !!!! output still shows up from time to time depending on the prompt. Can you try with more than 100 samples?
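A rough way to run that kind of batch check (the checkpoint path is a placeholder, and the prompt list should be replaced with 100+ real prompts from the affected workload):

```python
# Hypothetical batch check: count how many completions are nothing but "!".
from vllm import LLM, SamplingParams

llm = LLM(model="/models/qwen1.5-14b-chat-gptq-int4",  # placeholder path
          quantization="gptq", dtype="float16")

# Replace with 100+ real ChatML-formatted prompts from your data.
prompts = ["<|im_start|>user\nQuestion 1<|im_end|>\n<|im_start|>assistant\n"]

outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=256))

bad = [o for o in outputs
       if o.outputs[0].text.strip()
       and all(ch == "!" for ch in o.outputs[0].text.strip())]
print(f"{len(bad)}/{len(outputs)} completions are all exclamation marks")
```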

Huarong commented 4 months ago

More details:

We get correct results from the bf16 model, but when running inference with our fine-tuned qwen1.5-14b-gptq-int4 model, NaNs can appear on prompts where the output probability is very high. The output then becomes a long run of !!!!!, and the !!! mainly follows digits such as 1 or 2.

auto-gptq is probably not the culprit, because the results are fine when we run inference with transformers instead of vLLM.

vLLM serving with AWQ int4 works fine.
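A minimal sketch of the two configurations being compared here (the checkpoint paths are placeholders for the fine-tuned quantized models; in practice you would load one at a time on a single V100):

```python
# Hypothetical comparison: identical call path, only the quantized checkpoint differs.
from vllm import LLM

# GPTQ int4 checkpoint -- the one that intermittently emits "!!!" / NaNs under vLLM.
llm = LLM(model="/models/qwen1.5-14b-chat-gptq-int4",
          quantization="gptq", dtype="float16")

# AWQ int4 checkpoint -- reported to serve fine under the same setup.
# llm = LLM(model="/models/qwen1.5-14b-chat-awq-int4",
#           quantization="awq", dtype="float16")
```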

Versions:

kratorado commented 3 months ago

Is there any progress?

kasoushu commented 3 months ago

How can this be solved? My qwen1.5-0.5b model, trained on the hh dataset, can barely generate a normal response with vLLM, yet I get normal responses with HF.
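A side-by-side sanity check along those lines could look like this (the checkpoint path and prompt are placeholders; run the two halves separately if GPU memory is tight):

```python
# Hypothetical check: send the identical chat-templated prompt through transformers and vLLM.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "/models/my-sft-qwen1.5-0.5b-chat"  # placeholder for the fine-tuned checkpoint
tok = AutoTokenizer.from_pretrained(model_path)
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Hello, who are you?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Hugging Face transformers generation
hf_model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto")
inputs = tok(prompt, return_tensors="pt").to(hf_model.device)
hf_out = hf_model.generate(**inputs, max_new_tokens=128)
print("transformers:", tok.decode(hf_out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

# vLLM generation with the exact same prompt string
llm = LLM(model=model_path, dtype="float16")
out = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=128))
print("vLLM:", out[0].outputs[0].text)
```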

zhang-xh95 commented 3 months ago

I met the same problem.

fevolq commented 3 months ago

Could you try a Chinese prompt? (see screenshot)

Sanster commented 2 months ago

Same problem here with the glm-4-9b-chat model.

Update: reducing --max-model-len from 8000 to 6144 and --gpu-memory-utilization from 0.95 to 0.9 fixed the problem.
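The same settings in the offline API would look roughly like this (the trust_remote_code flag is an assumption for glm-4; the two numbers are taken from the comment above):

```python
# Hypothetical equivalent of the server flags that fixed glm-4-9b-chat here.
from vllm import LLM

llm = LLM(
    model="THUDM/glm-4-9b-chat",
    trust_remote_code=True,        # assumption: glm-4 ships custom tokenizer/model code
    max_model_len=6144,            # reduced from 8000
    gpu_memory_utilization=0.9,    # reduced from 0.95
)
```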

Rose-upForever commented 2 months ago

Is there any progress?

zxjhellow2 commented 1 month ago

Same problem here. Can anyone explain it? I tried both an English prompt and a Chinese prompt (screenshots attached); the Chinese one works normally.

Cyich commented 2 weeks ago

> I also have this problem with a qwen1.5-0.5B-chat model that was supervised fine-tuned with transformers 4.38.2.

Have you solved this problem? I am also troubled by it. The plain (non-SFT) qwen1.5-0.5B-chat model works fine, but my SFT-ed qwen1.5-0.5B-chat model does not when running inference with vLLM (inference with the official script works fine); the output just repeats. My vLLM version is 0.3.0. I am not sure how to fix this. If you have solved it, please share, thanks! My repeated output looks like the attached screenshot.