vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Yi-34B-Chat-4bits-GPTQ keeps outputting empty "" tokens until reaching max_length #2294

Open zhanghx0905 opened 10 months ago

zhanghx0905 commented 10 months ago

Running on a V100 32GB, the startup script is:

CUDA_VISIBLE_DEVICES=2,3 python \
    -m vllm.entrypoints.openai.api_server \
    --model="../models/Yi-34B-Chat-4bits-GPTQ" \
    --dtype half --port 8080 --served-model-name Yi-34B-Chat-4bits-GPTQ

This occurs after multiple rounds of dialogue, where the server continuously outputs empty tokens "" and does not stop normally.


Here is an instance of the request that caused the issue.

curl -X 'POST' \
  'http://10.223.48.160:30002/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "Yi-34B-Chat-4bits-GPTQ",
  "messages": [
    {"role": "system", "content": "你是XXX,由XXX集团的研发团队独立开发的大语言模型,你的使命是协助公司员工高效完成工作。现在,请开始你的工作。"},
    {"role": "user", "content": "你好吗"},
    {"role": "assistant", "content": "我很好,谢谢你的关心。我准备随时协助你解答问题或完成任务。请问你有任何具体的问题或者需要帮助的地方吗?"},
    {"role": "user", "content": "好尼玛"},
    {"role": "assistant", "content": "很好,很高兴听到你状态良好。如果你在工作中遇到问题或者需要帮助,请随时提问。我会尽力提供帮助。"},
    {"role": "user", "content": "哈哈哈"},
    {"role": "assistant", "content": "看起来你似乎很开心。如果你想要分享更多关于你的工作、生活中的积极经历,或者需要建议和指导,请随时告诉我。我会在力所能及的范围内提供帮助。"},
    {"role": "user", "content": "呵呵呵呵呵"}
  ],
  "temperature": 0.7,
  "top_p": 1,
  "n": 1,
  "max_tokens": 1024,
  "stream": true,
  "presence_penalty": 0,
  "frequency_penalty": 0,
  "user": "string",
  "best_of": 1,
  "top_k": -1,
  "ignore_eos": false,
  "use_beam_search": false,
  "stop_token_ids": [
    7
  ],
  "skip_special_tokens": true,
  "spaces_between_special_tokens": true,
  "add_generation_prompt": true,
  "echo": false,
  "repetition_penalty": 1,
  "min_p": 0
}'
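
As a sanity check, the stop_token_ids sent above can be compared against the ids the model's tokenizer actually assigns to the ChatML-style special tokens. A minimal sketch using the Hugging Face tokenizer, reusing the model path from the startup script above (illustrative only):

from transformers import AutoTokenizer

# Load the tokenizer shipped with the checkpoint (path taken from the startup script above).
tok = AutoTokenizer.from_pretrained("../models/Yi-34B-Chat-4bits-GPTQ")

# Print the ids of the ChatML-style special tokens so they can be compared
# with the stop_token_ids used in the request.
for token in ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|im_sep|>"]:
    print(token, tok.convert_tokens_to_ids(token))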

After getting stuck, the server keeps outputting spaces until it eventually stops. After that, any subsequent requests sent to it get stuck in the same way.

data: {"id": "cmpl-49bf9c52893f4bd1ab1f5f107a7011ce", "object": "chat.completion.chunk", "created": 1186536, "model": "Yi-34B-Chat-4bits-GPTQ", "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}

data: {"id": "cmpl-49bf9c52893f4bd1ab1f5f107a7011ce", "object": "chat.completion.chunk", "created": 1186536, "model": "Yi-34B-Chat-4bits-GPTQ", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}

data: {"id": "cmpl-49bf9c52893f4bd1ab1f5f107a7011ce", "object": "chat.completion.chunk", "created": 1186536, "model": "Yi-34B-Chat-4bits-GPTQ", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}

data: {"id": "cmpl-49bf9c52893f4bd1ab1f5f107a7011ce", "object": "chat.completion.chunk", "created": 1186536, "model": "Yi-34B-Chat-4bits-GPTQ", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}

data: {"id": "cmpl-49bf9c52893f4bd1ab1f5f107a7011ce", "object": "chat.completion.chunk", "created": 1186536, "model": "Yi-34B-Chat-4bits-GPTQ", "choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": null}]}
...
data: {"id": "cmpl-49bf9c52893f4bd1ab1f5f107a7011ce", "object": "chat.completion.chunk", "created": 1186536, "model": "Yi-34B-Chat-4bits-GPTQ", "choices": [{"index": 0, "delta": {}, "finish_reason": "length"}], "usage": {"prompt_tokens": 183, "total_tokens": 1206, "completion_tokens": 1023}}

data: [DONE]
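
As a client-side stopgap, the stream can be abandoned once too many consecutive empty deltas arrive, instead of waiting for max_tokens to be exhausted. A minimal sketch with the OpenAI Python client; the base URL, API key, and the threshold of 20 are assumptions, not values from this issue:

from openai import OpenAI

# Placeholder base_url/api_key for the vLLM OpenAI-compatible server (port from the startup script).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Yi-34B-Chat-4bits-GPTQ",
    messages=[{"role": "user", "content": "你好吗"}],
    max_tokens=1024,
    stream=True,
)

parts, empty_run = [], 0
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    empty_run = empty_run + 1 if delta == "" else 0
    parts.append(delta)
    if empty_run >= 20:  # assumed threshold: give up after 20 consecutive empty chunks
        break            # stop reading instead of waiting for the length limit
print("".join(parts))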

zhanghx0905 commented 10 months ago
INFO 01-02 07:18:44 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.0%, CPU KV cache usage: 0.0%
INFO 01-02 07:18:49 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.4%, CPU KV cache usage: 0.0%
INFO 01-02 07:18:54 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.8%, CPU KV cache usage: 0.0%
INFO 01-02 07:18:59 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 24.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.1%, CPU KV cache usage: 0.0%
INFO 01-02 07:19:04 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.4%, CPU KV cache usage: 0.0%
INFO 01-02 07:19:09 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.7%, CPU KV cache usage: 0.0%
INFO 01-02 07:19:14 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.0%, CPU KV cache usage: 0.0%
INFO 01-02 07:19:19 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.3%, CPU KV cache usage: 0.0%
...

In my case, vLLM keeps generating tokens, but they are all empty strings, until the max_tokens limit is reached.

Has anyone encountered a similar situation when deploying Yi or other LLMs?

I have updated vLLM to the latest version.

vllm==0.2.6
torch==2.1.2
cuda=12.1
Arcmoon-Hu commented 10 months ago

Hello, you can try setting stop_token_ids=[2, 6, 7, 8].
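
For reference, vLLM's OpenAI-compatible server accepts stop_token_ids as an extra field in the request body; with the OpenAI Python client it can be passed through extra_body. A minimal sketch of this suggestion (the base URL and API key are placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")  # placeholder URL/key

resp = client.chat.completions.create(
    model="Yi-34B-Chat-4bits-GPTQ",
    messages=[{"role": "user", "content": "你好"}],
    # vLLM-specific sampling fields go through extra_body:
    extra_body={"stop_token_ids": [2, 6, 7, 8]},
)
print(resp.choices[0].message.content)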

zhanghx0905 commented 10 months ago

Hello, you can try setting stop_token_ids=[2, 6, 7, 8].

Thank you very much for your reply. I tried it, but it didn't work.

Is this problem caused by GPTQ or V100?

INFO 01-02 08:23:26 async_llm_engine.py:379] Received request 33e9a84e-a948-11ee-acaf-0242ac110013: prompt: '你的使命是协助公司员工高效完成工作。<|im_end|>\nassistant\n"啦啦啦" 是汉语中表示开心、愉快或者轻松愉快心情的象声词,类似于英文中的 "hehe" 或 "teehee"。通常用于轻松、友好的对话中,表达一种轻松愉快的情绪。如果你有什么问题或者需要帮助, feel free to ask!<|im_end|>\nuser\nhehehe<|im_end|>\nassistant\n"hehehe" 是英文中表示开心、愉快或者调皮的笑声文字表达,类似于汉语中的 "hehe"。这种笑声文字表达通常用于轻松、友好的对话中,表达一种轻松愉快的情绪。如果你有什么问题或者需要帮助, feel free to ask!我会尽力帮助你。<|im_end|>\n<|im_start|>user\n你好吗<|im_end|>\n<|im_start|>assistant\n', sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['<|endoftext|>', '<|im_start|>', '<|im_end|>', '<|im_sep|>'], stop_token_ids=[2, 6, 7, 8], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2000, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: None.

INFO 01-02 08:23:27 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%
INFO 01-02 08:23:32 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%
INFO 01-02 08:23:37 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.1%, CPU KV cache usage: 0.0%
...
zhanghx0905 commented 10 months ago

I believe this issue is related to the cache, as when I repeat a question, it's more likely to get stuck.

exceedzhang commented 9 months ago

I have also encountered similar situations with Yi-34B-Chat and Yi-34B-Chat-AWQ.

KelleyYin commented 9 months ago

I have also encountered similar situations with Yi-34B-Chat and Yi-34B-Chat-AWQ.

Did you solve this issue?

QwertyJack commented 8 months ago

Hello, you can try setting stop_token_ids=[2, 6, 7, 8].

For me it does not work. The beginning of the answer is normal, but then it repeats itself over and over again.

input

from openai import OpenAI

# Client pointing at the vLLM OpenAI-compatible server (base_url and api_key are placeholders).
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')

res = client.chat.completions.create(
    model='Yi-34B-Chat-AWQ',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': '你好'},
    ],
    temperature=0,
    stop=["</s>", "<|im_start|>", "<|im_end|>", "<|im_sep|>"],
)

print(res.choices[0].message.content)

vllm log

INFO 03-15 10:25:32 async_llm_engine.py:436] Received request cmpl-f05ccf970b224fce84ee552738f7f423: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好<|im_end|>\n<|im_start|>assistant\n', prefix_pos: None, sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>', '<|im_start|>', '<|im_end|>', '<|im_sep|>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4074, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [6, 1328, 144, 3961, 678, 562, 6901, 14135, 98, 7, 59568, 144, 6, 2942, 144, 25902, 7, 59568, 144, 6, 14135, 144], lora_request: None.
INFO 03-15 10:25:34 metrics.py:213] Avg prompt throughput: 4.4 tokens/s, Avg generation throughput: 10.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.6%, CPU KV cache usage: 0.0%
INFO 03-15 10:25:39 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 31.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.8%, CPU KV cache usage: 0.0%
INFO 03-15 10:25:44 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.7%, CPU KV cache usage: 0.0%
INFO 03-15 10:25:49 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 10.9%, CPU KV cache usage: 0.0%
INFO 03-15 10:25:54 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 14.1%, CPU KV cache usage: 0.0%
INFO 03-15 10:25:59 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 17.0%, CPU KV cache usage: 0.0%
INFO 03-15 10:26:04 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 20.2%, CPU KV cache usage: 0.0%
INFO 03-15 10:26:09 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 23.1%, CPU KV cache usage: 0.0%
INFO 03-15 10:26:14 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 26.3%, CPU KV cache usage: 0.0%
INFO 03-15 10:26:19 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 29.2%, CPU KV cache usage: 0.0%
INFO 03-15 10:26:24 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 32.1%, CPU KV cache usage: 0.0%
INFO 03-15 10:26:29 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 29.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 35.3%, CPU KV cache usage: 0.0%
...
INFO 03-15 10:27:50 async_llm_engine.py:110] Finished request cmpl-f05ccf970b224fce84ee552738f7f423.
INFO:     10.204.237.30:33894 - "POST /v1/chat/completions HTTP/1.1" 200 OK

output

Hello! It looks like you may have accidentally sent an empty reply. If you have any questions or need help, feel free to let me know and I will do my best to help you.

If you just wanted to test the system, or wanted an empty reply, that is completely fine too. Just tell me what you need and I will do my best to meet it.

If you have any other questions or need help with anything, feel free to ask.

Thank you!

Have a pleasant day!

If you have any questions or need help, feel free to let me know and I will do my best to help you.

If you just wanted to test the system, or wanted an empty reply, that is completely fine too. Just tell me what you need and I will do my best to meet it.

Thank you! Have a pleasant day!

If you have any questions or need help, feel free to let me know and I will do my best to help you.

If you just wanted to test the system, or wanted an empty reply, that is completely fine too. Just tell me what you need and I will do my best to meet it.

Thank you! Have a pleasant day!

If you have any questions or need help, feel free to let me know and I will do my best to help you.

If you just wanted to test the system, or wanted an empty reply, that is completely fine too. Just tell me what you need and I will do my best to meet it.

Thank you! Have a pleasant day!

If you have any questions or need help, feel free to let me know and I will do my best to help you.

If you just wanted to test the system, or wanted an empty reply, that is completely fine too. Just tell me what you need and I will do my best to meet it.

...

If you just wanted to test the system, or wanted an empty reply, that is completely fine too. Just tell me what you need and I will do my best to meet it.

Thank you! Have a pleasant day!

If you have any questions or need help, feel free to let me know,
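
One mitigation that is sometimes tried for this kind of verbatim looping is a mild repetition or frequency penalty on top of the stop strings. A sketch only, not a confirmed fix for this issue; the penalty values are illustrative and the base URL/key are placeholders:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder URL/key

res = client.chat.completions.create(
    model="Yi-34B-Chat-AWQ",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "你好"},
    ],
    temperature=0,
    frequency_penalty=0.3,                   # illustrative value; discourages exact repeats
    stop=["</s>", "<|im_start|>", "<|im_end|>", "<|im_sep|>"],
    extra_body={"repetition_penalty": 1.1},  # vLLM-specific knob, also illustrative
)
print(res.choices[0].message.content)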
QwertyJack commented 8 months ago

env:

vllm==0.3.3
torch==2.1.2
cuda=12.1

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!