zhanghx0905 opened this issue 10 months ago
INFO 01-02 07:18:44 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.0%, CPU KV cache usage: 0.0%
INFO 01-02 07:18:49 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.4%, CPU KV cache usage: 0.0%
INFO 01-02 07:18:54 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.8%, CPU KV cache usage: 0.0%
INFO 01-02 07:18:59 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 24.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.1%, CPU KV cache usage: 0.0%
INFO 01-02 07:19:04 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.4%, CPU KV cache usage: 0.0%
INFO 01-02 07:19:09 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.7%, CPU KV cache usage: 0.0%
INFO 01-02 07:19:14 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.0%, CPU KV cache usage: 0.0%
INFO 01-02 07:19:19 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.3%, CPU KV cache usage: 0.0%
...
In my case, vLLM keeps generating tokens, but they are all empty strings, until the max-length limit is reached.
Has anyone encountered a similar situation when deploying Yi or other LLMs?
I have updated vLLM to the latest version.
vllm==0.2.6
torch==2.1.2
cuda=12.1
Hello, you can try setting stop_token_ids=[2, 6, 7, 8].
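For reference, a minimal sketch of that suggestion using vLLM's offline API (not from the thread; the model path and prompt are placeholders, and ids 2, 6, 7, 8 are assumed to correspond to the stop strings used elsewhere in this issue):

from vllm import LLM, SamplingParams

# Placeholder model path; substitute the checkpoint actually being served.
llm = LLM(model="01-ai/Yi-34B-Chat")

params = SamplingParams(
    temperature=0.6,
    max_tokens=2000,
    stop=["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|im_sep|>"],
    stop_token_ids=[2, 6, 7, 8],  # assumed ids of the stop strings above
)

outputs = llm.generate(["你好"], params)
print(outputs[0].outputs[0].text)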
Thank you very much for your reply. I tried it, but it didn't work.
Is this problem caused by GPTQ or V100?
INFO 01-02 08:23:26 async_llm_engine.py:379] Received request 33e9a84e-a948-11ee-acaf-0242ac110013: prompt: '你的使命是协助公司员工高效完成工作。<|im_end|>\nassistant\n"啦啦啦" 是汉语中表示开心、愉快或者轻松愉快心情的象声词,类似于英文中的 "hehe" 或 "teehee"。通常用于轻松、友好的对话中,表达一种轻松愉快的情绪。如果你有什么问题或者需要帮助, feel free to ask!<|im_end|>\nuser\nhehehe<|im_end|>\nassistant\n"hehehe" 是英文中表示开心、愉快或者调皮的笑声文字表达,类似于汉语中的 "hehe"。这种笑声文字表达通常用于轻松、友好的对话中,表达一种轻松愉快的情绪。如果你有什么问题或者需要帮助, feel free to ask!我会尽力帮助你。<|im_end|>\n<|im_start|>user\n你好吗<|im_end|>\n<|im_start|>assistant\n', sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['<|endoftext|>', '<|im_start|>', '<|im_end|>', '<|im_sep|>'], stop_token_ids=[2, 6, 7, 8], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2000, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: None.
INFO 01-02 08:23:27 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%
INFO 01-02 08:23:32 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%
INFO 01-02 08:23:37 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.1%, CPU KV cache usage: 0.0%
...
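One way to double-check that ids 2, 6, 7, 8 actually match the special tokens of the deployed checkpoint is to query the tokenizer directly. A diagnostic sketch (not from the thread; the model path is a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-34B-Chat")  # placeholder path

# Print the id of each stop string and the tokenizer's configured EOS id.
for token in ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|im_sep|>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
print("eos_token_id:", tokenizer.eos_token_id)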
I believe this issue is related to the cache: when I repeat a question, it is more likely to get stuck.
I have also encountered similar situations with Yi-34B-Chat and Yi-34B-Chat-AWQ.
Did you solve this issue?
Hello, you can try setting stop_token_ids=[2, 6, 7, 8].
For me it does not work. At the beginning the answer is normal, but then it repeats itself over and over again.
input
from openai import OpenAI

# Client setup was omitted in the original post; assuming an OpenAI-compatible
# client pointed at the vLLM server (base_url and api_key are placeholders).
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')

res = client.chat.completions.create(
    model='Yi-34B-Chat-AWQ',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': '你好'},
    ],
    temperature=0,
    stop=["</s>", "<|im_start|>", "<|im_end|>", "<|im_sep|>"],
)
print(res.choices[0].message.content)
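One thing worth noting in the server log below: stop_token_ids is empty, since stop token ids are not a standard OpenAI parameter and are not forwarded by the client. If the server build accepts vLLM's extra sampling fields (an assumption here), they can be passed through extra_body, e.g.:

res = client.chat.completions.create(
    model='Yi-34B-Chat-AWQ',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': '你好'},
    ],
    temperature=0,
    stop=["</s>", "<|im_start|>", "<|im_end|>", "<|im_sep|>"],
    extra_body={'stop_token_ids': [2, 6, 7, 8]},  # assumed ids; verify against the tokenizer
)
print(res.choices[0].message.content)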
vllm log
INFO 03-15 10:25:32 async_llm_engine.py:436] Received request cmpl-f05ccf970b224fce84ee552738f7f423: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好<|im_end|>\n<|im_start|>assistant\n', prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>', '<|im_start|>', '<|im_end|>', '<|im_sep|>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4074, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [6, 1328, 144, 3961, 678, 562, 6901, 14135, 98, 7, 59568, 144, 6, 2942, 144, 25902, 7, 59568, 144, 6, 14135, 144], lora_request: None.
INFO 03-15 10:25:34 metrics.py:213] Avg prompt throughput: 4.4 tokens/s, Avg generation throughput: 10.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.6%, CPU KV cache usage: 0.0%
INFO 03-15 10:25:39 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 31.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.8%, CPU KV cache usage: 0.0%
INFO 03-15 10:25:44 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.7%, CPU KV cache usage: 0.0%
INFO 03-15 10:25:49 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 10.9%, CPU KV cache usage: 0.0%
INFO 03-15 10:25:54 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 14.1%, CPU KV cache usage: 0.0%
INFO 03-15 10:25:59 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 17.0%, CPU KV cache usage: 0.0%
INFO 03-15 10:26:04 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 20.2%, CPU KV cache usage: 0.0%
INFO 03-15 10:26:09 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 23.1%, CPU KV cache usage: 0.0%
INFO 03-15 10:26:14 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 26.3%, CPU KV cache usage: 0.0%
INFO 03-15 10:26:19 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 29.2%, CPU KV cache usage: 0.0%
INFO 03-15 10:26:24 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 32.1%, CPU KV cache usage: 0.0%
INFO 03-15 10:26:29 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 29.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 35.3%, CPU KV cache usage: 0.0%
...
INFO 03-15 10:27:50 async_llm_engine.py:110] Finished request cmpl-f05ccf970b224fce84ee552738f7f423.
INFO: 10.204.237.30:33894 - "POST /v1/chat/completions HTTP/1.1" 200 OK
output
你好!看起来你可能不小心发送了一个空的回复。如果你有任何问题或需要帮助,请随时告诉我,我会尽力帮助你。
如果你只是想测试系统,或者想要一个空的回复,那也完全没问题。请随时告诉我你的需求,我会尽力满足。
如果你有任何其他问题或需要帮助的地方,请随时提问。
谢谢!
祝你有个愉快的一天!
如果你有任何问题或需要帮助,请随时告诉我,我会尽力帮助你。
如果你只是想测试系统,或者想要一个空的回复,那也完全没问题。请随时告诉我你的需求,我会尽力满足。
谢谢!祝你有个愉快的一天!
如果你有任何问题或需要帮助,请随时告诉我,我会尽力帮助你。
如果你只是想测试系统,或者想要一个空的回复,那也完全没问题。请随时告诉我你的需求,我会尽力满足。
谢谢!祝你有个愉快的一天!
如果你有任何问题或需要帮助,请随时告诉我,我会尽力帮助你。
如果你只是想测试系统,或者想要一个空的回复,那也完全没问题。请随时告诉我你的需求,我会尽力满足。
谢谢!祝你有个愉快的一天!
如果你有任何问题或需要帮助,请随时告诉我,我会尽力帮助你。
如果你只是想测试系统,或者想要一个空的回复,那也完全没问题。请随时告诉我你的需求,我会尽力满足。
...
如果你只是想测试系统,或者想要一个空的回复,那也完全没问题。请随时告诉我你的需求,我会尽力满足。
谢谢!祝你有个愉快的一天!
如果你有任何问题或需要帮助,请随时告诉我,
env:
vllm==0.3.3 torch==2.1.2 cuda=12.1
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Running on a V100 32GB, the startup script is:
This occurs after multiple rounds of dialogue: the server continuously outputs empty tokens ("") and does not stop normally.
Here is an instance of the request that caused the issue.
Once it gets stuck, the server keeps outputting spaces until it eventually stops; after that, any subsequent requests sent to it get stuck in the same way.