vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

When the API is called without a system message, the output gets stuck and is all !!!!! #5490

Open shujun1992 opened 3 months ago

shujun1992 commented 3 months ago

Your current environment

Launch command: python -m vllm.entrypoints.openai.api_server --model /opt/llm_models/Qwen1.5-32B-Chat-GPTQ-Int4 --quantization gptq --max-model-len 16384 --port 8888 --gpu-memory-utilization 0.7 --tensor-parallel-size 2 --host 0.0.0.0 --served-model-name Qwen1.5-32B-Chat-GPTQ-Int4 --trust-remote-code --enforce-eager --engine-use-ray --worker-use-ray

Two A10 GPUs; the problem occurs on every version from vLLM 0.4.0.post1 through vLLM 0.5.0.

🐛 Describe the bug

When the API is called without a system message, generation gets stuck and the user-side log shows nothing but !!!!!
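
For reference, a minimal sketch of the kind of request that triggers the problem, using the openai Python client against the server started above (the base URL and served model name are taken from the launch command, the prompt from the log below; this reproduction script is an assumption, not part of the original report):

# Hypothetical reproduction; port 8888 and the served model name
# are assumed from the launch command above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8888/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen1.5-32B-Chat-GPTQ-Int4",
    messages=[
        # Note: no {"role": "system", ...} entry is sent.
        {"role": "user", "content": "你好啊,你能干什么,11111111111111111111"},
    ],
    temperature=0.1,
)
print(resp.choices[0].message.content)  # observed output: a long run of "!"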

The vLLM log output is as follows:

INFO 06-13 07:32:54 async_llm_engine.py:561] Received request cmpl-ce462c327b014a60936becb6d1bb06ab: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n\n你好啊,你能干什么,11111111111111111111<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.1, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16339, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 271, 108386, 103924, 3837, 107809, 108209, 3837, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 151645, 198, 151644, 77091, 198], lora_request: None.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:32:54 metrics.py:341] Avg prompt throughput: 8.6 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:32:59 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:04 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.9%, CPU KV cache usage: 0.0%.
INFO: 127.0.0.1:35176 - "GET /metrics HTTP/1.1" 200 OK
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:09 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.3%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.6%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:19 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.0%, CPU KV cache usage: 0.0%.
INFO: 127.0.0.1:35220 - "GET /metrics HTTP/1.1" 200 OK
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:24 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.3%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:29 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.7%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:34 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.0%, CPU KV cache usage: 0.0%.
INFO: 127.0.0.1:35364 - "GET /metrics HTTP/1.1" 200 OK
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:39 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.4%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:44 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.8%, CPU KV cache usage: 0.0%.

QwertyJack commented 3 months ago

Open the Hugging Face page for qwen1.5-32b-gptq and note the warning at the very top.

QwertyJack commented 3 months ago

As it says:

Qwen1.5-32B-Chat-GPTQ-Int4

Warning:

🚨 Please do not deploy this model with vLLM temporarily. Instead we advise you to use the AWQ model.

Link: https://huggingface.co/Qwen/Qwen1.5-32B-Chat-GPTQ-Int4
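
Following that advice, a sketch of the adjusted launch command for the AWQ build (assuming the Qwen/Qwen1.5-32B-Chat-AWQ checkpoint has been downloaded to a local path; all other flags are kept from the original command above):

python -m vllm.entrypoints.openai.api_server --model /opt/llm_models/Qwen1.5-32B-Chat-AWQ --quantization awq --max-model-len 16384 --port 8888 --gpu-memory-utilization 0.7 --tensor-parallel-size 2 --host 0.0.0.0 --served-model-name Qwen1.5-32B-Chat-AWQ --trust-remote-code --enforce-eager --engine-use-ray --worker-use-ray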