vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

When the API is called without a system message, the output gets stuck and is all !!!!! #5490

Open shujun1992 opened 3 months ago

shujun1992 commented 3 months ago

Your current environment

Launch command: python -m vllm.entrypoints.openai.api_server --model /opt/llm_models/Qwen1.5-32B-Chat-GPTQ-Int4 --quantization gptq --max-model-len 16384 --port 8888 --gpu-memory-utilization 0.7 --tensor-parallel-size 2 --host 0.0.0.0 --served-model-name Qwen1.5-32B-Chat-GPTQ-Int4 --trust-remote-code --enforce-eager --engine-use-ray --worker-use-ray

Two A10 GPUs; the problem occurs on every version from vLLM 0.4.0.post1 through vLLM 0.5.0.

🐛 Describe the bug

When the API is called without a system message, generation gets stuck and the user-side log shows nothing but !!!!!
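
For reference, a minimal sketch of the kind of request that triggers the problem, using the openai Python client against the server started above (the base URL and served model name are taken from the launch command, the prompt from the log below; this reproduction script is an assumption, not part of the original report):

# Hypothetical reproduction; port 8888 and the served model name
# are assumed from the launch command above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8888/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen1.5-32B-Chat-GPTQ-Int4",
    messages=[
        # Note: no {"role": "system", ...} entry is sent.
        {"role": "user", "content": "你好啊,你能干什么,11111111111111111111"},
    ],
    temperature=0.1,
)
print(resp.choices[0].message.content)  # observed output: a long run of "!"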

The vLLM log output is as follows:

INFO 06-13 07:32:54 async_llm_engine.py:561] Received request cmpl-ce462c327b014a60936becb6d1bb06ab: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n\n你好啊,你能干什么,11111111111111111111<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.1, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16339, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 271, 108386, 103924, 3837, 107809, 108209, 3837, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 151645, 198, 151644, 77091, 198], lora_request: None.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:32:54 metrics.py:341] Avg prompt throughput: 8.6 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:32:59 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:04 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.9%, CPU KV cache usage: 0.0%.
INFO: 127.0.0.1:35176 - "GET /metrics HTTP/1.1" 200 OK
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:09 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.3%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.6%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:19 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.0%, CPU KV cache usage: 0.0%.
INFO: 127.0.0.1:35220 - "GET /metrics HTTP/1.1" 200 OK
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:24 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.3%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:29 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.7%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:34 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.0%, CPU KV cache usage: 0.0%.
INFO: 127.0.0.1:35364 - "GET /metrics HTTP/1.1" 200 OK
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:39 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.4%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=15164) INFO 06-13 07:33:44 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.8%, CPU KV cache usage: 0.0%.

QwertyJack commented 3 months ago

Open the Hugging Face page for qwen1.5-32b-gptq and note the warning at the very top.

QwertyJack commented 3 months ago

As it says:

Qwen1.5-32B-Chat-GPTQ-Int4

Warning:

🚨 Please do not deploy this model with vLLM temporarily. Instead we advise you to use the AWQ model.

Link: https://huggingface.co/Qwen/Qwen1.5-32B-Chat-GPTQ-Int4
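
Following that advice, a sketch of the adjusted launch command for the AWQ build (assuming the Qwen/Qwen1.5-32B-Chat-AWQ checkpoint has been downloaded to a local path; all other flags are kept from the original command above):

python -m vllm.entrypoints.openai.api_server --model /opt/llm_models/Qwen1.5-32B-Chat-AWQ --quantization awq --max-model-len 16384 --port 8888 --gpu-memory-utilization 0.7 --tensor-parallel-size 2 --host 0.0.0.0 --served-model-name Qwen1.5-32B-Chat-AWQ --trust-remote-code --enforce-eager --engine-use-ray --worker-use-ray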