vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: InternVL2 takes forever to generate a response #7148

Open vestal-doublekuan opened 3 months ago

vestal-doublekuan commented 3 months ago

Your current environment

PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1

How would you like to use vllm

I used this command to start the server:

vllm serve OpenGVLab/InternVL2-8B --trust_remote_code --chat-template examples/template_chatml.jinja

request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenGVLab/InternVL2-8B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'
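
The same request can also be reproduced with the OpenAI Python client pointed at the vLLM server. A minimal sketch, assuming the openai>=1.0 package and the server started with the command above on localhost:8000 (the api_key value is just a placeholder):

from openai import OpenAI

# Point the OpenAI client at the vLLM OpenAI-compatible server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-8B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
)
print(response.choices[0].message.content)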

I got this in the log and the request never finishes:

INFO 08-05 15:09:11 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:16 logger.py:36] Received request chat-79478648dc964a728b624eae407d9e74: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWho won the world series in 2020?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=None, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [92543, 9081, 364, 2770, 657, 395, 11100, 17993, 281, 92542, 364, 92543, 1008, 364, 15315, 2929, 410, 2028, 4169, 435, 262, 638, 638, 345, 92542, 364, 92543, 525, 11353, 364], lora_request: None, prompt_adapter_request: None.
INFO 08-05 15:09:16 async_llm_engine.py:173] Added request chat-79478648dc964a728b624eae407d9e74.
INFO 08-05 15:09:16 metrics.py:406] Avg prompt throughput: 6.0 tokens/s, Avg generation throughput: 0.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:21 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 77.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:26 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 76.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:31 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 75.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:36 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 75.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:41 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 75.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:41 async_llm_engine.py:180] Aborted request chat-79478648dc964a728b624eae407d9e74.

youkaichao commented 3 months ago

cc @ywang96 @DarkLight1337

Looks like the request runs for many tokens and is then aborted.

DarkLight1337 commented 3 months ago

cc @Isotr0py

DarkLight1337 commented 3 months ago

@vestal-doublekuan Can you include the command that is used to launch the server?

ywang96 commented 3 months ago

@vestal-doublekuan Please make sure you set stop tokens in the request parameters; otherwise the model will generate until it reaches the max context length. Feel free to look at the example here: https://github.com/vllm-project/vllm/pull/6514#issuecomment-2258716685
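
For illustration, a minimal sketch of the request from this issue with stop strings set, using the OpenAI Python client. The "<|im_end|>" string is taken from the ChatML prompt shown in the log above; the exact stop strings for InternVL2 should be checked against the linked comment, and max_tokens=256 is only an illustrative cap:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-8B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
    # Without stop tokens, the request in this issue keeps generating
    # until the max context length is reached.
    stop=["<|im_end|>"],  # assumed from the ChatML template in the log above
    max_tokens=256,       # illustrative cap, not required for the fix
)
print(response.choices[0].message.content)

The same fields can be added to the curl payload ("stop": ["<|im_end|>"], "max_tokens": 256), since they are standard chat-completions parameters.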

The reason is that the model repo itself has an issue with this, and I don't think vLLM should be responsible for dealing with such an issue. If we do decide to handle this within vLLM, we will need to somehow add the two tokens to the default stop tokens.
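
As a per-request workaround (rather than changing vLLM's defaults), stop token IDs can also be passed through the OpenAI client's extra_body, which vLLM accepts as extra sampling parameters. A sketch under assumptions: 92542 is inferred to be the <|im_end|> token id from the prompt_token_ids in the log above and should be verified against the model's tokenizer:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-8B",
    messages=[{"role": "user", "content": "Who won the world series in 2020?"}],
    # stop_token_ids is a vLLM-specific extra sampling parameter passed via extra_body.
    # 92542 is assumed to be <|im_end|> based on the prompt_token_ids in the log above.
    extra_body={"stop_token_ids": [92542]},
)
print(response.choices[0].message.content)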

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!