vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Internvl2 takes forever to generate response #7148

Open vestal-doublekuan opened 1 month ago

vestal-doublekuan commented 1 month ago

Your current environment

PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1

How would you like to use vllm

I used this command to start the server:

vllm serve OpenGVLab/InternVL2-8B --trust_remote_code --chat-template examples/template_chatml.jinja

request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenGVLab/InternVL2-8B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'

I got this in the log and the request never finishes:

INFO 08-05 15:09:11 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:16 logger.py:36] Received request chat-79478648dc964a728b624eae407d9e74: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWho won the world series in 2020?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=None, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [92543, 9081, 364, 2770, 657, 395, 11100, 17993, 281, 92542, 364, 92543, 1008, 364, 15315, 2929, 410, 2028, 4169, 435, 262, 638, 638, 345, 92542, 364, 92543, 525, 11353, 364], lora_request: None, prompt_adapter_request: None.
INFO 08-05 15:09:16 async_llm_engine.py:173] Added request chat-79478648dc964a728b624eae407d9e74.
INFO 08-05 15:09:16 metrics.py:406] Avg prompt throughput: 6.0 tokens/s, Avg generation throughput: 0.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:21 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 77.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:26 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 76.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:31 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 75.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:36 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 75.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:41 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 75.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:41 async_llm_engine.py:180] Aborted request chat-79478648dc964a728b624eae407d9e74.

youkaichao commented 1 month ago

cc @ywang96 @DarkLight1337

Looks like the request runs for many tokens and is then aborted.

DarkLight1337 commented 1 month ago

cc @Isotr0py

DarkLight1337 commented 1 month ago

@vestal-doublekuan Can you include the command you used to launch the server?

ywang96 commented 1 month ago

@vestal-doublekuan Please make sure you set stop tokens in the request parameters; otherwise, the model will keep generating until it hits the max context length. Feel free to look at the example here: https://github.com/vllm-project/vllm/pull/6514#issuecomment-2258716685
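
For illustration, here is a minimal sketch of the same request with explicit stop parameters. The stop string "<|im_end|>" is an assumption based on the ChatML-style prompt shown in the log above; the exact stop strings or stop token IDs for InternVL2 are the ones given in the linked PR comment. The max_tokens value is just a safety cap and can be chosen freely.

# Same request as before, but with an explicit stop string and a max_tokens cap
# so generation ends at the end-of-turn marker instead of the context limit.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenGVLab/InternVL2-8B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    "stop": ["<|im_end|>"],
    "max_tokens": 512
  }'

With stop tokens set, the request should return promptly rather than appearing to hang while the model generates toward the maximum context length.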

The reason is that the model repo itself has an issue with this, and I don't think vLLM should be responsible for dealing with such an issue. If we do decide to handle this within vLLM, then we will need to somehow add the two tokens to the default stop tokens.