[Open] vestal-doublekuan opened this issue 3 months ago
cc @ywang96 @DarkLight1337
It looks like the request runs for many tokens and is then aborted.
cc @Isotr0py
@vestal-doublekuan Can you include the command that is used to launch the server?
@vestal-doublekuan Please make sure you set stop tokens in the request parameters; otherwise the model will keep generating until it hits the max context length. Feel free to look at the example here: https://github.com/vllm-project/vllm/pull/6514#issuecomment-2258716685
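As a minimal sketch of what that can look like with the OpenAI-compatible client (assumptions: the server runs at the default http://localhost:8000, and token id 92542 is `<|im_end|>`, inferred from the prompt_token_ids in the log below):

```python
# Sketch only: stop the request at the ChatML turn terminator instead of letting it
# run to the context limit. The stop token id is an assumption based on this thread.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-8B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
    max_tokens=512,  # bound the generation length as a safety net
    # vLLM-specific extra parameter: stop on the terminator token id directly.
    extra_body={"stop_token_ids": [92542]},
)
print(response.choices[0].message.content)
```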
The reason is that the model repo itself has an issue with this, and I don't think vLLM should be responsible for dealing with it. If we do decide to handle this within vLLM, we would need to somehow add the two tokens to the default stop tokens.
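In the meantime, the token ids can be resolved per request rather than baked into vLLM. A sketch, assuming the two tokens in question are `<|im_end|>` and `<|end|>` (an assumption based on this thread, not confirmed here):

```python
# Sketch: resolve the terminator strings to token ids with the HF tokenizer so they
# can be passed along with each request.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "OpenGVLab/InternVL2-8B", trust_remote_code=True
)
stop_token_ids = [
    tokenizer.convert_tokens_to_ids(tok) for tok in ("<|im_end|>", "<|end|>")
]
print(stop_token_ids)
# These can then be sent with each chat completion, e.g.
#   client.chat.completions.create(..., extra_body={"stop_token_ids": stop_token_ids})
```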
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
How would you like to use vllm
I used this command to start the server: vllm serve OpenGVLab/InternVL2-8B --trust_remote_code --chat-template examples/template_chatml.jinja
request:
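The exact request payload isn't included here; the sketch below is a reconstruction from the prompt and SamplingParams in the log that follows (temperature=0.7, no stop strings, no max_tokens), not the code that was actually run:

```python
# Reconstruction (not the original client code): a chat completion with no stop
# strings, no stop_token_ids, and no max_tokens, matching the logged SamplingParams.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-8B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```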
I got this in the log and the request never finishes:

INFO 08-05 15:09:11 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:16 logger.py:36] Received request chat-79478648dc964a728b624eae407d9e74: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWho won the world series in 2020?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=None, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [92543, 9081, 364, 2770, 657, 395, 11100, 17993, 281, 92542, 364, 92543, 1008, 364, 15315, 2929, 410, 2028, 4169, 435, 262, 638, 638, 345, 92542, 364, 92543, 525, 11353, 364], lora_request: None, prompt_adapter_request: None.
INFO 08-05 15:09:16 async_llm_engine.py:173] Added request chat-79478648dc964a728b624eae407d9e74.
INFO 08-05 15:09:16 metrics.py:406] Avg prompt throughput: 6.0 tokens/s, Avg generation throughput: 0.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:21 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 77.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:26 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 76.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:31 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 75.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:36 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 75.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:41 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 75.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
INFO 08-05 15:09:41 async_llm_engine.py:180] Aborted request chat-79478648dc964a728b624eae407d9e74.