Closed: shimizust closed this issue 4 weeks ago
This also happens for the offline LLM entrypoint:
>>> from vllm import LLM
>>> model = LLM("gpt2")
>>> model.generate("")
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s][rank0]: Traceback (most recent call last):
[rank0]: File "<stdin>", line 1, in <module>
[rank0]: File "/home/mgoin/code/vllm/vllm/utils.py", line 996, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/home/mgoin/code/vllm/vllm/entrypoints/llm.py", line 339, in generate
[rank0]: outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]: File "/home/mgoin/code/vllm/vllm/entrypoints/llm.py", line 620, in _run_engine
[rank0]: step_outputs = self.llm_engine.step()
[rank0]: File "/home/mgoin/code/vllm/vllm/engine/llm_engine.py", line 1287, in step
[rank0]: 0].schedule()
[rank0]: File "/home/mgoin/code/vllm/vllm/core/scheduler.py", line 963, in schedule
[rank0]: scheduler_outputs = self._schedule()
[rank0]: File "/home/mgoin/code/vllm/vllm/core/scheduler.py", line 938, in _schedule
[rank0]: return self._schedule_default()
[rank0]: File "/home/mgoin/code/vllm/vllm/core/scheduler.py", line 798, in _schedule_default
[rank0]: prefills = self._schedule_prefills(budget,
[rank0]: File "/home/mgoin/code/vllm/vllm/core/scheduler.py", line 696, in _schedule_prefills
[rank0]: num_new_tokens = self._get_num_new_tokens(seq_group,
[rank0]: File "/home/mgoin/code/vllm/vllm/core/scheduler.py", line 1234, in _get_num_new_tokens
[rank0]: assert num_new_tokens > 0
[rank0]: AssertionError
Just curious: I thought an LLM always starts with a start-of-sentence (BOS) token? What does an empty prompt mean in that case?
@youkaichao this depends on the tokenizer. I just tested Llama 3.1 8B Instruct and it doesn't have this issue because it has a BOS token:
>>> from vllm import LLM
>>> model = LLM("meta-llama/Meta-Llama-3.1-8B-Instruct")
>>> model.generate("")
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 7.37it/s, est. speed input: 7.38 toks/s, output: 118.09 toks/s]
[RequestOutput(request_id=0, prompt='', prompt_token_ids=[128000], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='Cloud bathroom mirror is designed to bring a fresh perspective to bathroom essentials. Its fog', token_ids=(16440, 15197, 18327, 374, 6319, 311, 4546, 264, 7878, 13356, 311, 15197, 59886, 13, 11699, 31349), cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1723074368.5084124, last_token_time=1723074368.5084124, first_scheduled_time=1723074368.5245223, first_token_time=1723074368.5398314, time_in_queue=0.016109943389892578, finished_time=1723074368.6597202), lora_request=None)]
You can see that the prompt is empty but the prompt token IDs are not: prompt='', prompt_token_ids=[128000].
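For reference, the same difference shows up if you run the tokenizers directly. A quick check with Hugging Face transformers, assuming the transformers tokenizers match what vLLM loads for these models:

from transformers import AutoTokenizer

# GPT-2's tokenizer adds no special tokens, so an empty prompt encodes to zero tokens.
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
print(gpt2_tok("").input_ids)   # []

# Llama 3.1 Instruct prepends its BOS token, so even an empty prompt yields one token.
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
print(llama_tok("").input_ids)  # [128000]

The zero-token case is exactly what trips the assert num_new_tokens > 0 in the scheduler.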
Either way, I think we should return an empty response or otherwise follow what OpenAI does for an empty prompt. Crashing the LLM or the server is not good behavior.
Crashing the LLM or the server is not good behavior
Agreed, we should never let a user request crash the engine.
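One way to do that is to validate the tokenized prompt before it reaches the scheduler and reject the request instead of tripping the assert. A minimal sketch, not the actual patch; the function name and where it would be called from are illustrative:

def validate_prompt_token_ids(prompt_token_ids: list) -> None:
    # If the tokenizer adds no BOS token, an empty prompt encodes to zero
    # tokens and the scheduler later dies on `assert num_new_tokens > 0`.
    # Raising a ValueError here lets the entrypoint surface a client error
    # instead of crashing the engine.
    if not prompt_token_ids:
        raise ValueError(
            "The prompt is empty after tokenization; provide a non-empty "
            "prompt or use a tokenizer that adds a BOS token.")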
Good catch that this depends on the tokenizer. The models I tested do not have the BOS token defined in tokenizer_config.json.
Same as this: https://github.com/vllm-project/vllm/issues/7632
Thanks for the ping, closing as resolved.
Your current environment
🐛 Describe the bug
Spin up the vLLM server in a pod using the vLLM base image (vllm/vllm-openai:v0.5.3.post1), where $MODEL_PATH points to some model. I've tried gpt2-medium and Meta-Llama-3-8B.
Generation works fine, but if you pass in an empty prompt, it immediately kills the server and is unrecoverable.
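For reference, this is the kind of request that triggers it. A minimal repro sketch against the OpenAI-compatible completions endpoint; the port and model name are assumptions about how the server was started:

import requests

# Assumes the server is listening on localhost:8000 and serving gpt2-medium.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "gpt2-medium", "prompt": "", "max_tokens": 16},
)
print(resp.status_code)  # 500, and the server process goes down with it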
Expected Behavior
If an empty prompt is not allowed, I would expect a 400 invalid-input response rather than a 500 that stops the server.
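Since the OpenAI-compatible server is built on FastAPI, an input-validation failure could be mapped to a 400 at the API layer. A rough, self-contained sketch only; the handler and tokenizer wiring below are illustrative, not vLLM's actual code:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the served model's tokenizer


class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 16


@app.post("/v1/completions")
async def create_completion(req: CompletionRequest):
    token_ids = tokenizer(req.prompt).input_ids
    if not token_ids:
        # Invalid client input: return 400 instead of letting the engine assert and die.
        raise HTTPException(
            status_code=400,
            detail="Prompt encodes to zero tokens; provide a non-empty prompt.")
    # ...hand the validated token_ids off to the engine here...
    return {"prompt_token_ids": token_ids}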