Your current environment

Using the latest vLLM off of `main`.

🐛 Describe the bug

When running the online server with a model that has an MLP speculator, sending a request that requests prompt logprobs causes the server to crash with an `AssertionError`.
Stacktrace:
```
Traceback (most recent call last):
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 125, in generate
    async for request_output in results_generator:
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 1054, in generate
    async for output in await self.add_request(
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 114, in generator
    raise result
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 920, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 863, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 332, in step_async
    output = await self.model_executor.execute_model_async(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/executor/gpu_executor.py", line 170, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/spec_decode/spec_decode_worker.py", line 387, in execute_model
    return self._run_no_spec(execute_model_req,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/spec_decode/spec_decode_worker.py", line 481, in _run_no_spec
    self.previous_hidden_states.update(
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/sequence.py", line 1199, in update
    assert len(seq_group_metadata_list) == len(hidden_states)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
```
To Reproduce

Run a server with an MLP speculator, e.g. one of IBM's Granite models:
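A minimal sketch of such a launch, assuming the granite-3b-code-instruct base model and its published MLP speculator; the model names, port, and flags are placeholders to adapt to your setup:

```bash
# Placeholder model/speculator pair; any base model with a matching MLP speculator
# should exercise the same code path in spec_decode_worker.py.
python -m vllm.entrypoints.openai.api_server \
    --model ibm-granite/granite-3b-code-instruct \
    --speculative-model ibm-granite/granite-3b-code-instruct-accelerator \
    --use-v2-block-manager \
    --port 8000
```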
Send an `echo` request with logprobs requested for the prompt tokens:
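For example (prompt text is arbitrary; port and model name assume the launch sketch above), a completions request with `echo` and `logprobs` set, which asks the server to return logprobs for the prompt tokens:

```bash
# "echo": true together with "logprobs" requests logprobs for the prompt tokens,
# which triggers the AssertionError in the traceback above.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "ibm-granite/granite-3b-code-instruct",
        "prompt": "The quick brown fox",
        "max_tokens": 1,
        "echo": true,
        "logprobs": 1
    }'
```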