vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Requesting Prompt Logprobs with an MLP Speculator Crashes the Server #7742

Closed tjohnson31415 closed 1 month ago

tjohnson31415 commented 2 months ago

Your current environment

Using the latest vLLM off of main.

🐛 Describe the bug

When running the online server with a model that uses an MLP speculator, sending a request that asks for prompt logprobs causes the server to crash with an AssertionError.

Stacktrace:

Traceback (most recent call last):
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 125, in generate
    async for request_output in results_generator:
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 1054, in generate
    async for output in await self.add_request(
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 114, in generator
    raise result
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 920, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 863, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 332, in step_async
    output = await self.model_executor.execute_model_async(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/executor/gpu_executor.py", line 170, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/spec_decode/spec_decode_worker.py", line 387, in execute_model
    return self._run_no_spec(execute_model_req,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/spec_decode/spec_decode_worker.py", line 481, in _run_no_spec
    self.previous_hidden_states.update(
  File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/sequence.py", line 1199, in update
    assert len(seq_group_metadata_list) == len(hidden_states)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
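
For context, the failing assertion in vllm/sequence.py compares the number of sequence groups in the batch against the number of hidden-state rows passed to HiddenStates.update. A minimal, hypothetical illustration of the mismatch (the shapes and objects below are stand-ins, not actual vLLM internals): when prompt logprobs are requested, the prefill step appears to hand back one hidden-state row per prompt token rather than one per sequence group, so the two lengths disagree.

import torch

# Stand-in values, not real vLLM objects.
num_seq_groups = 1       # a single request in the batch
num_prompt_tokens = 2    # assume the prompt tokenizes to two tokens

seq_group_metadata_list = [object()] * num_seq_groups
hidden_states = torch.zeros(num_prompt_tokens, 4096)  # one row per prompt position

# The check that trips inside HiddenStates.update:
print(len(seq_group_metadata_list) == len(hidden_states))  # False -> AssertionError in vLLM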

To Reproduce

Run a server with an MLP speculator, e.g. one of IBM's Granite models:

vllm serve ibm-granite/granite-3b-code-instruct --speculative-model ibm-granite/granite-3b-code-instruct-accelerator --use-v2-block-manager --enforce-eager

Send a completion request with echo and logprobs enabled so that logprobs are returned for the prompt tokens:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "ibm-granite/granite-3b-code-instruct",
      "prompt": "Hello World",
      "echo": 1,
      "logprobs": 1,
      "temperature": 0
  }'
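
The same assertion should be reachable without the HTTP server as well, since the failure is inside SpecDecodeWorker rather than the OpenAI frontend. Below is an untested sketch of an offline reproduction with the LLM entrypoint; the arguments mirror the serve flags above, and prompt_logprobs plays the role of echo + logprobs:

from vllm import LLM, SamplingParams

# Untested sketch; mirrors the serve flags from the command above.
llm = LLM(
    model="ibm-granite/granite-3b-code-instruct",
    speculative_model="ibm-granite/granite-3b-code-instruct-accelerator",
    use_v2_block_manager=True,
    enforce_eager=True,
)

# prompt_logprobs asks for logprobs over the prompt tokens (echo + logprobs).
params = SamplingParams(temperature=0, prompt_logprobs=1)
outputs = llm.generate(["Hello World"], params)
print(outputs[0].prompt_logprobs)
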
tjohnson31415 commented 2 months ago

I wanted to create an issue to describe the crash and how to reproduce it. I am also investigating a fix and will push up a PR soon.
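
As a rough illustration of the kind of adjustment that could make the lengths agree (purely hypothetical, not the actual fix): before HiddenStates.update runs in _run_no_spec, the prefill hidden states could be reduced to the row for the last prompt token of each sequence group. The helper below is a sketch only; the real attribute names and index handling in vLLM may differ.

import torch

def last_token_hidden_states(hidden_states: torch.Tensor,
                             prompt_lens: list[int]) -> torch.Tensor:
    # Hypothetical helper: when prefill returns one row per prompt token,
    # keep only the row for the final token of each sequence group.
    last_indices = torch.tensor(prompt_lens).cumsum(dim=0) - 1
    return hidden_states[last_indices]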