vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Crash with num-scheduler-steps > 1 and response_format type json object #8985

Closed warlock135 closed 1 month ago

warlock135 commented 2 months ago

Your current environment

vLLM container: v0.6.2 (vllm/vllm-openai:v0.6.2)
Models: Llama-3-70B-Instruct, Llama-3-8B-Instruct, Qwen2.5-32B-Instruct
GPUs: A100, A30

Model Input Dumps

No response

πŸ› Describe the bug

When running with --num-scheduler-steps 8 and sending a request with "response_format": { "type": "json_object" }, vLLM raises an error and then crashes. The error log:

Compiling FSM index for all state transitions: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:02<00:00,  1.14it/s]
INFO 10-01 02:39:08 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
ERROR 10-01 02:39:08 engine.py:157] AssertionError('Logits Processors are not supported in multi-step decoding')
ERROR 10-01 02:39:08 engine.py:157] Traceback (most recent call last):
ERROR 10-01 02:39:08 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 155, in start
ERROR 10-01 02:39:08 engine.py:157]     self.run_engine_loop()
ERROR 10-01 02:39:08 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 218, in run_engine_loop
ERROR 10-01 02:39:08 engine.py:157]     request_outputs = self.engine_step()
ERROR 10-01 02:39:08 engine.py:157]                       ^^^^^^^^^^^^^^^^^^
ERROR 10-01 02:39:08 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 236, in engine_step
ERROR 10-01 02:39:08 engine.py:157]     raise e
ERROR 10-01 02:39:08 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 227, in engine_step
ERROR 10-01 02:39:08 engine.py:157]     return self.engine.step()
ERROR 10-01 02:39:08 engine.py:157]            ^^^^^^^^^^^^^^^^^^
ERROR 10-01 02:39:08 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1264, in step
ERROR 10-01 02:39:08 engine.py:157]     outputs = self.model_executor.execute_model(
ERROR 10-01 02:39:08 engine.py:157]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-01 02:39:08 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 130, in execute_model
ERROR 10-01 02:39:08 engine.py:157]     output = self.driver_worker.execute_model(execute_model_req)
ERROR 10-01 02:39:08 engine.py:157]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-01 02:39:08 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 10-01 02:39:08 engine.py:157]     output = self.model_runner.execute_model(
ERROR 10-01 02:39:08 engine.py:157]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-01 02:39:08 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-01 02:39:08 engine.py:157]     return func(*args, **kwargs)
ERROR 10-01 02:39:08 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 10-01 02:39:08 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/multi_step_model_runner.py", line 458, in execute_model
ERROR 10-01 02:39:08 engine.py:157]     outputs = self._final_process_outputs(model_input,
ERROR 10-01 02:39:08 engine.py:157]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-01 02:39:08 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/multi_step_model_runner.py", line 312, in _final_process_outputs
ERROR 10-01 02:39:08 engine.py:157]     output.pythonize(model_input, self._copy_stream,
ERROR 10-01 02:39:08 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/multi_step_model_runner.py", line 87, in pythonize
ERROR 10-01 02:39:08 engine.py:157]     self._pythonize_sampler_output(input_metadata, copy_stream,
ERROR 10-01 02:39:08 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/multi_step_model_runner.py", line 117, in _pythonize_sampler_output
ERROR 10-01 02:39:08 engine.py:157]     _pythonize_sampler_output(input_metadata, self.sampler_output,
ERROR 10-01 02:39:08 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/multi_step_model_runner.py", line 687, in _pythonize_sampler_output
ERROR 10-01 02:39:08 engine.py:157]     assert len(seq_group.sampling_params.logits_processors) == 0, (
ERROR 10-01 02:39:08 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-01 02:39:08 engine.py:157] AssertionError: Logits Processors are not supported in multi-step decoding
INFO:     172.16.22.167:36690 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 73, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 328, in create_completion
    generator = await completion(raw_request).create_completion(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 187, in create_completion
    async for i, res in result_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 490, in merge_async_iterators
    item = await d
           ^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 486, in _process_request
    raise request_output
AssertionError: Logits Processors are not supported in multi-step decoding
ERROR 10-01 02:39:16 client.py:244] TimeoutError('No heartbeat received from MQLLMEngine')
ERROR 10-01 02:39:16 client.py:244] NoneType: None
CRITICAL 10-01 02:41:25 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO:     172.16.22.167:40072 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1]

Changing the response_format type to text, or removing --num-scheduler-steps, makes everything work fine.
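For reference, a minimal request of the shape that triggers the crash when the server runs with --num-scheduler-steps 8 (the endpoint and model name below are placeholders for this environment):

```python
import requests

# Placeholders; adjust to the actual deployment.
BASE_URL = "http://localhost:8000"
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"

payload = {
    "model": MODEL,
    "prompt": "Return a JSON object with a single key 'answer'.",
    "max_tokens": 64,
    # json_object makes vLLM attach a guided-decoding logits processor,
    # which the multi-step model runner rejects with the assertion above.
    "response_format": {"type": "json_object"},
}

resp = requests.post(f"{BASE_URL}/v1/completions", json=payload)
print(resp.status_code)  # 500, and the engine process dies shortly after
```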


devonthomas35 commented 1 month ago

Hit this same exception

hqzhon commented 1 month ago

Hit this same exception

joerunde commented 1 month ago

The docs here show that guided decoding and multi-step don't work together yet: https://docs.vllm.ai/en/latest/serving/compatibility_matrix.html

warlock135 commented 1 month ago

> The docs here show that guided decoding and multi-step don't work together yet: https://docs.vllm.ai/en/latest/serving/compatibility_matrix.html

I think the appropriate behavior here would be to respond with an HTTP status code other than 200 (e.g., 400, 500, 501) and keep operating, rather than hitting an assertion that crashes the engine. The current behavior makes it unsafe to expose the service directly to clients with multi-step enabled, since a single request can bring it down.
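To illustrate the suggestion (this is not vLLM's actual code, just a sketch of the desired behavior), the unsupported combination could be rejected at the API layer before it ever reaches the engine:

```python
from typing import Optional

from fastapi import HTTPException


def check_multi_step_compat(response_format: Optional[dict],
                            num_scheduler_steps: int) -> None:
    """Hypothetical pre-flight check: return HTTP 400 for requests that
    need guided decoding while multi-step scheduling is enabled, instead
    of letting the engine hit the assertion and die."""
    wants_json = (response_format is not None
                  and response_format.get("type") == "json_object")
    if wants_json and num_scheduler_steps > 1:
        raise HTTPException(
            status_code=400,
            detail="response_format json_object requires guided decoding, "
                   "which is not supported with --num-scheduler-steps > 1",
        )
```

That way a bad request only fails its own call, and the server keeps serving other traffic.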

joerunde commented 1 month ago

Ah yeah, I agree with that too; it's a common problem in vLLM. Maybe I can find some time to make this return a 400 instead.