vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Automatic Prefix Caching Bug #3193

Closed · 78 closed this issue 7 months ago

78 commented 7 months ago

If I enable automatic prefix caching, the server occasionally crashes with the error below.
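
For context, prefix caching is turned on through the engine option. Below is a minimal sketch of the equivalent offline setup; the model name and prompts are placeholders, and the actual crash was hit through the OpenAI-compatible server, presumably launched with the --enable-prefix-caching flag.

# Minimal sketch: automatic prefix caching enabled on the offline LLM API.
# The model name and prompts are placeholders, not the setup from this report.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",       # placeholder model
    enable_prefix_caching=True,      # automatic prefix caching on
)
outputs = llm.generate(
    ["Shared system prompt. Question one.",
     "Shared system prompt. Question two."],
    SamplingParams(max_tokens=16),
)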

Future exception was never retrieved
future: <Future finished exception=RuntimeError('step must be nonzero')>
Traceback (most recent call last):
File "/root/vllm/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
task.result()
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 412, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 391, in engine_step
    request_outputs = await self.engine.step_async()
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 189, in step_async
    all_outputs = await self._run_workers_async(
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 274, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
  File "/root/miniconda3/envs/vllm/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/vllm/vllm/worker/worker.py", line 223, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/vllm/vllm/worker/model_runner.py", line 575, in execute_model
    lora_mapping) = self.prepare_input_tensors(seq_group_metadata_list)
  File "/root/vllm/vllm/worker/model_runner.py", line 494, in prepare_input_tensors
    lora_requests) = self._prepare_prompt(seq_group_metadata_list)
File "/root/vllm/vllm/worker/model_runner.py", line 243, in _prepare_prompt
start_loc_tensor = torch.arange(0,
RuntimeError: step must be nonzero

Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f87f65c35b0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f87ec4e3fd0>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f87f65c35b0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f87ec4e3fd0>)>
Traceback (most recent call last):
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
    task.result()
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 412, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 391, in engine_step
    request_outputs = await self.engine.step_async()
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 189, in step_async
    all_outputs = await self._run_workers_async(
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 274, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
  File "/root/miniconda3/envs/vllm/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
ray.exceptions.RayTaskError(KeyError): ray::RayWorkerVllm.execute_method() (pid=1030270, ip=0.0.0.0, actor_id=be1ed7b0fca5fd6227e71c0101000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7f5f2f9ad630>)
  File "/root/vllm/vllm/engine/ray_utils.py", line 37, in execute_method
    return executor(*args, **kwargs)
  File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/vllm/vllm/worker/worker.py", line 212, in execute_model
    num_seq_groups = data["num_seq_groups"]
KeyError: 'num_seq_groups'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    raise exc
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 33, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 03-04 20:37:48 async_llm_engine.py:133] Aborted request cmpl-7edf10b340a74b3e8c7c2e07325ae5c6.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/responses.py", line 264, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/responses.py", line 260, in wrap
    await func()
  File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/responses.py", line 237, in listen_for_disconnect
    message = await receive()
  File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 580, in receive
    await self.message_event.wait()
  File "/root/miniconda3/envs/vllm/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f87bc0d52d0

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
  |     return await self.app(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
  |     raise exc
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
  |     await self.app(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
  |     raise exc
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 758, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 778, in app
  |     await route.handle(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 299, in handle
  |     await self.app(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 79, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
  |     raise exc
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
  |     await response(scope, receive, send)
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/responses.py", line 257, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
  |     raise BaseExceptionGroup(
  | exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
    |     task.result()
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 412, in run_engine_loop
    |     has_requests_in_progress = await self.engine_step()
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 391, in engine_step
    |     request_outputs = await self.engine.step_async()
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 189, in step_async
    |     all_outputs = await self._run_workers_async(
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 274, in _run_workers_async
    |     all_outputs = await asyncio.gather(*coros)
    |   File "/root/miniconda3/envs/vllm/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    |     return (yield from awaitable.__await__())
    | ray.exceptions.RayTaskError(KeyError): ray::RayWorkerVllm.execute_method() (pid=1030270, ip=0.0.0.0, actor_id=be1ed7b0fca5fd6227e71c0101000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7f5f2f9ad630>)
    |   File "/root/vllm/vllm/engine/ray_utils.py", line 37, in execute_method
    |     return executor(*args, **kwargs)
    |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    |     return func(*args, **kwargs)
    |   File "/root/vllm/vllm/worker/worker.py", line 212, in execute_model
    |     num_seq_groups = data["num_seq_groups"]
    | KeyError: 'num_seq_groups'
    |
    | The above exception was the direct cause of the following exception:
    |
    | Traceback (most recent call last):
    |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/responses.py", line 260, in wrap
    |     await func()
    |   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/responses.py", line 249, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/root/vllm/vllm/entrypoints/openai/serving_chat.py", line 148, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 565, in generate
    |     raise e
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 559, in generate
    |     async for request_output in stream:
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 69, in __anext__
    |     raise result
    |   File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    |     raise exc
    |   File "/root/vllm/vllm/engine/async_llm_engine.py", line 33, in _raise_exception_on_finish
    |     raise AsyncEngineDeadError(
    | vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
    +------------------------------------

vLLM: main branch
Model: openbuddy-deepseek-67b-v18.1-4k-gptq (Marlin Kernel)
GPU: 4 x RTX 3090
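
For what it's worth, the first RuntimeError comes from the torch.arange call in _prepare_prompt being handed a step of 0, which torch rejects outright; my reading (not confirmed) is that the maximum prompt length computed there drops to 0 once every token of a prompt is served from the prefix cache. The later KeyError: 'num_seq_groups' on the Ray workers looks like a secondary failure after the driver worker dies. A minimal sketch of the torch behaviour, mirroring the call pattern in the traceback:

# Sketch only: torch.arange(start, end, step) raises if step == 0.
# max_prompt_len == 0 is an assumed value for a fully prefix-cached prompt.
import torch

prompt_lens = [16, 16]   # hypothetical prompt lengths
max_prompt_len = 0       # nothing left to prefill

torch.arange(0, len(prompt_lens) * max_prompt_len, max_prompt_len)
# RuntimeError: step must be nonzero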

ywang96 commented 7 months ago

I can confirm that a similar issue happens for me as well when automatic prefix caching is enabled.

Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f19b986c0d0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f19af5db4f0>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f19b986c0d0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f19af5db4f0>)>
Traceback (most recent call last):
  File "/workspace/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
    task.result()
  File "/workspace/vllm/engine/async_llm_engine.py", line 412, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/workspace/vllm/engine/async_llm_engine.py", line 391, in engine_step
    request_outputs = await self.engine.step_async()
  File "/workspace/vllm/engine/async_llm_engine.py", line 189, in step_async
    all_outputs = await self._run_workers_async(
  File "/workspace/vllm/engine/async_llm_engine.py", line 274, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/worker/worker.py", line 223, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/worker/model_runner.py", line 575, in execute_model
    lora_mapping) = self.prepare_input_tensors(seq_group_metadata_list)
File "/workspace/vllm/worker/model_runner.py", line 494, in prepare_input_tensors
INFO 03-05 09:16:34 async_llm_engine.py:133] Aborted request cmpl-49aa25f0dba24ec7b00d8ae6a0a102ad.
    lora_requests) = self._prepare_prompt(seq_group_metadata_list)
  File "/workspace/vllm/worker/model_runner.py", line 243, in _prepare_prompt
    start_loc_tensor = torch.arange(0,
RuntimeError: step must be nonzero

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/workspace/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    raise exc
  File "/workspace/vllm/engine/async_llm_engine.py", line 33, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.

Model: mistralai/Mixtral-8x7B-Instruct-v0.1, GPU: 2 x A100-80G, CUDA graph enabled.

robertgshaw2-neuralmagic commented 7 months ago

Note: @78

Model: openbuddy-deepseek-67b-v18.1-4k-gptq (Marlin Kernel)

This model is not using the Marlin kernel.

@SageMoore is going to take a look.