Can confirm, a similar issue happens for me as well when automatic prefix caching is enabled:
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f19b986c0d0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f19af5db4f0>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f19b986c0d0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f19af5db4f0>)>
Traceback (most recent call last):
File "/workspace/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
task.result()
File "/workspace/vllm/engine/async_llm_engine.py", line 412, in run_engine_loop
has_requests_in_progress = await self.engine_step()
File "/workspace/vllm/engine/async_llm_engine.py", line 391, in engine_step
request_outputs = await self.engine.step_async()
File "/workspace/vllm/engine/async_llm_engine.py", line 189, in step_async
all_outputs = await self._run_workers_async(
File "/workspace/vllm/engine/async_llm_engine.py", line 274, in _run_workers_async
all_outputs = await asyncio.gather(*coros)
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/vllm/worker/worker.py", line 223, in execute_model
output = self.model_runner.execute_model(seq_group_metadata_list,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/vllm/worker/model_runner.py", line 575, in execute_model
lora_mapping) = self.prepare_input_tensors(seq_group_metadata_list)
File "/workspace/vllm/worker/model_runner.py", line 494, in prepare_input_tensors
INFO 03-05 09:16:34 async_llm_engine.py:133] Aborted request cmpl-49aa25f0dba24ec7b00d8ae6a0a102ad.
lora_requests) = self._prepare_prompt(seq_group_metadata_list)
File "/workspace/vllm/worker/model_runner.py", line 243, in _prepare_prompt
start_loc_tensor = torch.arange(0,
RuntimeError: step must be nonzero
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/workspace/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
raise exc
File "/workspace/vllm/engine/async_llm_engine.py", line 33, in _raise_exception_on_finish
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
Model: mistralai/Mixtral-8x7B-Instruct-v0.1, 2x A100-80G, CUDA graph enabled.
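The last frames show torch.arange being called with a step of 0 inside _prepare_prompt; torch.arange raises exactly this "step must be nonzero" error whenever its step argument is zero. One plausible way that happens (an assumption, not confirmed from the trace alone) is the number of prompt tokens left to compute collapsing to zero once prefix caching marks the whole prompt as already cached. A minimal sketch of the failing call shape, with hypothetical variable names:

```python
import torch

# Hypothetical illustration: if the longest "uncached" prompt length
# collapses to 0, the arange step becomes 0 and PyTorch raises the same
# "step must be nonzero" RuntimeError seen in the traceback above.
max_prompt_len = 0   # assumed value when every prompt token is already cached
batch_size = 4       # arbitrary

try:
    start_loc = torch.arange(0,
                             batch_size * max_prompt_len,
                             max_prompt_len,
                             dtype=torch.long)
except RuntimeError as err:
    print(err)  # step must be nonzero
```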
Note (re: "Model: openbuddy-deepseek-67b-v18.1-4k-gptq (Marlin Kernel)"): this model is not using the Marlin kernel.
@SageMoore going to take a look
If I enable automatic prefix caching, it occasionally crashes.
vLLM: main branch
Model: openbuddy-deepseek-67b-v18.1-4k-gptq (Marlin Kernel)
GPU: 4 x RTX 3090
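For anyone trying to reproduce, here is a rough sketch of how automatic prefix caching gets exercised from the offline LLM API (the model path, tensor-parallel size, and prompts below are placeholders, not the exact setup from this report):

```python
from vllm import LLM, SamplingParams

# Sketch only: placeholder model path and settings, not a confirmed repro.
llm = LLM(
    model="openbuddy-deepseek-67b-v18.1-4k-gptq",  # placeholder local path
    tensor_parallel_size=4,                        # e.g. 4 x RTX 3090
    enable_prefix_caching=True,                    # automatic prefix caching on
)

# Prompts sharing a long common prefix make later requests hit the prefix cache.
shared = "You are a helpful assistant. " * 100
prompts = [shared + q for q in ("Summarize vLLM.", "Summarize vLLM again.")]

outputs = llm.generate(prompts, SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```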