Open · thomZ1 opened this issue 4 months ago
ERROR 07-10 11:08:14 async_llm_engine.py:483] Engine iteration timed out. This should never happen!
ERROR 07-10 11:08:14 async_llm_engine.py:43] Engine background task failed
ERROR 07-10 11:08:14 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 07-10 11:08:14 async_llm_engine.py:43]   File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 454, in engine_step
ERROR 07-10 11:08:14 async_llm_engine.py:43]     request_outputs = await self.engine.step_async()
ERROR 07-10 11:08:14 async_llm_engine.py:43]   File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
ERROR 07-10 11:08:14 async_llm_engine.py:43]     output = await self.model_executor.execute_model_async(
ERROR 07-10 11:08:14 async_llm_engine.py:43]   File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 418, in execute_model_async
ERROR 07-10 11:08:14 async_llm_engine.py:43]     all_outputs = await self._run_workers_async(
ERROR 07-10 11:08:14 async_llm_engine.py:43]   File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 408, in _run_workers_async
ERROR 07-10 11:08:14 async_llm_engine.py:43]     all_outputs = await asyncio.gather(*coros)
ERROR 07-10 11:08:14 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 07-10 11:08:14 async_llm_engine.py:43]
ERROR 07-10 11:08:14 async_llm_engine.py:43] During handling of the above exception, another exception occurred:
ERROR 07-10 11:08:14 async_llm_engine.py:43]
ERROR 07-10 11:08:14 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 07-10 11:08:14 async_llm_engine.py:43]   File "/home/ray/anaconda3/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
ERROR 07-10 11:08:14 async_llm_engine.py:43]     return fut.result()
ERROR 07-10 11:08:14 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 07-10 11:08:14 async_llm_engine.py:43]
ERROR 07-10 11:08:14 async_llm_engine.py:43] The above exception was the direct cause of the following exception:
ERROR 07-10 11:08:14 async_llm_engine.py:43]
ERROR 07-10 11:08:14 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 07-10 11:08:14 async_llm_engine.py:43]   File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 07-10 11:08:14 async_llm_engine.py:43]     task.result()
ERROR 07-10 11:08:14 async_llm_engine.py:43]   File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 480, in run_engine_loop
ERROR 07-10 11:08:14 async_llm_engine.py:43]     has_requests_in_progress = await asyncio.wait_for(
ERROR 07-10 11:08:14 async_llm_engine.py:43]   File "/home/ray/anaconda3/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
ERROR 07-10 11:08:14 async_llm_engine.py:43]     raise exceptions.TimeoutError() from exc
ERROR 07-10 11:08:14 async_llm_engine.py:43] asyncio.exceptions.TimeoutError
2024-07-10 11:08:14,573 - asyncio - ERROR - Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7fcf686829e0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fd15e5da680>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7fcf686829e0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fd15e5da680>>)>
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 454, in engine_step
    request_outputs = await self.engine.step_async()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
    output = await self.model_executor.execute_model_async(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 418, in execute_model_async
    all_outputs = await self._run_workers_async(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 408, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 480, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/home/ray/anaconda3/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
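For context, the traceback above is the engine watchdog firing: run_engine_loop bounds each engine step with asyncio.wait_for, and once a step hangs (for example a stuck collective op on one tensor-parallel worker), the timeout kills the background task and every subsequent request fails with AsyncEngineDeadError. The sketch below is a stripped-down illustration of that pattern only; all names and the timeout value are hypothetical, not vLLM internals.

```python
# Stripped-down illustration of the watchdog pattern visible in the traceback:
# every iteration of the background loop is bounded by asyncio.wait_for, and a
# timeout is converted into a "dead engine" error. Names here are hypothetical.
import asyncio

ITERATION_TIMEOUT_S = 1.0  # the real engine uses a much larger value


class EngineDeadError(RuntimeError):
    """Stand-in for vLLM's AsyncEngineDeadError."""


async def engine_step(i: int) -> None:
    # Simulate one healthy step, then a step that hangs (e.g. a stuck
    # collective operation on one worker).
    await asyncio.sleep(0.1 if i == 0 else 10.0)


async def run_engine_loop() -> None:
    for i in range(2):
        try:
            await asyncio.wait_for(engine_step(i), timeout=ITERATION_TIMEOUT_S)
            print(f"step {i} finished")
        except asyncio.TimeoutError as exc:
            # Once this fires, the real engine marks itself dead and all
            # queued requests fail instead of ever completing.
            raise EngineDeadError("engine iteration timed out") from exc


try:
    asyncio.run(run_engine_loop())
except EngineDeadError as err:
    print(f"background loop died: {err} (caused by {type(err.__cause__).__name__})")
```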
+1
+1
+1
We have a tracking issue (#5901) for this. Please provide more details there so we can better troubleshoot the underlying cause.
When I used glm4-9b-int8 and qwen2-72b-int4, I ran into this problem too.
Me too. Once "Engine iteration timed out. This should never happen!" occurs, the server never responds again.
We encountered the same error ("Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered... Engine iteration timed out. This should never happen!") multiple times across v0.4.3, v0.5.5, and v0.6.1.post2, specifically on an A800-80G * 4 setup with tp=4.
In v0.4.3, using --disable-custom-all-reduce resolved the issue. However, in v0.5.5 and v0.6.1.post2, this flag no longer works.
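In case it helps anyone reproduce this outside the API server, here is a minimal offline sketch of that configuration. The model name is a placeholder, and `tensor_parallel_size` / `disable_custom_all_reduce` are the engine arguments that the `--tensor-parallel-size` / `--disable-custom-all-reduce` server flags map to, assuming your vLLM version exposes them under these names:

```python
# Minimal offline sketch of the tp=4 setup with custom all-reduce disabled.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-72B-Instruct-GPTQ-Int4",  # placeholder; use whatever you serve
    tensor_parallel_size=4,                     # matches the A800-80G * 4 setup
    disable_custom_all_reduce=True,             # the workaround that helped on v0.4.3
)
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```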
As mentioned in the discussion at https://github.com/vllm-project/vllm/issues/8230, @Sekri0 highlighted that --enable-prefix-caching could cause the CUDA illegal memory access error, with the traceback pointing to FlashAttention as a potential source. Although pull requests #7018 and #7142 appeared to address this issue, it persists in vLLM 0.5.5. Given that Flash Attention directly manages and accesses GPU memory, this low-level control could indeed increase the likelihood of illegal memory access errors, especially if Flash Attention is not configured properly.
Based on these clues, we decided to uninstall Flash Attention and switch to xformers. After doing so, we successfully processed thousands of requests with tp=4, without any errors. We also tested aborting hundreds of requests abruptly and resending them in loops. The server remained robust throughout. While the speed is slightly slower compared to Flash Attention (a difference of a few hundred milliseconds), the system stability has significantly improved.
If you're facing similar issues, uninstalling Flash Attention and using xformers instead may help.
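If uninstalling flash-attn is not an option in your environment, a lighter-weight variant of the same workaround may be to pin the attention backend instead. This is only a sketch, under the assumption that your vLLM version honors the VLLM_ATTENTION_BACKEND environment variable and the enable_prefix_caching engine argument; it forces xformers and leaves prefix caching off:

```python
import os

# Pin the attention backend before the engine is constructed; assumes your
# vLLM build reads the VLLM_ATTENTION_BACKEND environment variable.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-72B-Instruct-GPTQ-Int4",  # placeholder; use your served model
    tensor_parallel_size=4,
    enable_prefix_caching=False,  # keep the suspect feature off (the default)
)
print(llm.generate(["ping"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```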
Your current environment
🐛 Describe the bug
I am using eval-scope to test the concurrent throughput of the Qwen2 72B Instruct model deployed with vLLM. When running 8 concurrent sessions with 8k input tokens and 2k output tokens for a period of time, the vLLM service becomes inaccessible.
https://github.com/modelscope/eval-scope/tree/main:
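For anyone who wants to approximate this load pattern without eval-scope, the sketch below runs 8 concurrent workers that loop long-completion requests against the OpenAI-compatible endpoint until a request fails. The URL, model name, prompt length, and timeout are assumptions; adjust them to your deployment.

```python
# Rough stand-in for the eval-scope run: 8 concurrent sessions, long prompts,
# max_tokens=2048, looping until the server stops answering.
import asyncio
import aiohttp

URL = "http://localhost:8000/v1/completions"   # assumed vLLM OpenAI-compatible server
MODEL = "Qwen/Qwen2-72B-Instruct"              # assumed served model name
PROMPT = "repeat this sentence. " * 1600       # crude prompt in the ~8k-token range


async def worker(session: aiohttp.ClientSession, idx: int) -> None:
    while True:  # loop until a request fails (hang, timeout, or engine-dead error)
        payload = {"model": MODEL, "prompt": PROMPT, "max_tokens": 2048}
        try:
            async with session.post(
                URL, json=payload, timeout=aiohttp.ClientTimeout(total=600)
            ) as resp:
                await resp.json()
                print(f"worker {idx}: ok ({resp.status})")
        except Exception as exc:
            print(f"worker {idx}: request failed: {exc!r}")
            return


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(worker(session, i) for i in range(8)))


asyncio.run(main())
```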