Looks like the same issue as https://github.com/vllm-project/vllm/issues/4135; it emerged after 0.4.0.
Having the same error with Mixtral-8x7B-Instruct-v0.1-GPTQ and tensor_parallel_size=2
INFO 04-25 09:50:28 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-25 09:50:28 api_server.py:150] args: Namespace(host=None, port=8001, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', max_model_len=5000, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.8, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization='gptq', enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 04-25 09:50:29 config.py:767] Casting torch.bfloat16 to torch.float16.
WARNING 04-25 09:50:29 config.py:211] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-25 09:51:01 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=5000, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 04-25 09:51:20 selector.py:40] Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO 04-25 09:51:20 selector.py:25] Using XFormers backend.
(RayWorkerVllm pid=233183) INFO 04-25 09:51:21 selector.py:40] Cannot use FlashAttention backend for Volta and Turing GPUs.
(RayWorkerVllm pid=233183) INFO 04-25 09:51:21 selector.py:25] Using XFormers backend.
INFO 04-25 09:51:33 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=233183) INFO 04-25 09:51:33 pynccl_utils.py:45] vLLM is using nccl==2.18.1
INFO 04-25 09:52:18 custom_all_reduce.py:137] NVLink detection failed with message "Not Supported". This is normal if your machine has no NVLink equipped
(RayWorkerVllm pid=233183) INFO 04-25 09:52:18 custom_all_reduce.py:137] NVLink detection failed with message "Not Supported". This is normal if your machine has no NVLink equipped
INFO 04-25 09:52:23 weight_utils.py:177] Using model weights format ['*.safetensors']
(RayWorkerVllm pid=233183) INFO 04-25 09:52:24 weight_utils.py:177] Using model weights format ['*.safetensors']
INFO 04-25 09:53:12 model_runner.py:104] Loading model weights took 11.0906 GB
I am awake
time mem processes process usage
(secs) (MB) tot actv (sorted, %CPU)
(RayWorkerVllm pid=233183) INFO 04-25 09:53:12 model_runner.py:104] Loading model weights took 11.0906 GB
INFO 04-25 09:54:42 ray_gpu_executor.py:240] # GPU blocks: 12524, # CPU blocks: 4096
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x2aad73314ae0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x2aad836e63d0>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x2aad73314ae0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x2aad836e63d0>>)>
Traceback (most recent call last):
File "/shared/ucl/apps/python/3.11.3/gnu-4.9.2/lib/python3.11/asyncio/tasks.py", line 490, in wait_for
return fut.result()
^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 454, in engine_step
request_outputs = await self.engine.step_async()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
output = await self.model_executor.execute_model_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 418, in execute_model_async
all_outputs = await self._run_workers_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 408, in _run_workers_async
all_outputs = await asyncio.gather(*coros)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
task.result()
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 480, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
^^^^^^^^^^^^^^^^^^^^^^^
File "/shared/ucl/apps/python/3.11.3/gnu-4.9.2/lib/python3.11/asyncio/tasks.py", line 492, in wait_for
raise exceptions.TimeoutError() from exc
TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/shared/ucl/apps/python/3.11.3/gnu-4.9.2/lib/python3.11/asyncio/tasks.py", line 490, in wait_for
return fut.result()
^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 454, in engine_step
request_outputs = await self.engine.step_async()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
output = await self.model_executor.execute_model_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 418, in execute_model_async
all_outputs = await self._run_workers_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 408, in _run_workers_async
all_outputs = await asyncio.gather(*coros)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 103, in create_completion
generator = await openai_serving_completion.create_completion(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/entrypoints/openai/serving_completion.py", line 178, in create_completion
async for i, res in result_generator:
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/entrypoints/openai/serving_completion.py", line 81, in consumer
raise item
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/entrypoints/openai/serving_completion.py", line 66, in producer
async for item in iterator:
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 644, in generate
raise e
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 638, in generate
async for request_output in stream:
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 77, in __anext__
raise result
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
task.result()
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 480, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
^^^^^^^^^^^^^^^^^^^^^^^
File "/shared/ucl/apps/python/3.11.3/gnu-4.9.2/lib/python3.11/asyncio/tasks.py", line 492, in wait_for
raise exceptions.TimeoutError() from exc
TimeoutError
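For readers following the traceback above: the exception chain (CancelledError → TimeoutError → AsyncEngineDeadError) comes from the engine's background-loop watchdog. run_engine_loop wraps each engine step in asyncio.wait_for, so a step that never returns (for example, a collective op hanging between the two tensor-parallel workers) gets cancelled, surfaces as a timeout, and the task's done-callback then marks the engine dead for every queued request. A minimal, simplified sketch of that control flow (not vLLM's actual code; the names follow the traceback and the timeout value is illustrative):

```python
import asyncio

ENGINE_ITERATION_TIMEOUT_S = 5  # illustrative; not vLLM's actual constant


class AsyncEngineDeadError(RuntimeError):
    pass


async def engine_step() -> bool:
    # Stand-in for AsyncLLMEngine.engine_step(); a hung NCCL collective in a
    # tensor-parallel worker would make this await never complete.
    await asyncio.sleep(3600)
    return True


async def run_engine_loop() -> None:
    # Mirrors the pattern in the traceback: each iteration is bounded by
    # asyncio.wait_for, which raises TimeoutError if the step stalls.
    while True:
        has_requests_in_progress = await asyncio.wait_for(
            engine_step(), timeout=ENGINE_ITERATION_TIMEOUT_S)
        if not has_requests_in_progress:
            break


def _raise_exception_on_finish(task: asyncio.Task) -> None:
    # Done-callback on the background task; once it fires, every queued
    # request fails with the same error.
    try:
        task.result()
    except asyncio.TimeoutError as exc:
        raise AsyncEngineDeadError("Task finished unexpectedly.") from exc


async def main() -> None:
    task = asyncio.create_task(run_engine_loop())
    task.add_done_callback(_raise_exception_on_finish)
    # The event loop reports the callback's exception as
    # "Exception in callback functools.partial(...)", as seen in the log.
    await asyncio.sleep(ENGINE_ITERATION_TIMEOUT_S + 1)


if __name__ == "__main__":
    asyncio.run(main())
```

Once this fires, the server process usually stays up, but every subsequent request fails the same way, which matches the repeated ASGI exceptions above.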
Could you share the specifications of the GPU you were using when you hit this issue? Was it also an A800-80G?
@ericzhou571 @JPonsa @blackblue9 @supdizh Could you try --disable-custom-all-reduce when you launch the server and see if this issue persists?
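For reference, with the Mixtral setup reported above that would be something like the following (a reconstruction from the logged arguments, not the reporter's exact command; adjust paths and sizes to your environment):
python -m vllm.entrypoints.openai.api_server --port 8001 --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --quantization gptq --dtype half --max-model-len 5000 --gpu-memory-utilization 0.8 --enforce-eager --tensor-parallel-size 2 --disable-custom-all-reduce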
@ywang96 the issue persists when launching the server with --disable-custom-all-reduce
I encountered a similar issue in version 0.4.2.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
Your current environment
🐛 Describe the bug
I'm launching the server via python -m vllm.entrypoints.openai.api_server --port 7801 --host 0.0.0.0 --model /mnt/model/llama3-70B-instruct --served-model-name vllm_llama3_70B_instruct --tensor-parallel-size 4 --trust-remote-code. After deploying llama3 we started running concurrent performance tests against the model (a sketch of that kind of load generator is included after the log below). The model responded normally at first, but about 10 minutes into the test vLLM reported the following error (vLLM 0.4.1, installed from source):
ERROR 04-23 16:19:04 async_llm_engine.py:499] Engine iteration timed out. This should never happen!
ERROR 04-23 16:19:04 async_llm_engine.py:43] Engine background task failed
ERROR 04-23 16:19:04 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-23 16:19:04 async_llm_engine.py:43] File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 470, in engine_step
ERROR 04-23 16:19:04 async_llm_engine.py:43] request_outputs = await self.engine.step_async()
ERROR 04-23 16:19:04 async_llm_engine.py:43] File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
ERROR 04-23 16:19:04 async_llm_engine.py:43] output = await self.model_executor.execute_model_async(
ERROR 04-23 16:19:04 async_llm_engine.py:43] File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 424, in execute_model_async
ERROR 04-23 16:19:04 async_llm_engine.py:43] all_outputs = await self._run_workers_async(
ERROR 04-23 16:19:04 async_llm_engine.py:43] File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 414, in _run_workers_async
ERROR 04-23 16:19:04 async_llm_engine.py:43] all_outputs = await asyncio.gather(*coros)
ERROR 04-23 16:19:04 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 04-23 16:19:04 async_llm_engine.py:43]
ERROR 04-23 16:19:04 async_llm_engine.py:43] During handling of the above exception, another exception occurred:
ERROR 04-23 16:19:04 async_llm_engine.py:43]
ERROR 04-23 16:19:04 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-23 16:19:04 async_llm_engine.py:43] File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
ERROR 04-23 16:19:04 async_llm_engine.py:43] return fut.result()
ERROR 04-23 16:19:04 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 04-23 16:19:04 async_llm_engine.py:43]
ERROR 04-23 16:19:04 async_llm_engine.py:43] The above exception was the direct cause of the following exception:
ERROR 04-23 16:19:04 async_llm_engine.py:43]
ERROR 04-23 16:19:04 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-23 16:19:04 async_llm_engine.py:43] File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 04-23 16:19:04 async_llm_engine.py:43] task.result()
ERROR 04-23 16:19:04 async_llm_engine.py:43] File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 496, in run_engine_loop
ERROR 04-23 16:19:04 async_llm_engine.py:43] has_requests_in_progress = await asyncio.wait_for(
ERROR 04-23 16:19:04 async_llm_engine.py:43] File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
ERROR 04-23 16:19:04 async_llm_engine.py:43] raise exceptions.TimeoutError() from exc
ERROR 04-23 16:19:04 async_llm_engine.py:43] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f4124ad08b0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f412c39bac0>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f4124ad08b0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f412c39bac0>>)>
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 470, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 424, in execute_model_async
all_outputs = await self._run_workers_async(
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 414, in _run_workers_async
all_outputs = await asyncio.gather(*coros)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
task.result()
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 496, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 04-23 16:19:04 async_llm_engine.py:154] Aborted request cmpl-cb9ae6d5b74b48a28f23d9f4c323a104.
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 470, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 424, in execute_model_async
all_outputs = await self._run_workers_async(
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 414, in _run_workers_async
all_outputs = await asyncio.gather(*coros)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
result = await app(  # type: ignore[func-returns-value]
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/starlette/routing.py", line 758, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/starlette/routing.py", line 778, in app
await route.handle(scope, receive, send)
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/starlette/routing.py", line 299, in handle
await self.app(scope, receive, send)
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/starlette/routing.py", line 79, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
response = await func(request)
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/fastapi/routing.py", line 299, in app
raise e
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/fastapi/routing.py", line 294, in app
raw_response = await run_endpoint_function(
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 89, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 95, in create_chat_completion
return await self.chat_completion_full_generator(
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 258, in chat_completion_full_generator
async for res in result_generator:
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 661, in generate
raise e
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 655, in generate
async for request_output in stream:
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 77, in __anext__
raise result
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
task.result()
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 496, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
INFO 04-23 16:19:04 async_llm_engine.py:154] Aborted request cmpl-ca47d0961c59407ab90792f513566cbd.
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 470, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 424, in execute_model_async
all_outputs = await self._run_workers_async(
File "/usr/local/miniconda3/envs/vllm_llama3/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 414, in _run_workers_async
all_outputs = await asyncio.gather(*coros)
asyncio.exceptions.CancelledError
How should I solve this problem?
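For anyone trying to reproduce this: the failure shows up under sustained concurrent traffic against the OpenAI-compatible endpoint. Below is a minimal sketch of such a load generator (an illustration only, not the test harness used here; it assumes the aiohttp package and the server command above, so the port and served model name may need adjusting):

```python
import asyncio
import aiohttp

URL = "http://localhost:7801/v1/chat/completions"
PAYLOAD = {
    "model": "vllm_llama3_70B_instruct",
    "messages": [{"role": "user", "content": "Summarize what NCCL does in three sentences."}],
    "max_tokens": 128,
}


async def one_request(session: aiohttp.ClientSession) -> int:
    # Each call mirrors a single chat completion request from the benchmark.
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.text()
        return resp.status


async def main(concurrency: int = 32, rounds: int = 200) -> None:
    async with aiohttp.ClientSession() as session:
        for i in range(rounds):
            statuses = await asyncio.gather(
                *(one_request(session) for _ in range(concurrency)))
            # A sudden wave of HTTP 500s here corresponds to the
            # TimeoutError / AsyncEngineDeadError burst in the server log.
            print(f"round {i}: {statuses.count(200)}/{concurrency} ok")


if __name__ == "__main__":
    asyncio.run(main())
```

Once the engine dies, requests typically keep returning HTTP 500 with the AsyncEngineDeadError message until the server is restarted.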