vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: RuntimeError: CUDA error: an illegal memory access was encountered #5371

Open gaye746560359 opened 2 months ago

gaye746560359 commented 2 months ago

Your current environment

The output of `python collect_env.py`

vLLM 0.4.3, NVIDIA GeForce RTX 4090, driver 555.99

🐛 Describe the bug

2024-06-10 13:26:25 Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f1e8f46caf0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f1e847108e0>>)
2024-06-10 13:26:25 handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f1e8f46caf0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f1e847108e0>>)>
2024-06-10 13:26:25 Traceback (most recent call last):
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
2024-06-10 13:26:25     task.result()
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
2024-06-10 13:26:25     has_requests_in_progress = await asyncio.wait_for(
2024-06-10 13:26:25   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
2024-06-10 13:26:25     return fut.result()
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
2024-06-10 13:26:25     request_outputs = await self.engine.step_async()
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
2024-06-10 13:26:25     output = await self.model_executor.execute_model_async(
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
2024-06-10 13:26:25     output = await make_async(self.driver_worker.execute_model
2024-06-10 13:26:25   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2024-06-10 13:26:25     result = self.fn(*self.args, **self.kwargs)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-06-10 13:26:25     return func(*args, **kwargs)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 272, in execute_model
2024-06-10 13:26:25     output = self.model_runner.execute_model(seq_group_metadata_list,
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-06-10 13:26:25     return func(*args, **kwargs)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 738, in execute_model
2024-06-10 13:26:25     output = self.model.sample(
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 345, in sample
2024-06-10 13:26:25     next_tokens = self.sampler(logits, sampling_metadata)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-06-10 13:26:25     return self._call_impl(*args, **kwargs)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-06-10 13:26:25     return forward_call(*args, **kwargs)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 96, in forward
2024-06-10 13:26:25     sample_results, maybe_sampled_tokens_tensor = _sample(
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 655, in _sample
2024-06-10 13:26:25     return _sample_with_torch(
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 544, in _sample_with_torch
2024-06-10 13:26:25     sample_results = _random_sample(seq_groups,
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 324, in _random_sample
2024-06-10 13:26:25     random_samples = random_samples.cpu()
2024-06-10 13:26:25 RuntimeError: CUDA error: an illegal memory access was encountered
2024-06-10 13:26:25 CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-06-10 13:26:25 For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2024-06-10 13:26:25 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-06-10 13:26:25
2024-06-10 13:26:25
2024-06-10 13:26:25 The above exception was the direct cause of the following exception:
2024-06-10 13:26:25
2024-06-10 13:26:25 Traceback (most recent call last):
2024-06-10 13:26:25   File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _raise_exception_on_finish
2024-06-10 13:26:25     raise AsyncEngineDeadError(
2024-06-10 13:26:25 vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
2024-06-10 13:26:25 ERROR: Exception in ASGI application
2024-06-10 13:26:25 Traceback (most recent call last):
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
2024-06-10 13:26:25     await wrap(partial(self.listen_for_disconnect, receive))
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
2024-06-10 13:26:25     await func()
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
2024-06-10 13:26:25     message = await receive()
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 54, in wrapped_receive
2024-06-10 13:26:25     msg = await self.receive()
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 553, in receive
2024-06-10 13:26:25     await self.message_event.wait()
2024-06-10 13:26:25   File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
2024-06-10 13:26:25     await fut
2024-06-10 13:26:25 asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f1e7f94bd60
2024-06-10 13:26:25
2024-06-10 13:26:25 During handling of the above exception, another exception occurred:
2024-06-10 13:26:25
2024-06-10 13:26:25 Traceback (most recent call last):
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 192, in __call__
2024-06-10 13:26:25     await response(scope, wrapped_receive, send)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
2024-06-10 13:26:25     async with anyio.create_task_group() as task_group:
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
2024-06-10 13:26:25     raise BaseExceptionGroup(
2024-06-10 13:26:25 exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
2024-06-10 13:26:25
2024-06-10 13:26:25 During handling of the above exception, another exception occurred:
2024-06-10 13:26:25
2024-06-10 13:26:25 Traceback (most recent call last):
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/starlette/_utils.py", line 87, in collapse_excgroups
2024-06-10 13:26:25     yield
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 190, in __call__
2024-06-10 13:26:25     async with anyio.create_task_group() as task_group:
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
2024-06-10 13:26:25     raise BaseExceptionGroup(
2024-06-10 13:26:25 exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
2024-06-10 13:26:25
2024-06-10 13:26:25 During handling of the above exception, another exception occurred:
2024-06-10 13:26:25
2024-06-10 13:26:25 Traceback (most recent call last):
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
2024-06-10 13:26:25     result = await app( # type: ignore[func-returns-value]
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
2024-06-10 13:26:25     return await self.app(scope, receive, send)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
2024-06-10 13:26:25     await super().__call__(scope, receive, send)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
2024-06-10 13:26:25     await self.middleware_stack(scope, receive, send)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
2024-06-10 13:26:25     raise exc
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
2024-06-10 13:26:25     await self.app(scope, receive, _send)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 189, in __call__
2024-06-10 13:26:25     with collapse_excgroups():
2024-06-10 13:26:25   File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
2024-06-10 13:26:25     self.gen.throw(typ, value, traceback)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/starlette/_utils.py", line 93, in collapse_excgroups
2024-06-10 13:26:25     raise exc
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
2024-06-10 13:26:25     await func()
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 250, in stream_response
2024-06-10 13:26:25     async for chunk in self.body_iterator:
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 227, in chat_completion_stream_generator
2024-06-10 13:26:25     async for res in result_generator:
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 662, in generate
2024-06-10 13:26:25     async for output in self._process_request(
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 769, in _process_request
2024-06-10 13:26:25     raise e
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 765, in _process_request
2024-06-10 13:26:25     async for request_output in stream:
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 80, in __anext__
2024-06-10 13:26:25     raise result
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
2024-06-10 13:26:25     task.result()
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
2024-06-10 13:26:25     has_requests_in_progress = await asyncio.wait_for(
2024-06-10 13:26:25   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
2024-06-10 13:26:25     return fut.result()
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
2024-06-10 13:26:25     request_outputs = await self.engine.step_async()
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
2024-06-10 13:26:25     output = await self.model_executor.execute_model_async(
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
2024-06-10 13:26:25     output = await make_async(self.driver_worker.execute_model
2024-06-10 13:26:25   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2024-06-10 13:26:25     result = self.fn(*self.args, **self.kwargs)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-06-10 13:26:25     return func(*args, **kwargs)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 272, in execute_model
2024-06-10 13:26:25     output = self.model_runner.execute_model(seq_group_metadata_list,
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-06-10 13:26:25     return func(*args, **kwargs)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 738, in execute_model
2024-06-10 13:26:25     output = self.model.sample(
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 345, in sample
2024-06-10 13:26:25     next_tokens = self.sampler(logits, sampling_metadata)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-06-10 13:26:25     return self._call_impl(*args, **kwargs)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-06-10 13:26:25     return forward_call(*args, **kwargs)
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 96, in forward
2024-06-10 13:26:25     sample_results, maybe_sampled_tokens_tensor = _sample(
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 655, in _sample
2024-06-10 13:26:25     return _sample_with_torch(
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 544, in _sample_with_torch
2024-06-10 13:26:25     sample_results = _random_sample(seq_groups,
2024-06-10 13:26:25   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 324, in _random_sample
2024-06-10 13:26:25     random_samples = random_samples.cpu()
2024-06-10 13:26:25 RuntimeError: CUDA error: an illegal memory access was encountered
2024-06-10 13:26:25 CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-06-10 13:26:25 For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2024-06-10 13:26:25 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-06-10 13:26:25
2024-06-10 13:32:41 INFO 06-10 05:32:41 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.4%, CPU KV cache usage: 0.0%.
2024-06-10 13:32:51 INFO 06-10 05:32:51 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.4%, CPU KV cache usage: 0.0%.
2024-06-10 13:33:01 INFO 06-10 05:33:01 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.4%, CPU KV cache usage: 0.0%.
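
The traceback's own hint (`CUDA_LAUNCH_BLOCKING=1`) is probably the quickest way to see which kernel actually faults, since the error only surfaces later at the `random_samples.cpu()` copy. A minimal, hypothetical repro sketch outside the API server (the model name, prompt, and sampling settings are placeholders, not taken from this report):

```python
import os

# Force synchronous CUDA kernel launches so the illegal memory access is
# reported at the kernel that faults, not at a later .cpu() copy.
# Must be set before torch/vLLM initialize CUDA.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from vllm import LLM, SamplingParams

# Placeholder model and prompt; substitute whatever workload triggers the crash.
llm = LLM(model="Qwen/Qwen2-7B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

for out in llm.generate(["hello"], params):
    print(out.outputs[0].text)
```

With synchronous launches the RuntimeError should point at the failing kernel rather than the sampler's `.cpu()` call; expect generation to be noticeably slower while the flag is set.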

DreamGenX commented 1 month ago

Seeing a similar segfault with 0.5.0.post1 from time to time:

model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 776, in _process_request
model_server-1  |     |     async for request_output in stream:
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 89, in __anext__
model_server-1  |     |     raise result
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
model_server-1  |     |     await func()
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 250, in stream_response
model_server-1  |     |     async for chunk in self.body_iterator:
model_server-1  |     |   File "/vllm-workspace/src/api_v2.py", line 267, in stream_results
model_server-1  |     |     raise e
model_server-1  |     |   File "/vllm-workspace/src/api_v2.py", line 223, in stream_results
model_server-1  |     |     async for request_output in results_generator:
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 673, in generate
model_server-1  |     |     async for output in self._process_request(
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 780, in _process_request
model_server-1  |     |     raise e
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 776, in _process_request
model_server-1  |     |     async for request_output in stream:
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 89, in __anext__
model_server-1  |     |     raise result
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
model_server-1  |     |     await func()
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 250, in stream_response
model_server-1  |     |     async for chunk in self.body_iterator:
model_server-1  |     |   File "/vllm-workspace/src/api_v2.py", line 267, in stream_results
model_server-1  |     |     raise e
model_server-1  |     |   File "/vllm-workspace/src/api_v2.py", line 223, in stream_results
model_server-1  |     |     async for request_output in results_generator:
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 673, in generate
model_server-1  |     |     async for output in self._process_request(
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 780, in _process_request
model_server-1  |     |     raise e
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 776, in _process_request
model_server-1  |     |     async for request_output in stream:
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 89, in __anext__
model_server-1  |     |     raise result
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
model_server-1  |     |     return_value = task.result()
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
model_server-1  |     |     has_requests_in_progress = await asyncio.wait_for(
model_server-1  |     |   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
model_server-1  |     |     return fut.result()
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
model_server-1  |     |     request_outputs = await self.engine.step_async()
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
model_server-1  |     |     output = await self.model_executor.execute_model_async(
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 166, in execute_model_async
model_server-1  |     |     return await self._driver_execute_model_async(execute_model_req)
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 149, in _driver_execute_model_async
model_server-1  |     |     return await self.driver_exec_model(execute_model_req)
model_server-1  |     |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
model_server-1  |     |     result = self.fn(*self.args, **self.kwargs)
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
model_server-1  |     |     return func(*args, **kwargs)
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 280, in execute_model
model_server-1  |     |     output = self.model_runner.execute_model(seq_group_metadata_list,
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
model_server-1  |     |     return func(*args, **kwargs)
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 765, in execute_model
model_server-1  |     |     output = self.model.sample(
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 386, in sample
model_server-1  |     |     next_tokens = self.sampler(logits, sampling_metadata)
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
model_server-1  |     |     return self._call_impl(*args, **kwargs)
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
model_server-1  |     |     return forward_call(*args, **kwargs)
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 96, in forward
model_server-1  |     |     sample_results, maybe_sampled_tokens_tensor = _sample(
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 655, in _sample
model_server-1  |     |     return _sample_with_torch(
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 544, in _sample_with_torch
model_server-1  |     |     sample_results = _random_sample(seq_groups,
model_server-1  |     |   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 324, in _random_sample
model_server-1  |     |     random_samples = random_samples.cpu()
model_server-1  |     | RuntimeError: CUDA error: device-side assert triggered
model_server-1  |     | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
model_server-1  |     | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
model_server-1  |     | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
model_server-1  |     |
model_server-1  |     +------------------------------------
model_server-1  | [rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
model_server-1  | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
model_server-1  | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
model_server-1  | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

model_server-1  | Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
model_server-1  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f34c72cf897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
model_server-1  | frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f34c727fb25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
model_server-1  | frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f34c73a7718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
model_server-1  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f347b276e36 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
model_server-1  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f347b27af38 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
model_server-1  | frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f347b2805ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
model_server-1  | frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f347b28131c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
model_server-1  | frame #7: <unknown function> + 0xdc253 (0x7f34c7b8b253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
model_server-1  | frame #8: <unknown function> + 0x94ac3 (0x7f34c9af4ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
model_server-1  | frame #9: clone + 0x44 (0x7f34c9b85a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
model_server-1  |
model_server-1  | [2024-07-25 18:12:35,893 E 1 7228] logging.cc:108: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
model_server-1  | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
model_server-1  | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
model_server-1  | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
model_server-1  |
model_server-1  | Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
model_server-1  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f34c72cf897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
model_server-1  | frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f34c727fb25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
model_server-1  | frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f34c73a7718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
model_server-1  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f347b276e36 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
model_server-1  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f347b27af38 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
model_server-1  | frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f347b2805ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
model_server-1  | frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f347b28131c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
model_server-1  | frame #7: <unknown function> + 0xdc253 (0x7f34c7b8b253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
model_server-1  | frame #8: <unknown function> + 0x94ac3 (0x7f34c9af4ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
model_server-1  | frame #9: clone + 0x44 (0x7f34c9b85a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
model_server-1  |
model_server-1  | Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
model_server-1  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f34c72cf897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
model_server-1  | frame #1: <unknown function> + 0xe32e33 (0x7f347af03e33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
model_server-1  | frame #2: <unknown function> + 0xdc253 (0x7f34c7b8b253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
model_server-1  | frame #3: <unknown function> + 0x94ac3 (0x7f34c9af4ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
model_server-1  | frame #4: clone + 0x44 (0x7f34c9b85a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
model_server-1  |
model_server-1  | [2024-07-25 18:12:35,899 E 1 7228] logging.cc:115: Stack trace:
model_server-1  |  /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x1033f2a) [0x7f3286731f2a] ray::operator<<()
model_server-1  | /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x1036f72) [0x7f3286734f72] ray::TerminateHandler()
model_server-1  | /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f34c7b5d20c]
model_server-1  | /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7f34c7b5d277]
model_server-1  | /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7f34c7b5d1fe]
model_server-1  | /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe32ee4) [0x7f347af03ee4] c10d::ProcessGroupNCCL::ncclCommWatchdog()
model_server-1  | /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f34c7b8b253]
model_server-1  | /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f34c9af4ac3]
model_server-1  | /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f34c9b85a04] __clone
model_server-1  |
model_server-1  | *** SIGABRT received at time=1721931155 on cpu 107 ***
model_server-1  | PC: @     0x7f34c9af69fc  (unknown)  pthread_kill
model_server-1  |     @     0x7f34c9aa2520  (unknown)  (unknown)
model_server-1  | [2024-07-25 18:12:35,900 E 1 7228] logging.cc:343: *** SIGABRT received at time=1721931155 on cpu 107 ***
model_server-1  | [2024-07-25 18:12:35,900 E 1 7228] logging.cc:343: PC: @     0x7f34c9af69fc  (unknown)  pthread_kill
model_server-1  | [2024-07-25 18:12:35,900 E 1 7228] logging.cc:343:     @     0x7f34c9aa2520  (unknown)  (unknown)
model_server-1  | Fatal Python error: Aborted
model_server-1  |
model_server-1  |
model_server-1  | Extension modules: ujson, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, simplejson._speedups, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, sentencepiece._sentencepiece, PIL._imaging, httptools.parser.parser, httptools.parser.url_parser, websockets.speedups (total: 38)
model_server-1  | [failure_signal_handler.cc : 332] RAW: Signal 11 raised at PC=0x7f34c9a88898 while already in AbslFailureSignalHandler()
model_server-1  | *** SIGSEGV received at time=1721931155 on cpu 107 ***
model_server-1  | PC: @     0x7f34c9a88898  (unknown)  abort
model_server-1  |     @     0x7f34c9aa2520  633865568  (unknown)
model_server-1  |     @     0x7f0ba3ffd640  (unknown)  (unknown)
model_server-1  | [2024-07-25 18:12:35,901 E 1 7228] logging.cc:343: *** SIGSEGV received at time=1721931155 on cpu 107 ***
model_server-1  | [2024-07-25 18:12:35,901 E 1 7228] logging.cc:343: PC: @     0x7f34c9a88898  (unknown)  abort
model_server-1  | [2024-07-25 18:12:35,902 E 1 7228] logging.cc:343:     @     0x7f34c9aa2520  633865568  (unknown)
model_server-1  | [2024-07-25 18:12:35,903 E 1 7228] logging.cc:343:     @     0x7f0ba3ffd640  (unknown)  (unknown)
model_server-1  | Fatal Python error: Segmentation fault