[Bug]: llama3-405b-fp8 NCCL communication

wangwensuo commented 3 months ago

Your current environment

The output of `python collect_env.py`

```text Your output of `python collect_env.py` here ``` vllm 0.5.4

🐛 Describe the bug

目前在8 * A800上进行推理，vllm推理的40b fp8版本，然后tp=8，一开始能推理，试了几个case以后就通信异常了 INFO 08-22 06:23:22 metrics.py:406] Avg prompt throughput: 36.2 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%. INFO 08-22 06:23:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%. INFO 08-22 06:23:32 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%. INFO 08-22 06:23:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:23:42 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:23:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:24:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:24:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:24:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:24:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. (VllmWorkerProcess pid=161) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=158) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=160) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=164) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=163) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=162) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=159) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second. ERROR 08-22 06:24:42 async_llm_engine.py:663] Engine iteration timed out. This should never happen! ERROR 08-22 06:24:42 async_llm_engine.py:57] Engine background task failed ERROR 08-22 06:24:42 async_llm_engine.py:57] Traceback (most recent call last): ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop ERROR 08-22 06:24:42 async_llmengine.py:57] done, = await asyncio.wait( ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait ERROR 08-22 06:24:42 async_llm_engine.py:57] return await _wait(fs, timeout, return_when, loop) ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait ERROR 08-22 06:24:42 async_llm_engine.py:57] await waiter ERROR 08-22 06:24:42 async_llm_engine.py:57] asyncio.exceptions.CancelledError ERROR 08-22 06:24:42 async_llm_engine.py:57] ERROR 08-22 06:24:42 async_llm_engine.py:57] During handling of the above exception, another exception occurred: ERROR 08-22 06:24:42 async_llm_engine.py:57] ERROR 08-22 06:24:42 async_llm_engine.py:57] Traceback (most recent call last): ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion ERROR 08-22 06:24:42 async_llm_engine.py:57] return_value = task.result() ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop ERROR 08-22 06:24:42 async_llm_engine.py:57] async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S): ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 95, in aexit ERROR 08-22 06:24:42 async_llm_engine.py:57] self._do_exit(exc_type) ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit ERROR 08-22 06:24:42 async_llm_engine.py:57] raise asyncio.TimeoutError ERROR 08-22 06:24:42 async_llm_engine.py:57] asyncio.exceptions.TimeoutError ERROR:asyncio:Exception in callback _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37 handle: <Handle _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in run_engineloop done, = await asyncio.wait( File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait return await _wait(fs, timeout, return_when, loop) File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait await waiter asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion return_value = task.result() File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 95, in aexit self._do_exit(exc_type) File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit raise asyncio.TimeoutError asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run self._context.run(self._callback, *self._args) File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 59, in _log_task_completion raise AsyncEngineDeadError( vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for theactual cause. INFO 08-22 06:24:42 async_llm_engine.py:181] Aborted request chat-0d19f933a5304a3890a8ffa2c2bc69e9. ERROR: Exception in ASGI application Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in call await wrap(partial(self.listen_for_disconnect, receive)) File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap await func() File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect message = await receive() File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 553, in receive await self.message_event.wait() File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait await fut asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f45dbd9bf10

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi result = await app( # type: ignore[func-returns-value] File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in call return await self.app(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in call await super().call(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in call await self.middleware_stack(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in call raise exc File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in call await self.app(scope, receive, _send) File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in call await self.app(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in call await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app raise exc File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app await app(scope, receive, sender) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in call await self.middleware_stack(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app await route.handle(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle await self.app(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app await wrap_app_handling_exceptions(app, request)(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app raise exc File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app await app(scope, receive, sender) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app await response(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in call async with anyio.create_task_group() as task_group: File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in aexit raise BaseExceptionGroup( exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception) INFO 08-22 06:24:47 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:24:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:25:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:25:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:25:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:25:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. (VllmWorkerProcess pid=162) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=163) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=164) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=160) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=161) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=158) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=159) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second. INFO 08-22 06:25:47 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:25:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:26:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:26:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:26:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:26:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. (VllmWorkerProcess pid=158) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=163) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=162) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=161) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=160) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=164) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=159) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second. INFO 08-22 06:26:47 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:26:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:27:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:27:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:27:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. INFO 08-22 06:27:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. (VllmWorkerProcess pid=163) WARNING 08-22 06:27:42 shm_broadcast.py:386] No available block found in 60 second. (VllmWorkerProcess pid=162) WARNING 08-22 06:27:42 shm_broadcast.py:386] No available block found in 60 second. [rank1]:[E822 06:33:42.457707445 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600073 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4640d588d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4640d5f313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4640d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #5: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #6: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 3 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600073 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4640d588d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4640d5f313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4640d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #5: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #6: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: + 0xe5aa84 (0x7f46409eaa84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #2: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #4: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E822 06:33:42.472676771 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600089 milliseconds before timing out. [rank7]:[E822 06:33:42.472710001 ProcessGroupNCCL.cpp:607] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600089 milliseconds before timing out. [rank2]:[E822 06:33:42.472732829 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553. [rank2]:[E822 06:33:42.472743993 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 2] Timeout at NCCL work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553. [rank2]:[E822 06:33:42.472748107 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank2]:[E822 06:33:42.472750861 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [rank7]:[E822 06:33:42.472758671 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553. [rank7]:[E822 06:33:42.472765960 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 7] Timeout at NCCL work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553. [rank7]:[E822 06:33:42.472770740 ProcessGroupNCCL.cpp:621] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank7]:[E822 06:33:42.472775007 ProcessGroupNCCL.cpp:627] [Rank 7] To avoid data inconsistency, we are taking the entire process down. [rank6]:[E822 06:33:42.473328595 ProcessGroupNCCL.cpp:607] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600090 milliseconds before timing out. [rank6]:[E822 06:33:42.473378363 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553. [rank6]:[E822 06:33:42.473386694 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 6] Timeout at NCCL work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553. [rank6]:[E822 06:33:42.473392166 ProcessGroupNCCL.cpp:621] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank6]:[E822 06:33:42.473396865 ProcessGroupNCCL.cpp:627] [Rank 6] To avoid data inconsistency, we are taking the entire process down. [rank7]:[E822 06:33:42.473871273 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600089 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4640d588d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4640d5f313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4640d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #5: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #6: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

youkaichao commented 3 months ago

can you stably reproduce it?

wangwensuo commented 3 months ago

It's like this every time. After a few inferences, an error is reported, and the GPU usage rate is 100%. Currently, there is no problem with node inference 8b

youkaichao commented 3 months ago

try --disable-custom-all-reduce ?

TangJiakai commented 3 months ago

try --disable-custom-all-reduce ?

Still Not work

vlasenkoalexey commented 2 months ago

Have same issue, happens after model is running for a while.

Startup arguments:

--host=0.0.0.0
--port=7080
--model=gs://...
--tensor-parallel-size=8
--swap-space=16
--gpu-memory-utilization=0.95
--disable-log-stats
--enable-chunked-prefill
--max-num-seqs=12
--max-model-len=131072
--model-type=llama3.1

TTTdas commented 1 month ago

same issue and solved by --disable-custom-all-reduce True

youkaichao commented 1 month ago

@TTTdas I think https://github.com/vllm-project/vllm/pull/8558 should fix it. can you have a try with the latest wheel? see https://docs.vllm.ai/en/latest/getting_started/installation.html#install-the-latest-code for installing the latest wheel.

TTTdas commented 1 month ago

@TTTdas I think #8558 should fix it. can you have a try with the latest wheel? see https://docs.vllm.ai/en/latest/getting_started/installation.html#install-the-latest-code for installing the latest wheel.

yes, it works! thx

vllm-project / vllm

[Bug]: llama3-405b-fp8 NCCL communication #7775

Your current environment

🐛 Describe the bug