vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: call for stack trace for "Watchdog caught collective operation timeout" #6042

Open youkaichao opened 6 days ago

youkaichao commented 6 days ago

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

We have received quite a lot of reports about "Watchdog caught collective operation timeout". The failure is flaky and difficult to reproduce, and it typically happens only after running for some time.

To analyze the error, we need to collect enough stack traces. If you encounter a similar problem, please paste enough of the stack trace for us to debug.
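
When reproducing, extra NCCL/CUDA diagnostics also help. A minimal sketch of switches that are commonly useful here (set them before CUDA/NCCL are initialized, e.g. exported in the shell before launching vLLM; adjust to your setup):

```python
# Sketch: diagnostic switches that usually make these reports easier to debug.
# They must be in the environment before CUDA/NCCL are initialized.
import os

os.environ["NCCL_DEBUG"] = "TRACE"        # verbose NCCL logging (use INFO if TRACE is too noisy)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # report the failing kernel at its call site
# If your vLLM version supports it, VLLM_TRACE_FUNCTION=1 additionally logs every
# Python function call, which helps locate where a worker hangs:
# os.environ["VLLM_TRACE_FUNCTION"] = "1"
```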

Example: https://buildkite.com/vllm/ci-aws/builds/3548#01906e81-54c6-4713-beb7-d08f3c873200 caught one such error.

Please include the first line of the error, together with the Python stack trace.
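
If the process actually aborts, the per-thread Python stacks are usually printed automatically (as in the example below). For a silent hang you can also dump them on demand with the standard-library `faulthandler` module; a minimal sketch (plain Python, not a vLLM option):

```python
# Sketch: dump the Python stack of every thread in a hung worker on demand.
# Add this early in your entrypoint, then run `kill -USR1 <worker pid>` while
# the process is stuck; the stacks are written to stderr.
import faulthandler
import signal

faulthandler.enable(all_threads=True)                    # also print stacks on fatal signals
faulthandler.register(signal.SIGUSR1, all_threads=True)  # print stacks when SIGUSR1 arrives

# Alternative: dump all stacks every 60 s while reproducing the hang.
# faulthandler.dump_traceback_later(timeout=60, repeat=True)
```

Running `py-spy dump --pid <pid>` against each worker process gives similar information without changing any code.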

In the following example, it seems one process hits an illegal memory access. It dies, but the remaining processes are still in an allreduce, waiting for it, which causes the timeout. From the Python-level stack trace, it happens during the profile run, and it seems to be related to the MoE layer. (A minimal sketch of this hang mechanism follows the log below.)

[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
--
  | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
  | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
  | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
  |  
  | Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd5c7e92897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd5c7e42b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fd5c7f6a718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fd57bc4ae36 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fd57bc4ef38 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7fd57bc545ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd57bc5531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #7: <unknown function> + 0xdc253 (0x7fd5c76b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #8: <unknown function> + 0x94ac3 (0x7fd5c90d9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #9: <unknown function> + 0x126850 (0x7fd5c916b850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | terminate called after throwing an instance of 'c10::DistBackendError'
  | what():  [PG 2 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
  | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
  | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
  | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
  |  
  | Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd5c7e92897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd5c7e42b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fd5c7f6a718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
  | frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fd57bc4ae36 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fd57bc4ef38 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7fd57bc545ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd57bc5531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #7: <unknown function> + 0xdc253 (0x7fd5c76b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #8: <unknown function> + 0x94ac3 (0x7fd5c90d9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #9: <unknown function> + 0x126850 (0x7fd5c916b850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd5c7e92897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: <unknown function> + 0xe32e33 (0x7fd57b8d7e33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #2: <unknown function> + 0xdc253 (0x7fd5c76b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #3: <unknown function> + 0x94ac3 (0x7fd5c90d9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #4: <unknown function> + 0x126850 (0x7fd5c916b850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | ERROR 07-01 13:54:43 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 1159 died, exit code: -6
  | INFO 07-01 13:54:43 multiproc_worker_utils.py:123] Killing local vLLM worker processes
  | [rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=134217728, NumelOut=134217728, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
  | [rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 0] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
  | [rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
  | [rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
  | [rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=134217728, NumelOut=134217728, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
  | Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9ed797a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9e8b64f1b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9e8b653fd0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9e8b65531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #4: <unknown function> + 0xdc253 (0x7f9ed70b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #5: <unknown function> + 0x94ac3 (0x7f9f821e3ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #6: <unknown function> + 0x126850 (0x7f9f82275850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | terminate called after throwing an instance of 'c10::DistBackendError'
  | what():  [PG 2 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=134217728, NumelOut=134217728, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
  | Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9ed797a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9e8b64f1b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9e8b653fd0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9e8b65531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #4: <unknown function> + 0xdc253 (0x7f9ed70b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #5: <unknown function> + 0x94ac3 (0x7f9f821e3ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #6: <unknown function> + 0x126850 (0x7f9f82275850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
  | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9ed797a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
  | frame #1: <unknown function> + 0xe32e33 (0x7f9e8b2d7e33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
  | frame #2: <unknown function> + 0xdc253 (0x7f9ed70b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
  | frame #3: <unknown function> + 0x94ac3 (0x7f9f821e3ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | frame #4: <unknown function> + 0x126850 (0x7f9f82275850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  |  
  | Fatal Python error: Aborted
  |  
  | Thread 0x00007f96975fe640 (most recent call first):
  | File "/usr/lib/python3.10/threading.py", line 320 in wait
  | File "/usr/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
  | File "/usr/lib/python3.10/threading.py", line 953 in run
  | File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  | File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
  |  
  | Thread 0x00007f96acbfd640 (most recent call first):
  | File "/usr/lib/python3.10/threading.py", line 324 in wait
  | File "/usr/lib/python3.10/threading.py", line 607 in wait
  | File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  | File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  | File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
  |  
  | Thread 0x00007f9f8214e480 (most recent call first):
  | File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 854 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 357 in topk_softmax
  | File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 34 in wrapper
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 357 in fused_topk
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 549 in fused_moe
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 273 in forward
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 426 in forward
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 470 in forward
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 540 in forward
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1135 in execute_model
  | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 814 in profile_run
  | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 175 in determine_num_available_blocks
  | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 122 in _run_workers
  | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 38 in determine_num_available_blocks
  | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 344 in _initialize_kv_caches
  | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251 in __init__
  | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 405 in from_engine_args
  | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 144 in __init__
  | File "/usr/local/lib/python3.10/dist-packages/lm_eval/models/vllm_causallms.py", line 97 in __init__
  | File "/usr/local/lib/python3.10/dist-packages/lm_eval/api/model.py", line 133 in create_from_arg_string
  | File "/usr/local/lib/python3.10/dist-packages/lm_eval/evaluator.py", line 164 in simple_evaluate
  | File "/usr/local/lib/python3.10/dist-packages/lm_eval/utils.py", line 288 in _wrapper
  | File "/vllm-workspace/.buildkite/lm-eval-harness/test_lm_eval_correctness.py", line 29 in launch_lm_eval
  | File "/vllm-workspace/.buildkite/lm-eval-harness/test_lm_eval_correctness.py", line 45 in test_lm_eval_correctness
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/python.py", line 162 in pytest_pyfunc_call
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/python.py", line 1632 in runtest
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 173 in pytest_runtest_call
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 241 in <lambda>
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 341 in from_call
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 240 in call_and_report
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 135 in runtestprotocol
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 116 in pytest_runtest_protocol
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 364 in pytest_runtestloop
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 339 in _main
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 285 in wrap_session
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 332 in pytest_cmdline_main
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
  | File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py", line 178 in main
  | File "/usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py", line 206 in console_main
  | File "/usr/local/bin/pytest", line 8 in <module>
oliver-li commented 15 hours ago
(RayWorkerWrapper pid=4009) [rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=158046, OpType=GATHER, NumelIn=2867200, NumelOut=0, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
(RayWorkerWrapper pid=4009) [rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 1] Timeout at NCCL work: 158046, last enqueued NCCL work: 158046, last completed NCCL work: 158045.
(RayWorkerWrapper pid=4009) [rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorkerWrapper pid=4009) [rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
(RayWorkerWrapper pid=4009) [rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=158046, OpType=GATHER, NumelIn=2867200, NumelOut=0, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
(RayWorkerWrapper pid=4009) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
(RayWorkerWrapper pid=4009) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa2ea16b897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=4009) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9a81b4c5a2 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9a81b513c0 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9a81b5270c in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #4: <unknown function> + 0x1c220 (0x7fa2ea2b7220 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch.so)
(RayWorkerWrapper pid=4009) frame #5: <unknown function> + 0x7ea5 (0x7fa2fe195ea5 in /usr/lib64/libpthread.so.0)
(RayWorkerWrapper pid=4009) frame #6: clone + 0x6d (0x7fa2fd7b5b0d in /usr/lib64/libc.so.6)
(RayWorkerWrapper pid=4009) 
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,171 E 4009 4134] logging.cc:101: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=158046, OpType=GATHER, NumelIn=2867200, NumelOut=0, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
(RayWorkerWrapper pid=4009) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
(RayWorkerWrapper pid=4009) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa2ea16b897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=4009) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9a81b4c5a2 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9a81b513c0 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9a81b5270c in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #4: <unknown function> + 0x1c220 (0x7fa2ea2b7220 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch.so)
(RayWorkerWrapper pid=4009) frame #5: <unknown function> + 0x7ea5 (0x7fa2fe195ea5 in /usr/lib64/libpthread.so.0)
(RayWorkerWrapper pid=4009) frame #6: clone + 0x6d (0x7fa2fd7b5b0d in /usr/lib64/libc.so.6)
(RayWorkerWrapper pid=4009) 
(RayWorkerWrapper pid=4009) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
(RayWorkerWrapper pid=4009) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa2ea16b897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=4009) frame #1: <unknown function> + 0xe083a9 (0x7f9a817d73a9 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #2: <unknown function> + 0x1c220 (0x7fa2ea2b7220 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch.so)
(RayWorkerWrapper pid=4009) frame #3: <unknown function> + 0x7ea5 (0x7fa2fe195ea5 in /usr/lib64/libpthread.so.0)
(RayWorkerWrapper pid=4009) frame #4: clone + 0x6d (0x7fa2fd7b5b0d in /usr/lib64/libc.so.6)
(RayWorkerWrapper pid=4009) 
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,182 E 4009 4134] logging.cc:108: Stack trace: 
(RayWorkerWrapper pid=4009)  /usr/local/lib/python3.9/site-packages/ray/_raylet.so(+0x1021b3a) [0x7fa2fa068b3a] ray::operator<<()
(RayWorkerWrapper pid=4009) /usr/local/lib/python3.9/site-packages/ray/_raylet.so(+0x10245f8) [0x7fa2fa06b5f8] ray::TerminateHandler()
(RayWorkerWrapper pid=4009) /usr/lib64/libstdc++.so.6(+0x5ea06) [0x7fa2fd000a06]
(RayWorkerWrapper pid=4009) /usr/lib64/libstdc++.so.6(+0x5ea33) [0x7fa2fd000a33]
(RayWorkerWrapper pid=4009) /usr/lib64/libstdc++.so.6(+0x5e9f4) [0x7fa2fd0009f4]
(RayWorkerWrapper pid=4009) /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so(+0xe0845a) [0x7f9a817d745a] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorkerWrapper pid=4009) /usr/local/lib/python3.9/site-packages/torch/lib/libtorch.so(+0x1c220) [0x7fa2ea2b7220] execute_native_thread_routine
(RayWorkerWrapper pid=4009) /usr/lib64/libpthread.so.0(+0x7ea5) [0x7fa2fe195ea5] start_thread
(RayWorkerWrapper pid=4009) /usr/lib64/libc.so.6(clone+0x6d) [0x7fa2fd7b5b0d] clone
(RayWorkerWrapper pid=4009) 
(RayWorkerWrapper pid=4009) *** SIGABRT received at time=1720159248 on cpu 5 ***
(RayWorkerWrapper pid=4009) PC: @     0x7fa2fd6ed387  (unknown)  raise
(RayWorkerWrapper pid=4009)     @     0x7fa2fe19d630       3504  (unknown)
(RayWorkerWrapper pid=4009)     @     0x7fa2fd000a06  (unknown)  (unknown)
(RayWorkerWrapper pid=4009)     @     0x7f9968118048  1728156448  (unknown)
(RayWorkerWrapper pid=4009)     @     0x7f9abe1a6580  (unknown)  (unknown)
(RayWorkerWrapper pid=4009)     @ 0xcde907894810c083  (unknown)  (unknown)
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,200 E 4009 4134] logging.cc:365: *** SIGABRT received at time=1720159248 on cpu 5 ***
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,200 E 4009 4134] logging.cc:365: PC: @     0x7fa2fd6ed387  (unknown)  raise
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,201 E 4009 4134] logging.cc:365:     @     0x7fa2fe19d630       3504  (unknown)
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,201 E 4009 4134] logging.cc:365:     @     0x7fa2fd000a06  (unknown)  (unknown)
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,202 E 4009 4134] logging.cc:365:     @     0x7f9968118048  1728156448  (unknown)
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,214 E 4009 4134] logging.cc:365:     @     0x7f9abe1a6580  (unknown)  (unknown)
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,215 E 4009 4134] logging.cc:365:     @ 0xcde907894810c083  (unknown)  (unknown)
(RayWorkerWrapper pid=4009) Fatal Python error: Aborted
(RayWorkerWrapper pid=4009) 
[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=5
[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 5
*** SIGABRT received at time=1720159428 on cpu 134 ***
PC: @     0x7fe6c1e3c387  (unknown)  raise
    @     0x7fe6c28ec630  1656596208  (unknown)
    @ ... and at least 1 more frames
[2024-07-05 14:03:48,574 E 2878 4115] logging.cc:365: *** SIGABRT received at time=1720159428 on cpu 134 ***
[2024-07-05 14:03:48,574 E 2878 4115] logging.cc:365: PC: @     0x7fe6c1e3c387  (unknown)  raise
[2024-07-05 14:03:48,574 E 2878 4115] logging.cc:365:     @     0x7fe6c28ec630  1656596208  (unknown)
[2024-07-05 14:03:48,574 E 2878 4115] logging.cc:365:     @ ... and at least 1 more frames
Fatal Python error: Aborted
haichuan1221 commented 5 hours ago

I have the same issue. Did you manage to solve it?

haichuan1221 commented 5 hours ago


Here is the log I get when I run a Llama model:

[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76bfecb897 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f76bfe7bb25 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f76bffa3718 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f76c11a1e36 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f76c11a5f38 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f76c11ab5ac in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76c11ac31c in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x7f770cc5ebf4 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f770df21609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f770dcec353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76bfecb897 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f76bfe7bb25 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f76bffa3718 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f76c11a1e36 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f76c11a5f38 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f76c11ab5ac in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76c11ac31c in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x7f770cc5ebf4 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f770df21609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f770dcec353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76bfecb897 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f76bfe7bb25 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f76bffa3718 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f76c11a1e36 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f76c11a5f38 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f76c11ab5ac in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76c11ac31c in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x7f770cc5ebf4 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f770df21609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f770dcec353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76bfecb897 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f76bfe7bb25 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f76bffa3718 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f76c11a1e36 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f76c11a5f38 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f76c11ab5ac in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76c11ac31c in /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x7f770cc5ebf4 in /mnt/harddisk/miniconda3/envs/vllm_w4a8/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f770df21609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f770dcec353 in /lib/x86_64-linux-gnu/libc.so.6)

/mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
^C^C^C^C^C
Aborted (core dumped)