vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered #8177

Open DreamGenX opened 2 months ago

DreamGenX commented 2 months ago

Your current environment

The output of `python collect_env.py`. ```text Collecting environment information... PyTorch version: 2.4.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Ubuntu 20.04.6 LTS (x86_64) GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 Clang version: Could not collect CMake version: Could not collect Libc version: glibc-2.31 Python version: 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime) Python platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.31 Is CUDA available: True CUDA runtime version: Could not collect CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA H100 80GB HBM3 GPU 1: NVIDIA H100 80GB HBM3 GPU 2: NVIDIA H100 80GB HBM3 GPU 3: NVIDIA H100 80GB HBM3 Nvidia driver version: 550.54.15 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 46 bits physical, 57 bits virtual CPU(s): 120 On-line CPU(s) list: 0-119 Thread(s) per core: 1 Core(s) per socket: 1 Socket(s): 120 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 143 Model name: Intel(R) Xeon(R) Platinum 8462Y+ Stepping: 8 CPU MHz: 2800.000 BogoMIPS: 5600.00 Virtualization: VT-x Hypervisor vendor: KVM Virtualization type: full L1d cache: 3.8 MiB L1i cache: 3.8 MiB L2 cache: 480 MiB L3 cache: 1.9 GiB NUMA node0 CPU(s): 0-119 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Unknown: No mitigations Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Mitigation; TSX disabled Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk avx512_fp16 arch_capabilities Versions of relevant libraries: [pip3] flashinfer==0.1.4+cu121torch2.4 [pip3] numpy==1.26.4 [pip3] nvidia-cublas-cu12==12.1.3.1 [pip3] nvidia-cuda-cupti-cu12==12.1.105 [pip3] nvidia-cuda-nvrtc-cu12==12.1.105 [pip3] nvidia-cuda-runtime-cu12==12.1.105 [pip3] nvidia-cudnn-cu12==9.1.0.70 [pip3] nvidia-cufft-cu12==11.0.2.54 [pip3] nvidia-curand-cu12==10.3.2.106 [pip3] nvidia-cusolver-cu12==11.4.5.107 [pip3] nvidia-cusparse-cu12==12.1.0.106 
[pip3] nvidia-ml-py==12.560.30 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] nvidia-nvjitlink-cu12==12.6.20 [pip3] nvidia-nvtx-cu12==12.1.105 [pip3] pyzmq==26.2.0 [pip3] torch==2.4.0 [pip3] torchvision==0.19.0 [pip3] transformers==4.44.2 [pip3] triton==3.0.0 [conda] Could not collect ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.5.5@ vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X NV6 NV6 NV6 0-119 0 N/A GPU1 NV6 X NV6 NV6 0-119 0 N/A GPU2 NV6 NV6 X NV6 0-119 0 N/A GPU3 NV6 NV6 NV6 X 0-119 0 N/A Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks ```

Environment summary: vLLM 0.5.5 Docker image on 4x H100 SXM
Model summary: Llama 3 70B in FP8, quantized with AutoFP8
Runtime summary:

--gpu-memory-utilization=0.95 --tensor-parallel-size=4 --disable-log-requests --enable-chunked-prefill --max-num-batched-tokens=8192
INFO 09-03 19:10:35 config.py:813] Defaulting to use mp for distributed inference
INFO 09-03 19:10:35 config.py:911] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 09-03 19:10:35 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='/model', speculative_config=None, tokenizer='/model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/model, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 09-03 19:10:36 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 120 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 09-03 19:10:36 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=140) INFO 09-03 19:10:36 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=139) INFO 09-03 19:10:36 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=138) INFO 09-03 19:10:36 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 09-03 19:10:42 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=138) INFO 09-03 19:10:42 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=139) INFO 09-03 19:10:42 utils.py:975] Found nccl from library libnccl.so.2
INFO 09-03 19:10:42 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=138) INFO 09-03 19:10:42 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=139) INFO 09-03 19:10:42 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=140) INFO 09-03 19:10:42 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=140) INFO 09-03 19:10:42 pynccl.py:63] vLLM is using nccl==2.20.5

🐛 Describe the bug

AsyncLLMEngine fails with `Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered`.
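For reference, this is roughly how the engine is set up in our server, in case it helps reproduce. It is a sketch, not our exact serving code: the prompt and sampling parameters are placeholders, and only the engine arguments match the runtime summary above (vLLM 0.5.5).

```python
import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Mirrors the CLI flags from the runtime summary above (vLLM 0.5.5).
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(
    model="/model",                 # AutoFP8 Llama 3 70B checkpoint
    quantization="fp8",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.95,
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,
    disable_log_requests=True,
))

async def generate(prompt: str) -> str:
    # One request; in production many of these run concurrently, and the
    # illegal memory access surfaces inside this call under load.
    params = SamplingParams(temperature=0.8, max_tokens=512)  # placeholder values
    final = None
    async for output in engine.generate(prompt, params, request_id=str(uuid.uuid4())):
        final = output
    return final.outputs[0].text

if __name__ == "__main__":
    print(asyncio.run(generate("Hello")))
```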

Click to see full logs ```txt INFO: 172.18.0.1:35722 - "POST /generate HTTP/1.1" 200 OK [rank0]:[E904 19:47:16.692386894 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8d14d8df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8d14d3cd10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8d14e68f08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f8d160853e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f8d1608a600 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f8d160912ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8d160936fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #7: + 0xd6df4 (0x7f8d6381fdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #8: + 0x8609 (0x7f8d66005609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0) frame #9: clone + 0x43 (0x7f8d6613f353 in /usr/lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' ERROR 09-04 19:47:16 async_llm_engine.py:65] Engine background task failed ERROR 09-04 19:47:16 async_llm_engine.py:65] Traceback (most recent call last): ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion ERROR 09-04 19:47:16 async_llm_engine.py:65] return_value = task.result() ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 930, in run_engine_loop ERROR 09-04 19:47:16 async_llm_engine.py:65] result = task.result() ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 873, in engine_step ERROR 09-04 19:47:16 async_llm_engine.py:65] request_outputs = await self.engine.step_async(virtual_engine) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 337, in step_async ERROR 09-04 19:47:16 async_llm_engine.py:65] output = await self.model_executor.execute_model_async( ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 175, in execute_model_async ERROR 09-04 19:47:16 async_llm_engine.py:65] return await self._driver_execute_model_async(execute_model_req) ERROR 09-04 19:47:16 async_llm_engine.py:65] File 
"/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 224, in _driver_execute_model_async ERROR 09-04 19:47:16 async_llm_engine.py:65] return await self.driver_exec_model(execute_model_req) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run ERROR 09-04 19:47:16 async_llm_engine.py:65] result = self.fn(*self.args, **self.kwargs) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 322, in execute_model ERROR 09-04 19:47:16 async_llm_engine.py:65] output = self.model_runner.execute_model( ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context ERROR 09-04 19:47:16 async_llm_engine.py:65] return func(*args, **kwargs) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1415, in execute_model ERROR 09-04 19:47:16 async_llm_engine.py:65] hidden_or_intermediate_states = model_executable( ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl ERROR 09-04 19:47:16 async_llm_engine.py:65] return self._call_impl(*args, **kwargs) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl ERROR 09-04 19:47:16 async_llm_engine.py:65] return forward_call(*args, **kwargs) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 429, in forward ERROR 09-04 19:47:16 async_llm_engine.py:65] model_output = self.model(input_ids, positions, kv_caches, ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl ERROR 09-04 19:47:16 async_llm_engine.py:65] return self._call_impl(*args, **kwargs) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl ERROR 09-04 19:47:16 async_llm_engine.py:65] return forward_call(*args, **kwargs) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 329, in forward ERROR 09-04 19:47:16 async_llm_engine.py:65] hidden_states, residual = layer( ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl ERROR 09-04 19:47:16 async_llm_engine.py:65] return self._call_impl(*args, **kwargs) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl ERROR 09-04 19:47:16 async_llm_engine.py:65] return forward_call(*args, **kwargs) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 251, in forward ERROR 09-04 19:47:16 async_llm_engine.py:65] hidden_states = self.self_attn( ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl ERROR 09-04 19:47:16 async_llm_engine.py:65] return self._call_impl(*args, **kwargs) ERROR 09-04 19:47:16 async_llm_engine.py:65] File 
"/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl ERROR 09-04 19:47:16 async_llm_engine.py:65] return forward_call(*args, **kwargs) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 181, in forward ERROR 09-04 19:47:16 async_llm_engine.py:65] attn_output = self.attn(q, k, v, kv_cache, attn_metadata) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl ERROR 09-04 19:47:16 async_llm_engine.py:65] return self._call_impl(*args, **kwargs) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl ERROR 09-04 19:47:16 async_llm_engine.py:65] return forward_call(*args, **kwargs) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 98, in forward ERROR 09-04 19:47:16 async_llm_engine.py:65] return self.impl.forward(query, ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 692, in forward ERROR 09-04 19:47:16 async_llm_engine.py:65] num_prefill_tokens] = torch.ops.vllm.flash_attn_varlen_func( # noqa ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1061, in __call__ ERROR 09-04 19:47:16 async_llm_engine.py:65] return self_._op(*args, **(kwargs or {})) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py", line 236, in backend_impl ERROR 09-04 19:47:16 async_llm_engine.py:65] result = self._backend_fns[device_type](*args, **kwargs) ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 48, in flash_attn_varlen_func ERROR 09-04 19:47:16 async_llm_engine.py:65] return _flash_attn_varlen_func( ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1154, in flash_attn_varlen_func ERROR 09-04 19:47:16 async_llm_engine.py:65] return FlashAttnVarlenFunc.apply( ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 574, in apply ERROR 09-04 19:47:16 async_llm_engine.py:65] return super().apply(*args, **kwargs) # type: ignore[misc] ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 632, in forward ERROR 09-04 19:47:16 async_llm_engine.py:65] out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward( ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 90, in _flash_attn_varlen_forward ERROR 09-04 19:47:16 async_llm_engine.py:65] out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd( ERROR 09-04 19:47:16 async_llm_engine.py:65] RuntimeError: CUDA error: an illegal memory access was encountered ERROR 09-04 19:47:16 async_llm_engine.py:65] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. 
ERROR 09-04 19:47:16 async_llm_engine.py:65] For debugging consider passing CUDA_LAUNCH_BLOCKING=1 ERROR 09-04 19:47:16 async_llm_engine.py:65] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. ERROR 09-04 19:47:16 async_llm_engine.py:65] what(): [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8d14d8df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8d14d3cd10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8d14e68f08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f8d160853e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f8d1608a600 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f8d160912ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8d160936fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #7: + 0xd6df4 (0x7f8d6381fdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #8: + 0x8609 (0x7f8d66005609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0) frame #9: clone + 0x43 (0x7f8d6613f353 in /usr/lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8d14d8df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: + 0xe5aa84 (0x7f8d15d1ca84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #2: + 0xd6df4 (0x7f8d6381fdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: + 0x8609 (0x7f8d66005609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0) frame #4: clone + 0x43 (0x7f8d6613f353 in /usr/lib/x86_64-linux-gnu/libc.so.6) [rank2]:[E904 19:47:16.699524824 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. ```

I have not found a way to reproduce it consistently, but it happens regularly in a production system under load.

Interestingly, the process does not crash, but /generate no longer works afterwards.

I have found some similar issues, but it is unclear whether they share the same root cause, so I have tried to provide more details above.
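Next time it happens I plan to restart with synchronous kernel launches and more verbose logging, roughly like this (a sketch; `CUDA_LAUNCH_BLOCKING` is what the traceback itself suggests, while `NCCL_DEBUG` and `VLLM_LOGGING_LEVEL` are just assumptions about what extra logging might help):

```python
import os

# These must be set before torch / vLLM create the CUDA context,
# i.e. before any torch or vllm import in the serving process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"    # synchronous launches -> accurate stack traces
os.environ["NCCL_DEBUG"] = "INFO"           # verbose NCCL logs (assumed useful here)
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"  # more verbose engine logs

from vllm.engine.async_llm_engine import AsyncLLMEngine  # noqa: E402  imported only after the env is set
```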


onlinex commented 2 months ago

Seeing this on both v0.5.4 and v0.5.5 with Mixtral 7B GPTQ 8-bit and prefix caching enabled.

DreamGenX commented 2 months ago

@onlinex In my case, prefix caching is not enabled.

eddiegaoo commented 2 months ago

@DreamGenX Hi! I'm hitting exactly the same issue here (with prefix caching disabled). Have you figured out a way to resolve it? In my case, with only 2 GPUs (with P2P connections) the issue never happened, but when I increased to 4 GPUs (still with P2P connections) it occurred :(
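For what it's worth, this is roughly how I checked the P2P topology when scaling from 2 to 4 GPUs (a small sketch using plain PyTorch, nothing vLLM-specific):

```python
import torch

# Print pairwise peer-to-peer accessibility for all visible GPUs.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: P2P {'available' if ok else 'NOT available'}")
```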

DreamGenX commented 2 months ago

@eddiegaoo Try upgrading to 0.6.0 and migrating from AutoFP8 to llm-compressor.
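The requantization step looked roughly like this for me (a sketch based on llm-compressor's FP8 dynamic example at the time; the model ID and output directory are placeholders, so check the current llm-compressor docs for the exact API):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"    # placeholder
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"  # placeholder

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights + dynamic per-token activation scales, skipping lm_head.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# FP8_DYNAMIC is data-free, so no calibration dataset is needed.
oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```

The saved directory can then be passed to vLLM as the model path; it should pick up the quantization config from the checkpoint automatically.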

eddiegaoo commented 2 months ago

> @eddiegaoo Try upgrading to 0.6.0 and migrating from AutoFP8 to llm-compressor.

Many thanks! I'll give it a try

thies1006 commented 2 months ago

I get this error with v0.6.1 and the Meta-Llama-3.1-70B-Instruct-FP8-dynamic model on 8x L4 (without --enable-chunked-prefill).

dhruvmullick commented 1 month ago

I'm getting the same error on v0.6.3.post1 with neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic.

gkm0120 commented 1 week ago

I had the same error using Qwen2.5-72B-Instruct on v0.6.0 with prefix caching enabled (NVIDIA-SMI 550.54.15, Driver Version: 550.54.15, CUDA Version: 12.4):

```
[rank0]:[E1115 12:10:49.629906572 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
```