vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: AttributeError: 'RayGPUExecutorAsync' object has no attribute 'forward_dag' #7871

Open · MrWiffer opened this issue 2 weeks ago

MrWiffer commented 2 weeks ago

Your current environment

The output of `python collect_env.py` ```text Collecting environment information... PyTorch version: 2.3.1+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Ubuntu 22.04.3 LTS (x86_64) GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version: Could not collect CMake version: version 3.30.1 Libc version: glibc-2.35 Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-4.18.0-348.7.1.el8_5.x86_64-x86_64-with-glibc2.35 Is CUDA available: True CUDA runtime version: 12.0.140 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA A10 GPU 1: NVIDIA A10 Nvidia driver version: 525.147.05 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.8.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.8.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.8.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.8.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.8.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.8.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.8.0 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 46 bits physical, 57 bits virtual Byte Order: Little Endian CPU(s): 112 On-line CPU(s) list: 0-111 Vendor ID: GenuineIntel Model name: Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz CPU family: 6 Model: 106 Thread(s) per core: 2 Core(s) per socket: 28 Socket(s): 2 Stepping: 6 BogoMIPS: 5199.99 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid md_clear arch_capabilities Hypervisor vendor: KVM Virtualization type: full L1d cache: 2.6 MiB (56 instances) L1i cache: 1.8 MiB (56 instances) L2 cache: 70 MiB (56 instances) L3 cache: 96 MiB (2 instances) NUMA node(s): 2 NUMA node0 CPU(s): 0-55 NUMA node1 CPU(s): 56-111 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Vulnerable Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Versions of relevant libraries: [pip3] numpy==1.26.4 [pip3] nvidia-cublas-cu12==12.1.3.1 [pip3] nvidia-cuda-cupti-cu12==12.1.105 [pip3] nvidia-cuda-nvrtc-cu12==12.1.105 [pip3] nvidia-cuda-runtime-cu12==12.1.105 [pip3] nvidia-cudnn-cu12==8.9.2.26 [pip3] nvidia-cufft-cu12==11.0.2.54 [pip3] nvidia-curand-cu12==10.3.2.106 [pip3] nvidia-cusolver-cu12==11.4.5.107 [pip3] nvidia-cusparse-cu12==12.1.0.106 [pip3] nvidia-ml-py==12.555.43 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] nvidia-nvjitlink-cu12==12.5.82 [pip3] nvidia-nvtx-cu12==12.1.105 
[pip3] pyzmq==26.0.3 [pip3] torch==2.3.1 [pip3] torchvision==0.18.1 [pip3] transformers==4.43.3 [pip3] triton==2.3.1 [conda] numpy 1.26.4 pypi_0 pypi [conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi [conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cudnn-cu12 8.9.2.26 pypi_0 pypi [conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi [conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi [conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi [conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi [conda] nvidia-ml-py 12.555.43 pypi_0 pypi [conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi [conda] nvidia-nvjitlink-cu12 12.5.82 pypi_0 pypi [conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi [conda] pyzmq 26.0.3 pypi_0 pypi [conda] torch 2.3.1 pypi_0 pypi [conda] torchvision 0.18.1 pypi_0 pypi [conda] transformers 4.43.3 pypi_0 pypi [conda] triton 2.3.1 pypi_0 pypi ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.5.3@bb2fc08072db2d96e547407b4301fb6ba141d9d6 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 GPU1 CPU Affinity NUMA Affinity GPU0 X PHB 0-55 0 GPU1 PHB X 0-55 0 Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks ```

🐛 Describe the bug

I want to run distributed inference with vLLM and NCCL in containers deployed on two different machines; both containers are built from the same image. Here is the startup command on the head node: python -m vllm.entrypoints.openai.api_server --model /root/vllm/models/Qwen1.5-1.8B-Chat --served-model-name qwen --host 0.0.0.0 --port $API_PORT --tensor-parallel-size 4. Through ray status, I confirmed that the containers have 4 available GPUs.
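For context, the Ray cluster behind this is assumed to have been started along the lines of `ray start --head --port=6377` in the head-node container and `ray start --address=192.168.0.3:6377` in the worker-node container, matching the cluster address shown in the log below; the exact flags may differ.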

The errors are as follows:

INFO 08-26 19:08:13 api_server.py:219] vLLM API server version 0.5.3
INFO 08-26 19:08:13 api_server.py:220] args: Namespace(host='0.0.0.0', port=3456, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/root/vllm/models/Qwen1.5-1.8B-Chat', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama3'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 08-26 19:08:13 config.py:724] Defaulting to use ray for distributed inference
2024-08-26 19:08:13,703 INFO worker.py:1596 -- Connecting to existing Ray cluster at address: 192.168.0.3:6377...
2024-08-26 19:08:13,710 INFO worker.py:1781 -- Connected to Ray cluster.
INFO 08-26 19:08:13 llm_engine.py:176] Initializing an LLM engine (v0.5.3) with config: model='/root/vllm/models/Qwen1.5-1.8B-Chat', speculative_config=None, tokenizer='/root/vllm/models/Qwen1.5-1.8B-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=llama3, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-26 19:08:21 utils.py:784] Found nccl from library libnccl.so.2
INFO 08-26 19:08:21 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=3018) INFO 08-26 19:08:21 utils.py:784] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=3018) INFO 08-26 19:08:21 pynccl.py:63] vLLM is using nccl==2.20.5
ERROR 08-26 19:08:22 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 08-26 19:08:22 worker_base.py:382] Traceback (most recent call last):
ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 374, in execute_method
ERROR 08-26 19:08:22 worker_base.py:382]     return executor(*args, **kwargs)
ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 132, in init_device
ERROR 08-26 19:08:22 worker_base.py:382]     init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
ERROR 08-26 19:08:22 worker_base.py:382]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 923, in ensure_model_parallel_initialized
ERROR 08-26 19:08:22 worker_base.py:382]     initialize_model_parallel(tensor_model_parallel_size,
ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 889, in initialize_model_parallel
ERROR 08-26 19:08:22 worker_base.py:382]     _TP = init_model_parallel_group(group_ranks,
ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 732, in init_model_parallel_group
ERROR 08-26 19:08:22 worker_base.py:382]     return GroupCoordinator(
ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 176, in __init__
ERROR 08-26 19:08:22 worker_base.py:382]     self.pynccl_comm = PyNcclCommunicator(
ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
ERROR 08-26 19:08:22 worker_base.py:382]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
ERROR 08-26 19:08:22 worker_base.py:382]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
ERROR 08-26 19:08:22 worker_base.py:382]     raise RuntimeError(f"NCCL error: {error_str}")
ERROR 08-26 19:08:22 worker_base.py:382] RuntimeError: NCCL error: internal error - please report this issue to the NCCL developers
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 317, in <module>
[rank0]:     run_server(args)
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 406, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 61, in _init_executor
[rank0]:     self._init_workers_ray(placement_group)
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 233, in _init_workers_ray
[rank0]:     self._run_workers("init_device")
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 350, in _run_workers
[rank0]:     self.driver_worker.execute_method(method, *driver_args,
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 383, in execute_method
[rank0]:     raise e
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 374, in execute_method
[rank0]:     return executor(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 132, in init_device
[rank0]:     init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
[rank0]:     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 923, in ensure_model_parallel_initialized
[rank0]:     initialize_model_parallel(tensor_model_parallel_size,
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 889, in initialize_model_parallel
[rank0]:     _TP = init_model_parallel_group(group_ranks,
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 732, in init_model_parallel_group
[rank0]:     return GroupCoordinator(
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 176, in __init__
[rank0]:     self.pynccl_comm = PyNcclCommunicator(
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
[rank0]:     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
[rank0]:     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
[rank0]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
[rank0]:     raise RuntimeError(f"NCCL error: {error_str}")
[rank0]: RuntimeError: NCCL error: internal error - please report this issue to the NCCL developers
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382] Traceback (most recent call last):
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 374, in execute_method
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 132, in init_device
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]     init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 923, in ensure_model_parallel_initialized
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]     initialize_model_parallel(tensor_model_parallel_size,
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 889, in initialize_model_parallel
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]     _TP = init_model_parallel_group(group_ranks,
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 732, in init_model_parallel_group
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]     return GroupCoordinator(
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 176, in __init__
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]     self.pynccl_comm = PyNcclCommunicator(
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382]     raise RuntimeError(f"NCCL error: {error_str}")
(RayWorkerWrapper pid=3018) ERROR 08-26 19:08:22 worker_base.py:382] RuntimeError: NCCL error: internal error - please report this issue to the NCCL developers
(RayWorkerWrapper pid=1138, ip=192.168.0.4) INFO 08-26 19:08:21 utils.py:784] Found nccl from library libnccl.so.2 [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=1138, ip=192.168.0.4) INFO 08-26 19:08:21 pynccl.py:63] vLLM is using nccl==2.20.5 [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution. [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382] Traceback (most recent call last): [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 374, in execute_method [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]     return executor(*args, **kwargs) [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 132, in init_device [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]     init_worker_distributed_environment(self.parallel_config, self.rank, [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size, [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 923, in ensure_model_parallel_initialized [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]     initialize_model_parallel(tensor_model_parallel_size, [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 889, in initialize_model_parallel [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]     _TP = init_model_parallel_group(group_ranks, [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 732, in init_model_parallel_group [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]     return GroupCoordinator( [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__ [repeated 4x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]     self.pynccl_comm = PyNcclCommunicator( [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank( [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm), [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382]     raise RuntimeError(f"NCCL error: {error_str}") [repeated 2x across cluster]
(RayWorkerWrapper pid=1138, ip=192.168.0.4) ERROR 08-26 19:08:22 worker_base.py:382] RuntimeError: NCCL error: internal error - please report this issue to the NCCL developers [repeated 2x across cluster]
Exception ignored in: <function RayGPUExecutorAsync.__del__ at 0x7f7041794040>
Traceback (most recent call last):
  File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 473, in __del__
    if self.forward_dag is not None:
AttributeError: 'RayGPUExecutorAsync' object has no attribute 'forward_dag'

To debug these errors, I tried running the test script from https://docs.vllm.ai/en/latest/getting_started/debugging.html (saved as comm_test.py) in both containers. On the head node I run it as: NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1 comm_test.py, and on the worker node as: NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=192.168.0.3 comm_test.py (a rough sketch of comm_test.py is included after the log below). However, it seems there are still some intractable problems; the output on the head node is as follows:

W0826 19:55:26.898000 139753538168640 torch/distributed/run.py:757] *****************************************
W0826 19:55:26.898000 139753538168640 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0826 19:55:26.898000 139753538168640 torch/distributed/run.py:757] *****************************************
cce-e0isdmib-zpcplmr3:3355:3355 [0] NCCL INFO cudaDriverVersion 12000
cce-e0isdmib-zpcplmr3:3355:3355 [0] NCCL INFO Bootstrap : Using eth0:192.168.0.3<0>
cce-e0isdmib-zpcplmr3:3356:3356 [1] NCCL INFO cudaDriverVersion 12000
cce-e0isdmib-zpcplmr3:3355:3355 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cce-e0isdmib-zpcplmr3:3356:3356 [1] NCCL INFO Bootstrap : Using eth0:192.168.0.3<0>
cce-e0isdmib-zpcplmr3:3356:3356 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cce-e0isdmib-zpcplmr3:3355:3366 [0] NCCL INFO Failed to open libibverbs.so[.1]
cce-e0isdmib-zpcplmr3:3355:3366 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.0.3<0> [1]veth1f70dc2:fe80::d092:b3ff:fee7:82f1%veth1f70dc2<0> [2]vethc932007:fe80::f498:dfff:fe75:8fe8%vethc932007<0> [3]vethf38ab65:fe80::a025:eaff:fec1:df1b%vethf38ab65<0>
cce-e0isdmib-zpcplmr3:3355:3366 [0] NCCL INFO Using non-device net plugin version 0
cce-e0isdmib-zpcplmr3:3355:3366 [0] NCCL INFO Using network Socket
cce-e0isdmib-zpcplmr3:3356:3367 [1] NCCL INFO Failed to open libibverbs.so[.1]
cce-e0isdmib-zpcplmr3:3356:3367 [1] NCCL INFO NET/Socket : Using [0]eth0:192.168.0.3<0> [1]veth1f70dc2:fe80::d092:b3ff:fee7:82f1%veth1f70dc2<0> [2]vethc932007:fe80::f498:dfff:fe75:8fe8%vethc932007<0> [3]vethf38ab65:fe80::a025:eaff:fec1:df1b%vethf38ab65<0>
cce-e0isdmib-zpcplmr3:3356:3367 [1] NCCL INFO Using non-device net plugin version 0
cce-e0isdmib-zpcplmr3:3356:3367 [1] NCCL INFO Using network Socket
cce-e0isdmib-zpcplmr3:3356:3367 [1] NCCL INFO comm 0x400abeb0 rank 9 nranks 10 cudaDev 1 nvmlDev 1 busId 62000 commId 0x40912a92e635507e - Init START
cce-e0isdmib-zpcplmr3:3355:3366 [0] NCCL INFO comm 0x3f872940 rank 8 nranks 10 cudaDev 0 nvmlDev 0 busId 61000 commId 0x40912a92e635507e - Init START
cce-e0isdmib-zpcplmr3:3356:3367 [1] NCCL INFO NCCL_SHM_DISABLE set by environment to 1.
cce-e0isdmib-zpcplmr3:3355:3366 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 1.
cce-e0isdmib-zpcplmr3:3356:3367 [1] NCCL INFO Setting affinity for GPU 1 to ffffff,ffffffff
cce-e0isdmib-zpcplmr3:3355:3366 [0] NCCL INFO Setting affinity for GPU 0 to ffffff,ffffffff

cce-e0isdmib-zpcplmr3:3355:3366 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer cce-e0isdmib-clvy1hx9.bj.baidu.internal<34986>
cce-e0isdmib-zpcplmr3:3355:3366 [0] NCCL INFO misc/socket.cc:58 -> 6
cce-e0isdmib-zpcplmr3:3355:3366 [0] NCCL INFO misc/socket.cc:789 -> 6
cce-e0isdmib-zpcplmr3:3355:3366 [0] NCCL INFO bootstrap.cc:75 -> 6
cce-e0isdmib-zpcplmr3:3355:3366 [0] NCCL INFO bootstrap.cc:412 -> 6
cce-e0isdmib-zpcplmr3:3355:3366 [0] NCCL INFO init.cc:1073 -> 6
cce-e0isdmib-zpcplmr3:3355:3366 [0] NCCL INFO init.cc:1501 -> 6
cce-e0isdmib-zpcplmr3:3355:3366 [0] NCCL INFO group.cc:64 -> 6 [Async thread]
cce-e0isdmib-zpcplmr3:3355:3355 [0] NCCL INFO group.cc:418 -> 6
cce-e0isdmib-zpcplmr3:3355:3355 [0] NCCL INFO group.cc:95 -> 6
[rank8]: Traceback (most recent call last):
[rank8]:   File "/root/vllm/comm_test.py", line 8, in <module>
[rank8]:     dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank8]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank8]:     return func(*args, **kwargs)
[rank8]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank8]:     work = group.allreduce([tensor], opts)
[rank8]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, remote process exited or there was a network error, NCCL version 2.20.5
[rank8]: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
[rank8]: Last error:
[rank8]: socketProgress: Connection closed by remote peer cce-e0isdmib-clvy1hx9.bj.baidu.internal<34986>

cce-e0isdmib-zpcplmr3:3356:3367 [1] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer cce-e0isdmib-zpcplmr3<57362>
cce-e0isdmib-zpcplmr3:3356:3367 [1] NCCL INFO misc/socket.cc:58 -> 6
cce-e0isdmib-zpcplmr3:3356:3367 [1] NCCL INFO misc/socket.cc:789 -> 6
cce-e0isdmib-zpcplmr3:3356:3367 [1] NCCL INFO bootstrap.cc:75 -> 6
cce-e0isdmib-zpcplmr3:3356:3367 [1] NCCL INFO bootstrap.cc:412 -> 6
cce-e0isdmib-zpcplmr3:3356:3367 [1] NCCL INFO init.cc:1073 -> 6
cce-e0isdmib-zpcplmr3:3356:3367 [1] NCCL INFO init.cc:1501 -> 6
cce-e0isdmib-zpcplmr3:3356:3367 [1] NCCL INFO group.cc:64 -> 6 [Async thread]
cce-e0isdmib-zpcplmr3:3355:3368 [0] NCCL INFO comm 0x3f872940 rank 8 nranks 10 cudaDev 0 busId 61000 - Abort COMPLETE
cce-e0isdmib-zpcplmr3:3356:3356 [1] NCCL INFO group.cc:418 -> 6
cce-e0isdmib-zpcplmr3:3356:3356 [1] NCCL INFO group.cc:95 -> 6
[rank9]: Traceback (most recent call last):
[rank9]:   File "/root/vllm/comm_test.py", line 8, in <module>
[rank9]:     dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank9]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank9]:     return func(*args, **kwargs)
[rank9]:   File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank9]:     work = group.allreduce([tensor], opts)
[rank9]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, remote process exited or there was a network error, NCCL version 2.20.5
[rank9]: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
[rank9]: Last error:
[rank9]: socketProgress: Connection closed by remote peer cce-e0isdmib-zpcplmr3<57362>
cce-e0isdmib-zpcplmr3:3356:3369 [0] NCCL INFO comm 0x400abeb0 rank 9 nranks 10 cudaDev 1 busId 62000 - Abort COMPLETE
E0826 19:55:40.209000 139753538168640 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 3355) of binary: /root/anaconda3/envs/vllm/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/vllm/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
comm_test.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-08-26_19:55:40
  host      : cce-e0isdmib-zpcplmr3
  rank      : 9 (local_rank: 1)
  exitcode  : 1 (pid: 3356)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-26_19:55:40
  host      : cce-e0isdmib-zpcplmr3
  rank      : 8 (local_rank: 0)
  exitcode  : 1 (pid: 3355)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
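For reference, comm_test.py here is assumed to be essentially the multi-node communication sanity check from the debugging guide linked above; a minimal sketch of that kind of script (the real one may differ in details) is:

```python
# Minimal multi-node all-reduce sanity check, launched with torchrun on
# each node. A sketch of what comm_test.py is assumed to do, based on the
# vLLM debugging guide; the actual script may differ.
import os

import torch
import torch.distributed as dist

# torchrun provides RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, LOCAL_RANK.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Every rank contributes a tensor of ones; after the SUM all-reduce each
# element should equal the world size. In the run above, this is the call
# that fails with ncclRemoteError.
data = torch.ones(128, device="cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
assert data.mean().item() == dist.get_world_size()

print(f"rank {dist.get_rank()}: NCCL all_reduce OK")
dist.destroy_process_group()
```

It is launched on each node with the torchrun commands shown above.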


youkaichao commented 2 weeks ago

Connection closed by remote peer cce-e0isdmib-clvy1hx9.bj.baidu.internal<34986>

Looks like your Docker network setup is incorrect.

It is recommended to use --network=host for Docker.
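For example (illustrative only; adjust the image name and other flags to your setup), starting each container with something like `docker run --gpus all --network=host <your-image>` lets the container share the host's network interfaces, so NCCL sees the host network directly instead of the container-local veth* bridges listed in the NCCL socket output above.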