[Bug]: Reproducing Llama 3.1 distributed inference from the blog

Your current environment

The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.31

Python version: 3.10.14 (main, Apr  6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-100-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 545.23.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      52 bits physical, 57 bits virtual
CPU(s):                             384
On-line CPU(s) list:                0-383
Thread(s) per core:                 2
Core(s) per socket:                 96
Socket(s):                          2
NUMA node(s):                       2
Vendor ID:                          AuthenticAMD
CPU family:                         25
Model:                              17
Model name:                         AMD EPYC 9654 96-Core Processor
Stepping:                           1
Frequency boost:                    enabled
CPU MHz:                            2400.000
CPU max MHz:                        3707.8120
CPU min MHz:                        1500.0000
BogoMIPS:                           4800.14
Virtualization:                     AMD-V
L1d cache:                          6 MiB
L1i cache:                          6 MiB
L2 cache:                           192 MiB
L3 cache:                           768 MiB
NUMA node0 CPU(s):                  0-95,192-287
NUMA node1 CPU(s):                  96-191,288-383
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d

Versions of relevant libraries:
[pip3] flashinfer==0.0.9+cu121torch2.3
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.1
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV18    NV18    NV18    NV18    NV18    NV18    NV18    SYS SYS PIX SYS SYS SYS SYS SYS 0-95,192-287    0       N/A
GPU1    NV18     X  NV18    NV18    NV18    NV18    NV18    NV18    SYS SYS SYS PIX SYS SYS SYS SYS 0-95,192-287    0       N/A
GPU2    NV18    NV18     X  NV18    NV18    NV18    NV18    NV18    SYS PIX SYS SYS SYS SYS SYS SYS 0-95,192-287    0       N/A
GPU3    NV18    NV18    NV18     X  NV18    NV18    NV18    NV18    PIX SYS SYS SYS SYS SYS SYS SYS 0-95,192-287    0       N/A
GPU4    NV18    NV18    NV18    NV18     X  NV18    NV18    NV18    SYS SYS SYS SYS SYS SYS PIX SYS 96-191,288-383  1       N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X  NV18    NV18    SYS SYS SYS SYS SYS SYS SYS PIX 96-191,288-383  1       N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X  NV18    SYS SYS SYS SYS SYS PIX SYS SYS 96-191,288-383  1       N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X  SYS SYS SYS SYS PIX SYS SYS SYS 96-191,288-383  1       N/A
NIC0    SYS SYS SYS PIX SYS SYS SYS SYS  X  SYS SYS SYS SYS SYS SYS SYS
NIC1    SYS SYS PIX SYS SYS SYS SYS SYS SYS  X  SYS SYS SYS SYS SYS SYS
NIC2    PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  SYS SYS SYS SYS SYS
NIC3    SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  SYS SYS SYS SYS
NIC4    SYS SYS SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS  X  SYS SYS SYS
NIC5    SYS SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS  X  SYS SYS
NIC6    SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  SYS
NIC7    SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS  X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7

🐛 Describe the bug

Hi everyone, I am trying to reproduce results from the recent blog on Llama-3.1: https://blog.vllm.ai/2024/07/23/llama31.html. Namely, I am following the docs from https://docs.vllm.ai/en/latest/serving/distributed_serving.html#multi-node-inference-and-serving to set up multi-node serving on two 8xH100 servers.

Step 1) On the first node I am running:

bash run_cluster.sh \
    vllm/vllm-openai \
    "192.168.201.210" \
    --head \
    "/home/eldar/.cache/huggingface"

and on the second node I am running:

bash run_cluster.sh \
    vllm/vllm-openai \
    "192.168.201.210" \
    --worker \
    "/home/eldar/.cache/huggingface"

This step seems to work fine as I am seeing this output in the console:

2024-07-25 04:52:12,672 INFO usage_lib.py:467 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-07-25 04:52:12,672 INFO scripts.py:767 -- Local node IP: 192.168.201.210
2024-07-25 04:52:14,435 SUCC scripts.py:804 -- --------------------
2024-07-25 04:52:14,435 SUCC scripts.py:805 -- Ray runtime started.
2024-07-25 04:52:14,435 SUCC scripts.py:806 -- --------------------
2024-07-25 04:52:14,436 INFO scripts.py:808 -- Next steps
2024-07-25 04:52:14,436 INFO scripts.py:811 -- To add another node to this Ray cluster, run
2024-07-25 04:52:14,436 INFO scripts.py:814 --   ray start --address='192.168.201.210:6379'
2024-07-25 04:52:14,436 INFO scripts.py:823 -- To connect to this Ray cluster:
2024-07-25 04:52:14,436 INFO scripts.py:825 -- import ray
2024-07-25 04:52:14,436 INFO scripts.py:826 -- ray.init()
2024-07-25 04:52:14,436 INFO scripts.py:857 -- To terminate the Ray runtime, run
2024-07-25 04:52:14,436 INFO scripts.py:858 --   ray stop
2024-07-25 04:52:14,436 INFO scripts.py:861 -- To view the status of the cluster, use
2024-07-25 04:52:14,436 INFO scripts.py:862 --   ray status
2024-07-25 04:52:14,436 INFO scripts.py:975 -- --block
2024-07-25 04:52:14,436 INFO scripts.py:976 -- This command will now block forever until terminated by a signal.
2024-07-25 04:52:14,436 INFO scripts.py:979 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.

Step 2): To verify that ray sees all gpus from both servers, I run docker exec -it node /bin/bash and ray status outputs:

======== Autoscaler status: 2024-07-25 06:48:41.734229 ========
Node status
---------------------------------------------------------------
Active:
 1 node_1d3801bc78f72fb8e54d584190a122bdde49b01e625e31b44892795c
 1 node_472d1def3b9489763b8208282e02719a16a0b5b6cd936a30f1c2e27a
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/768.0 CPU
 0.0/16.0 GPU
 0B/2.92TiB memory
 0B/19.46GiB object_store_memory

Demands:
 (no resource demands)

which I assume is an indicator that this stage works fine given that it sees all 16 GPUs.

Step 3): I am trying to kick off the vllm serve command from the head node with:

vllm serve /home/meta-llama/Meta-Llama-3.1-8B-Instruct -tp 8 -pp 2

which crashes due to some gloo-related problem:

root@H100-GPU18:/vllm-workspace# vllm serve /home/meta-llama/Meta-Llama-3.1-8B-Instruct -tp 8 -pp 2
INFO 07-25 06:50:45 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 07-25 06:50:45 api_server.py:220] args: Namespace(model_tag='/home/meta-llama/Meta-Llama-3.1-8B-Instruct', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=2, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7f1e64d0f130>)
INFO 07-25 06:50:45 config.py:718] Defaulting to use ray for distributed inference
WARNING 07-25 06:50:45 arg_utils.py:762] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 07-25 06:50:45 config.py:809] Chunked prefill is enabled with max_num_batched_tokens=512.
2024-07-25 06:50:45,878 INFO worker.py:1603 -- Connecting to existing Ray cluster at address: 192.168.201.210:6379...
2024-07-25 06:50:45,885 INFO worker.py:1788 -- Connected to Ray cluster.
INFO 07-25 06:50:48 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/home/meta-llama/Meta-Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='/home/meta-llama/Meta-Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/meta-llama/Meta-Llama-3.1-8B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
(RayWorkerWrapper pid=439, ip=192.168.201.208) [rank9]:[E ProcessGroupGloo.cpp:144] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[rank0]:[E ProcessGroupGloo.cpp:144] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
ERROR 07-25 06:51:17 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 07-25 06:51:17 worker_base.py:382] Traceback (most recent call last):
ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
ERROR 07-25 06:51:17 worker_base.py:382]     return executor(*args, **kwargs)
ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
ERROR 07-25 06:51:17 worker_base.py:382]     init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 343, in init_worker_distributed_environment
ERROR 07-25 06:51:17 worker_base.py:382]     init_distributed_environment(parallel_config.world_size, rank,
ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 830, in init_distributed_environment
ERROR 07-25 06:51:17 worker_base.py:382]     _WORLD = init_world_group(ranks, local_rank, backend)
ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 714, in init_world_group
ERROR 07-25 06:51:17 worker_base.py:382]     return GroupCoordinator(
ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 149, in __init__
ERROR 07-25 06:51:17 worker_base.py:382]     cpu_group = torch.distributed.new_group(ranks, backend="gloo")
ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
ERROR 07-25 06:51:17 worker_base.py:382]     func_return = func(*args, **kwargs)
ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
ERROR 07-25 06:51:17 worker_base.py:382]     return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag
ERROR 07-25 06:51:17 worker_base.py:382]     pg, pg_store = _new_process_group_helper(
ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
ERROR 07-25 06:51:17 worker_base.py:382]     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
ERROR 07-25 06:51:17 worker_base.py:382] RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/bin/vllm", line 8, in <module>
[rank0]:     sys.exit(main())
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 148, in main
[rank0]:     args.dispatch_function(args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 28, in serve
[rank0]:     run_server(args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 406, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 61, in _init_executor
[rank0]:     self._init_workers_ray(placement_group)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 233, in _init_workers_ray
[rank0]:     self._run_workers("init_device")
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 350, in _run_workers
[rank0]:     self.driver_worker.execute_method(method, *driver_args,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 383, in execute_method
[rank0]:     raise e
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
[rank0]:     return executor(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
[rank0]:     init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 343, in init_worker_distributed_environment
[rank0]:     init_distributed_environment(parallel_config.world_size, rank,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 830, in init_distributed_environment
[rank0]:     _WORLD = init_world_group(ranks, local_rank, backend)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 714, in init_world_group
[rank0]:     return GroupCoordinator(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 149, in __init__
[rank0]:     cpu_group = torch.distributed.new_group(ranks, backend="gloo")
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
[rank0]:     func_return = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
[rank0]:     return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag
[rank0]:     pg, pg_store = _new_process_group_helper(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
[rank0]:     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank0]: RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382] Traceback (most recent call last):
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]     init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 343, in init_worker_distributed_environment
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]     init_distributed_environment(parallel_config.world_size, rank,
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 830, in init_distributed_environment
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]     _WORLD = init_world_group(ranks, local_rank, backend)
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 714, in init_world_group
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]     return GroupCoordinator(
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 149, in __init__
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]     cpu_group = torch.distributed.new_group(ranks, backend="gloo")
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]     func_return = func(*args, **kwargs)
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]     return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]     pg, pg_store = _new_process_group_helper(
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382]     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
(RayWorkerWrapper pid=21468) ERROR 07-25 06:51:17 worker_base.py:382] RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution. [repeated 12x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382] Traceback (most recent call last): [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]     return executor(*args, **kwargs) [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]     init_worker_distributed_environment(self.parallel_config, self.rank, [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 343, in init_worker_distributed_environment [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]     init_distributed_environment(parallel_config.world_size, rank, [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 830, in init_distributed_environment [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]     _WORLD = init_world_group(ranks, local_rank, backend) [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 714, in init_world_group [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]     return GroupCoordinator( [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 149, in __init__ [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]     cpu_group = torch.distributed.new_group(ranks, backend="gloo") [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 89, in wrapper [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]     func_return = func(*args, **kwargs) [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]     return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization) [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]     pg, pg_store = _new_process_group_helper( [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382]     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout) [repeated 12x across cluster]
(RayWorkerWrapper pid=1355, ip=192.168.201.208) ERROR 07-25 06:51:17 worker_base.py:382] RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error [repeated 12x across cluster]
(RayWorkerWrapper pid=22770) [rank7]:[E ProcessGroupGloo.cpp:144] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error [repeated 12x across cluster]
Exception ignored in: <function RayGPUExecutorAsync.__del__ at 0x7f1e64c405e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 473, in __del__
    if self.forward_dag is not None:
AttributeError: 'RayGPUExecutorAsync' object has no attribute 'forward_dag'

Step 4): In order to debug and isolate this problem, based on the docs at https://docs.vllm.ai/en/latest/getting_started/debugging.html, I am trying to run the test.py script on both nodes at the same time:

import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
data = torch.FloatTensor([1,] * 128).to(f"cuda:{local_rank}")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"

print(f"NCCL is good!")

gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
cpu_data = torch.FloatTensor([1,] * 128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
value = cpu_data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("sanity check is successful!")

On the head node I am running it like: NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1 test.py, and on the worker node I am running it like: NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=192.168.201.210 test.py. The NCCL part works fine, as I am seeing the NCCL is good! printed in the console 8 times (once for each rank on the node). Unfortunately, the GLOO part fails with the same error message ([rank7]: RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error) as above in the vllm serve attempt.

Step 5): Following the debugging tips from the docs page, based on the output of ip addr show

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp51s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether f0:b2:b9:11:27:72 brd ff:ff:ff:ff:ff:ff
3: enp51s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether f0:b2:b9:11:27:73 brd ff:ff:ff:ff:ff:ff
    inet 192.168.201.210/24 metric 100 brd 192.168.201.255 scope global dynamic enp51s0f1
       valid_lft 69422sec preferred_lft 69422sec
    inet6 fe80::f2b2:b9ff:fe11:2773/64 scope link
       valid_lft forever preferred_lft forever
4: usb0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether ba:f7:57:6f:6a:d4 brd ff:ff:ff:ff:ff:ff
5: ibs3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:0d:83:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:c7:64:36 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    altname ibp99s0
    inet6 fe80::a288:c203:c7:6436/64 scope link
       valid_lft forever preferred_lft forever
6: ibs2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:c7:5f:fe brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    altname ibp68s0
    inet6 fe80::a288:c203:c7:5ffe/64 scope link
       valid_lft forever preferred_lft forever
7: ibp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:10:49:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:c7:5f:96 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet6 fe80::a288:c203:c7:5f96/64 scope link
       valid_lft forever preferred_lft forever
8: ibp36s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:10:49:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:32:d9:e2 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet6 fe80::a288:c203:32:d9e2/64 scope link
       valid_lft forever preferred_lft forever
9: ibp227s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:10:49:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:bb:68:c4 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet6 fe80::a288:c203:bb:68c4/64 scope link
       valid_lft forever preferred_lft forever
10: ibp196s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:10:49:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:3d:94:d0 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet6 fe80::a288:c203:3d:94d0/64 scope link
       valid_lft forever preferred_lft forever
11: ibp131s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:10:49:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:8c:aa:72 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet6 fe80::a288:c203:8c:aa72/64 scope link
       valid_lft forever preferred_lft forever
12: ibp164s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:10:49:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:5b:53:3c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet6 fe80::a288:c203:5b:533c/64 scope link
       valid_lft forever preferred_lft forever
13: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:75:88:c1:ae brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:75ff:fe88:c1ae/64 scope link
       valid_lft forever preferred_lft forever
78: br-1a4ca947a9cb: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:29:43:42:b5 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 brd 172.18.255.255 scope global br-1a4ca947a9cb
       valid_lft forever preferred_lft forever
    inet6 fe80::42:29ff:fe43:42b5/64 scope link
       valid_lft forever preferred_lft forever

I am setting up GLOO_SOCKET_IFNAME=enp51s0f1 on both servers and rerunning the test.py again. On the master node the command now looks like this: GLOO_SOCKET_IFNAME=enp51s0f1 NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1 test.py, and on the worker node: GLOO_SOCKET_IFNAME=enp51s0f1 NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=192.168.201.210 test.py. This seems to solve the problem and the test.py script runs successfully. The expected sanity check is successful! output is printed 8 times in the console.

Step 6): With the hope that setting GLOO_SOCKET_IFNAME=enp51s0f1 is the solution to the original GLOO issue I had with vllm serve, I went back to rerun the command on the head node with this env variable as well. The command now looks like this: GLOO_SOCKET_IFNAME=enp51s0f1 vllm serve /home/meta-llama/Meta-Llama-3.1-8B-Instruct -tp 8 -pp 2. Unfortunately I am again seeing the same error as without this env variable.

INFO 07-25 07:05:10 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 07-25 07:05:10 api_server.py:220] args: Namespace(model_tag='/home/meta-llama/Meta-Llama-3.1-8B-Instruct', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=2, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7efab8680940>)
INFO 07-25 07:05:10 config.py:718] Defaulting to use ray for distributed inference
WARNING 07-25 07:05:10 arg_utils.py:762] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 07-25 07:05:10 config.py:809] Chunked prefill is enabled with max_num_batched_tokens=512.
2024-07-25 07:05:10,452 INFO worker.py:1603 -- Connecting to existing Ray cluster at address: 192.168.201.210:6379...
2024-07-25 07:05:10,461 INFO worker.py:1788 -- Connected to Ray cluster.
INFO 07-25 07:05:10 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/home/meta-llama/Meta-Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='/home/meta-llama/Meta-Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/meta-llama/Meta-Llama-3.1-8B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
[rank0]:[E ProcessGroupGloo.cpp:144] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
ERROR 07-25 07:05:40 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 07-25 07:05:40 worker_base.py:382] Traceback (most recent call last):
ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
ERROR 07-25 07:05:40 worker_base.py:382]     return executor(*args, **kwargs)
ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
ERROR 07-25 07:05:40 worker_base.py:382]     init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 343, in init_worker_distributed_environment
ERROR 07-25 07:05:40 worker_base.py:382]     init_distributed_environment(parallel_config.world_size, rank,
ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 830, in init_distributed_environment
ERROR 07-25 07:05:40 worker_base.py:382]     _WORLD = init_world_group(ranks, local_rank, backend)
ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 714, in init_world_group
ERROR 07-25 07:05:40 worker_base.py:382]     return GroupCoordinator(
ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 149, in __init__
ERROR 07-25 07:05:40 worker_base.py:382]     cpu_group = torch.distributed.new_group(ranks, backend="gloo")
ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
ERROR 07-25 07:05:40 worker_base.py:382]     func_return = func(*args, **kwargs)
ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
ERROR 07-25 07:05:40 worker_base.py:382]     return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag
ERROR 07-25 07:05:40 worker_base.py:382]     pg, pg_store = _new_process_group_helper(
ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
ERROR 07-25 07:05:40 worker_base.py:382]     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
ERROR 07-25 07:05:40 worker_base.py:382] RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[rank0]: (RayWorkerWrapper pid=24271) [rank3]:[E ProcessGroupGloo.cpp:144] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/bin/vllm", line 8, in <module>
[rank0]:     sys.exit(main())
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 148, in main
[rank0]:     args.dispatch_function(args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 28, in serve
[rank0]:     run_server(args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 406, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 61, in _init_executor
[rank0]:     self._init_workers_ray(placement_group)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 233, in _init_workers_ray
[rank0]:     self._run_workers("init_device")
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 350, in _run_workers
[rank0]:     self.driver_worker.execute_method(method, *driver_args,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 383, in execute_method
[rank0]:     raise e
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
[rank0]:     return executor(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
[rank0]:     init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 343, in init_worker_distributed_environment
[rank0]:     init_distributed_environment(parallel_config.world_size, rank,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 830, in init_distributed_environment
[rank0]:     _WORLD = init_world_group(ranks, local_rank, backend)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 714, in init_world_group
[rank0]:     return GroupCoordinator(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 149, in __init__
[rank0]:     cpu_group = torch.distributed.new_group(ranks, backend="gloo")
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
[rank0]:     func_return = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
[rank0]:     return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag
[rank0]:     pg, pg_store = _new_process_group_helper(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
[rank0]:     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank0]: RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382] Traceback (most recent call last):
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]     init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 343, in init_worker_distributed_environment
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]     init_distributed_environment(parallel_config.world_size, rank,
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 830, in init_distributed_environment
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]     _WORLD = init_world_group(ranks, local_rank, backend)
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 714, in init_world_group
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]     return GroupCoordinator(
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 149, in __init__
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]     cpu_group = torch.distributed.new_group(ranks, backend="gloo")
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]     func_return = func(*args, **kwargs)
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]     return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]     pg, pg_store = _new_process_group_helper(
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382]     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
(RayWorkerWrapper pid=24271) ERROR 07-25 07:05:40 worker_base.py:382] RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382] Traceback (most recent call last):
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]     init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 343, in init_worker_distributed_environment
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]     init_distributed_environment(parallel_config.world_size, rank,
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 830, in init_distributed_environment
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]     _WORLD = init_world_group(ranks, local_rank, backend)
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 714, in init_world_group
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]     return GroupCoordinator(
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 149, in __init__
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]     cpu_group = torch.distributed.new_group(ranks, backend="gloo")
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]     func_return = func(*args, **kwargs)
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]     return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]     pg, pg_store = _new_process_group_helper(
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382]     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
(RayWorkerWrapper pid=24884) ERROR 07-25 07:05:40 worker_base.py:382] RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
(RayWorkerWrapper pid=24884) [rank7]:[E ProcessGroupGloo.cpp:144] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
Exception ignored in: <function RayGPUExecutorAsync.__del__ at 0x7efab8585870>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 473, in __del__
    if self.forward_dag is not None:
AttributeError: 'RayGPUExecutorAsync' object has no attribute 'forward_dag'

Step 7): In a desperate attempt to figure out what is going on, I have tried to apply all suggestions from the debugging docs page and the command I ran looks like this: GLOO_SOCKET_IFNAME=enp51s0f1 VLLM_LOGGING_LEVEL=DEBUG CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=TRACE VLLM_TRACE_FUNCTION=1 vllm serve /home/meta-llama/Meta-Llama-3.1-8B-Instruct -tp 8 -pp 2. Unfortunately, I am not seeing anything different in the output compared to the previous runs.

Sorry for the very long post, I have tried to provide as much information as possible. Any suggestions on what to try next are appreciated. I assume that there is something problematic on the vllm-GLOO relation, given that the test.py example worked just fine for both, NCCL and GLOO.

vllm-project / vllm

[Bug]: Reproducing Llama 3.1 distributed inference from the blog #6775

Your current environment

🐛 Describe the bug