vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: 'invalid argument' Error with custom_all_reduce when doing tensor parallelism #9046

Open Luosuu opened 1 month ago

Luosuu commented 1 month ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
PyTorch version: 2.6.0.dev20240930+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Rocky Linux 8.9 (Green Obsidian) (x86_64)
GCC version: (GCC) 13.3.0
Clang version: Could not collect
CMake version: version 3.30.4
Libc version: glibc-2.28

Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-513.18.1.el8_9.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB

Nvidia driver version: 550.54.14
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  1
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        8
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7742 64-Core Processor
Stepping:            0
CPU MHz:             2250.000
CPU max MHz:         2250.0000
CPU min MHz:         1500.0000
BogoMIPS:            4491.68
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31
NUMA node2 CPU(s):   32-47
NUMA node3 CPU(s):   48-63
NUMA node4 CPU(s):   64-79
NUMA node5 CPU(s):   80-95
NUMA node6 CPU(s):   96-111
NUMA node7 CPU(s):   112-127
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.0
[pip3] torch==2.6.0.dev20240930+cu124
[pip3] torchaudio==2.5.0.dev20240930+cu124
[pip3] torchvision==0.20.0.dev20240930+cu124
[pip3] transformers==4.45.1
[pip3] triton==3.0.0
[conda] numpy                    1.26.4                   pypi_0  pypi
[conda] nvidia-cublas-cu12       12.4.5.8                 pypi_0  pypi
[conda] nvidia-cuda-cupti-cu12   12.4.127                 pypi_0  pypi
[conda] nvidia-cuda-nvrtc-cu12   12.4.127                 pypi_0  pypi
[conda] nvidia-cuda-runtime-cu12 12.4.127                 pypi_0  pypi
[conda] nvidia-cudnn-cu12        9.1.0.70                 pypi_0  pypi
[conda] nvidia-cufft-cu12        11.2.1.3                 pypi_0  pypi
[conda] nvidia-curand-cu12       10.3.5.147               pypi_0  pypi
[conda] nvidia-cusolver-cu12     11.6.1.9                 pypi_0  pypi
[conda] nvidia-cusparse-cu12     12.3.1.170               pypi_0  pypi
[conda] nvidia-ml-py             12.560.30                pypi_0  pypi
[conda] nvidia-nccl-cu12         2.21.5                   pypi_0  pypi
[conda] nvidia-nvjitlink-cu12    12.4.127                 pypi_0  pypi
[conda] nvidia-nvtx-cu12         12.4.127                 pypi_0  pypi
[conda] pyzmq                    26.2.0                   pypi_0  pypi
[conda] torch                    2.6.0.dev20240930+cu124  pypi_0  pypi
[conda] torchaudio               2.5.0.dev20240930+cu124  pypi_0  pypi
[conda] torchvision              0.20.0.dev20240930+cu124 pypi_0  pypi
[conda] transformers             4.45.1                   pypi_0  pypi
[conda] triton                   3.0.0                    pypi_0  pypi

ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.dev51+g1cabfcef.d20241003
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
       GPU0 GPU1 GPU2 GPU3 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0   X    NV12 NV12 NV12 PXB  PXB  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS   SYS                3             N/A
GPU1   NV12 X    NV12 NV12 PXB  PXB  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS   SYS                3             N/A
GPU2   NV12 NV12 X    NV12 SYS  SYS  PXB  PXB  SYS  SYS  SYS  SYS  SYS  SYS  SYS   SYS   16-31        1             N/A
GPU3   NV12 NV12 NV12 X    SYS  SYS  PXB  PXB  SYS  SYS  SYS  SYS  SYS  SYS  SYS   SYS   16-31        1             N/A
NIC0   PXB  PXB  SYS  SYS  X    PXB  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS   SYS
NIC1   PXB  PXB  SYS  SYS  PXB  X    SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS   SYS
NIC2   SYS  SYS  PXB  PXB  SYS  SYS  X    PXB  SYS  SYS  SYS  SYS  SYS  SYS  SYS   SYS
NIC3   SYS  SYS  PXB  PXB  SYS  SYS  PXB  X    SYS  SYS  SYS  SYS  SYS  SYS  SYS   SYS
NIC4   SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  X    PIX  SYS  SYS  SYS  SYS  SYS   SYS
NIC5   SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  PIX  X    SYS  SYS  SYS  SYS  SYS   SYS
NIC6   SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  X    PXB  SYS  SYS  SYS   SYS
NIC7   SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  PXB  X    SYS  SYS  SYS   SYS
NIC8   SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  X    PXB  SYS   SYS
NIC9   SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  PXB  X    SYS   SYS
NIC10  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  X     PIX
NIC11  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  PIX   X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11
```

Model Input Dumps

No response

🐛 Describe the bug

I built vLLM from source with nightly PyTorch, as documented here.

Then I run:

NCCL_DEBUG=TRACE python3 benchmarks/benchmark_latency.py -tp 4
Namespace(model='facebook/opt-125m', speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, tokenizer=None, quantization=None, tensor_parallel_size=4, input_len=32, output_len=128, batch_size=8, n=1, use_beam_search=False, num_iters_warmup=10, num_iters=30, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, profile=False, profile_result_dir=None, device='auto', block_size=16, enable_chunked_prefill=False, enable_prefix_caching=False, use_v2_block_manager=False, ray_workers_use_nsight=False, download_dir=None, output_json=None, gpu_memory_utilization=0.9, load_format='auto', distributed_executor_backend=None, otlp_traces_endpoint=None)
INFO 10-03 11:23:18 config.py:899] Defaulting to use mp for distributed inference
INFO 10-03 11:23:18 llm_engine.py:234] Initializing an LLM engine (v0.6.3.dev51+g1cabfcef.d20241003) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
/project/bi_dsc_large/fad3ew/.conda/vllm_profile/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
WARNING 10-03 11:23:18 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 10-03 11:23:18 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=652775) INFO 10-03 11:23:19 multiproc_worker_utils.py:216] Worker ready; awaiting tasks
(VllmWorkerProcess pid=652774) INFO 10-03 11:23:19 multiproc_worker_utils.py:216] Worker ready; awaiting tasks
(VllmWorkerProcess pid=652776) INFO 10-03 11:23:19 multiproc_worker_utils.py:216] Worker ready; awaiting tasks
INFO 10-03 11:23:21 utils.py:996] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=652774) INFO 10-03 11:23:21 utils.py:996] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=652775) INFO 10-03 11:23:21 utils.py:996] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=652774) INFO 10-03 11:23:21 pynccl.py:63] vLLM is using nccl==2.21.5
INFO 10-03 11:23:21 pynccl.py:63] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=652775) INFO 10-03 11:23:21 pynccl.py:63] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=652776) INFO 10-03 11:23:21 utils.py:996] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=652776) INFO 10-03 11:23:21 pynccl.py:63] vLLM is using nccl==2.21.5
udc-an26-1:652693:652693 [0] NCCL INFO Bootstrap : Using ib4:10.155.48.47<0>
udc-an26-1:652693:652693 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
udc-an26-1:652693:652693 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
udc-an26-1:652693:652693 [0] NCCL INFO NET/Plugin: Using internal network plugin.
udc-an26-1:652693:652693 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
udc-an26-1:652775:652775 [2] NCCL INFO cudaDriverVersion 12040
udc-an26-1:652775:652775 [2] NCCL INFO Bootstrap : Using ib4:10.155.48.47<0>
udc-an26-1:652775:652775 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
udc-an26-1:652775:652775 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
udc-an26-1:652775:652775 [2] NCCL INFO NET/Plugin: Using internal network plugin.
udc-an26-1:652774:652774 [1] NCCL INFO cudaDriverVersion 12040
udc-an26-1:652776:652776 [3] NCCL INFO cudaDriverVersion 12040
udc-an26-1:652774:652774 [1] NCCL INFO Bootstrap : Using ib4:10.155.48.47<0>
udc-an26-1:652776:652776 [3] NCCL INFO Bootstrap : Using ib4:10.155.48.47<0>
udc-an26-1:652774:652774 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
udc-an26-1:652776:652776 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
udc-an26-1:652776:652776 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
udc-an26-1:652774:652774 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
udc-an26-1:652776:652776 [3] NCCL INFO NET/Plugin: Using internal network plugin.
udc-an26-1:652774:652774 [1] NCCL INFO NET/Plugin: Using internal network plugin.
udc-an26-1:652693:652693 [0] NCCL INFO NET/IB : Using [0]mlx5_4:1/IB [RO]; OOB ib4:10.155.48.47<0>
udc-an26-1:652693:652693 [0] NCCL INFO Using non-device net plugin version 0
udc-an26-1:652693:652693 [0] NCCL INFO Using network IB
udc-an26-1:652775:652775 [2] NCCL INFO NET/IB : Using [0]mlx5_4:1/IB [RO]; OOB ib4:10.155.48.47<0>
udc-an26-1:652775:652775 [2] NCCL INFO Using non-device net plugin version 0
udc-an26-1:652775:652775 [2] NCCL INFO Using network IB
udc-an26-1:652774:652774 [1] NCCL INFO NET/IB : Using [0]mlx5_4:1/IB [RO]; OOB ib4:10.155.48.47<0>
udc-an26-1:652774:652774 [1] NCCL INFO Using non-device net plugin version 0
udc-an26-1:652774:652774 [1] NCCL INFO Using network IB
udc-an26-1:652776:652776 [3] NCCL INFO NET/IB : Using [0]mlx5_4:1/IB [RO]; OOB ib4:10.155.48.47<0>
udc-an26-1:652776:652776 [3] NCCL INFO Using non-device net plugin version 0
udc-an26-1:652776:652776 [3] NCCL INFO Using network IB
udc-an26-1:652776:652776 [3] NCCL INFO ncclCommInitRank comm 0xb1b5970 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 4e000 commId 0x948b8f37f33d2b9a - Init START
udc-an26-1:652693:652693 [0] NCCL INFO ncclCommInitRank comm 0xb1b6a80 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 7000 commId 0x948b8f37f33d2b9a - Init START
udc-an26-1:652774:652774 [1] NCCL INFO ncclCommInitRank comm 0xb1b3b20 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId f000 commId 0x948b8f37f33d2b9a - Init START
udc-an26-1:652775:652775 [2] NCCL INFO ncclCommInitRank comm 0xb1b4fa0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 47000 commId 0x948b8f37f33d2b9a - Init START
udc-an26-1:652693:652693 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
udc-an26-1:652775:652775 [2] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
udc-an26-1:652776:652776 [3] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
udc-an26-1:652774:652774 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
udc-an26-1:652693:652693 [0] NCCL INFO NVLS multicast support is not available on dev 0
udc-an26-1:652775:652775 [2] NCCL INFO Setting affinity for GPU 2 to ffff0000
udc-an26-1:652775:652775 [2] NCCL INFO NVLS multicast support is not available on dev 2
udc-an26-1:652776:652776 [3] NCCL INFO Setting affinity for GPU 3 to ffff0000
udc-an26-1:652776:652776 [3] NCCL INFO NVLS multicast support is not available on dev 3
udc-an26-1:652774:652774 [1] NCCL INFO NVLS multicast support is not available on dev 1
udc-an26-1:652693:652693 [0] NCCL INFO comm 0xb1b6a80 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
udc-an26-1:652776:652776 [3] NCCL INFO comm 0xb1b5970 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
udc-an26-1:652775:652775 [2] NCCL INFO comm 0xb1b4fa0 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
udc-an26-1:652774:652774 [1] NCCL INFO comm 0xb1b3b20 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
udc-an26-1:652693:652693 [0] NCCL INFO Channel 00/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 01/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 02/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 03/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 04/24 :    0   1   2   3
udc-an26-1:652776:652776 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2 [4] -1/-1/-1->3->2 [5] -1/-1/-1->3->2 [6] -1/-1/-1->3->2 [7] -1/-1/-1->3->2 [8] -1/-1/-1->3->2 [9] -1/-1/-1->3->2 [10] -1/-1/-1->3->2 [11] -1/-1/-1->3->2 [12] -1/-1/-1->3->2 [13] -1/-1/-1->3->2 [14] -1/-1/-1->3->2 [15] -1/-1/-1->3->2 [16] -1/-1/-1->3->2 [17] -1/-1/-1->3->2 [18] -1/-1/-1->3->2 [19] -1/-1/-1->3->2 [20] -1/-1/-1->3->2 [21] -1/-1/-1->3->2 [22] -1/-1/-1->3->2 [23] -1/-1/-1->3->2
udc-an26-1:652693:652693 [0] NCCL INFO Channel 05/24 :    0   1   2   3
udc-an26-1:652775:652775 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
udc-an26-1:652776:652776 [3] NCCL INFO P2P Chunksize set to 524288
udc-an26-1:652693:652693 [0] NCCL INFO Channel 06/24 :    0   1   2   3
udc-an26-1:652774:652774 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
udc-an26-1:652775:652775 [2] NCCL INFO P2P Chunksize set to 524288
udc-an26-1:652693:652693 [0] NCCL INFO Channel 07/24 :    0   1   2   3
udc-an26-1:652774:652774 [1] NCCL INFO P2P Chunksize set to 524288
udc-an26-1:652693:652693 [0] NCCL INFO Channel 08/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 09/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 10/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 11/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 12/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 13/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 14/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 15/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 16/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 17/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 18/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 19/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 20/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 21/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 22/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Channel 23/24 :    0   1   2   3
udc-an26-1:652693:652693 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
udc-an26-1:652693:652693 [0] NCCL INFO P2P Chunksize set to 524288
udc-an26-1:652693:652693 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 02/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 03/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 04/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 05/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 06/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 07/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 08/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 09/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 10/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 11/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 12/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 13/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 14/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 15/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 16/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 17/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 18/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 19/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 20/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 21/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 22/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 23/0 : 3[3] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Connected all rings
udc-an26-1:652693:652693 [0] NCCL INFO Connected all rings
udc-an26-1:652775:652775 [2] NCCL INFO Connected all rings
udc-an26-1:652776:652776 [3] NCCL INFO Connected all rings
udc-an26-1:652776:652776 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 16/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 17/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 18/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 19/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 20/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 21/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 22/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652776:652776 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652774:652774 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 20/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 22/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652775:652775 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/IPC/read
udc-an26-1:652693:652693 [0] NCCL INFO Connected all trees
udc-an26-1:652693:652693 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
udc-an26-1:652693:652693 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
udc-an26-1:652774:652774 [1] NCCL INFO Connected all trees
udc-an26-1:652774:652774 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
udc-an26-1:652774:652774 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
udc-an26-1:652775:652775 [2] NCCL INFO Connected all trees
udc-an26-1:652775:652775 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
udc-an26-1:652775:652775 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
udc-an26-1:652776:652776 [3] NCCL INFO Connected all trees
udc-an26-1:652776:652776 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
udc-an26-1:652776:652776 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
udc-an26-1:652775:652775 [2] NCCL INFO TUNER/Plugin: Plugin load returned 11 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
udc-an26-1:652693:652693 [0] NCCL INFO TUNER/Plugin: Plugin load returned 11 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
udc-an26-1:652775:652775 [2] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
udc-an26-1:652776:652776 [3] NCCL INFO TUNER/Plugin: Plugin load returned 11 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
udc-an26-1:652693:652693 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
udc-an26-1:652775:652775 [2] NCCL INFO ncclCommInitRank comm 0xb1b4fa0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 47000 commId 0x948b8f37f33d2b9a - Init COMPLETE
udc-an26-1:652776:652776 [3] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
udc-an26-1:652693:652693 [0] NCCL INFO ncclCommInitRank comm 0xb1b6a80 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 7000 commId 0x948b8f37f33d2b9a - Init COMPLETE
udc-an26-1:652776:652776 [3] NCCL INFO ncclCommInitRank comm 0xb1b5970 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 4e000 commId 0x948b8f37f33d2b9a - Init COMPLETE
udc-an26-1:652774:652774 [1] NCCL INFO TUNER/Plugin: Plugin load returned 11 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
udc-an26-1:652774:652774 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
udc-an26-1:652774:652774 [1] NCCL INFO ncclCommInitRank comm 0xb1b3b20 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId f000 commId 0x948b8f37f33d2b9a - Init COMPLETE
INFO 10-03 11:23:22 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/fad3ew/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorkerProcess pid=652775) INFO 10-03 11:23:22 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/fad3ew/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorkerProcess pid=652774) INFO 10-03 11:23:22 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/fad3ew/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorkerProcess pid=652776) INFO 10-03 11:23:22 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/fad3ew/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
Failed: Cuda error /sfs/gpfs/tardis/project/bi_dsc_large/fad3ew/vllm/csrc/custom_all_reduce.cuh:336 'invalid argument'
Failed: Cuda error /sfs/gpfs/tardis/project/bi_dsc_large/fad3ew/vllm/csrc/custom_all_reduce.cuh:336 'invalid argument'
Failed: Cuda error /sfs/gpfs/tardis/project/bi_dsc_large/fad3ew/vllm/csrc/custom_all_reduce.cuh:336 'invalid argument'
Failed: Cuda error /sfs/gpfs/tardis/project/bi_dsc_large/fad3ew/vllm/csrc/custom_all_reduce.cuh:336 'invalid argument'
[rank2]:[W1003 11:23:22.111490980 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank0]:[W1003 11:23:22.111538020 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank3]:[W1003 11:23:22.112150538 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank1]:[W1003 11:23:22.112150368 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
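One debugging step that may be worth trying here (a suggestion on my part, not a confirmed fix): the `'invalid argument'` error comes from the custom all-reduce kernel, which relies on the cached GPU P2P probe results shown in the log above, so clearing that cache forces vLLM to re-detect peer access on the next start:

```shell
# Remove vLLM's cached GPU P2P probe results (path taken from the log above)
# so peer-to-peer access is re-probed on the next run.
rm -f ~/.cache/vllm/gpu_p2p_access_cache_for_*.json
```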

and I can successfully run the vLLM NCCL sanity check (test_nccl.py) linked here:

torchrun --nproc-per-node=4 test_nccl.py
W1003 12:02:34.603000 663602 /sfs/gpfs/tardis/project/bi_dsc_large/fad3ew/.conda/vllm_profile/lib/python3.10/site-packages/torch/distributed/run.py:793] 
W1003 12:02:34.603000 663602 /sfs/gpfs/tardis/project/bi_dsc_large/fad3ew/.conda/vllm_profile/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W1003 12:02:34.603000 663602 /sfs/gpfs/tardis/project/bi_dsc_large/fad3ew/.conda/vllm_profile/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1003 12:02:34.603000 663602 /sfs/gpfs/tardis/project/bi_dsc_large/fad3ew/.conda/vllm_profile/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
PyTorch NCCL is successful!
PyTorch NCCL is successful!
PyTorch NCCL is successful!
PyTorch NCCL is successful!
PyTorch GLOO is successful!
PyTorch GLOO is successful!
PyTorch GLOO is successful!
PyTorch GLOO is successful!

INFO 10-03 12:02:50 utils.py:996] Found nccl from library libnccl.so.2
INFO 10-03 12:02:50 utils.py:996] Found nccl from library libnccl.so.2
INFO 10-03 12:02:50 utils.py:996] Found nccl from library libnccl.so.2
INFO 10-03 12:02:50 utils.py:996] Found nccl from library libnccl.so.2
INFO 10-03 12:02:50 pynccl.py:63] vLLM is using nccl==2.21.5
INFO 10-03 12:02:50 pynccl.py:63] vLLM is using nccl==2.21.5
INFO 10-03 12:02:50 pynccl.py:63] vLLM is using nccl==2.21.5
INFO 10-03 12:02:50 pynccl.py:63] vLLM is using nccl==2.21.5
vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!

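Since plain NCCL works, a common workaround (assuming the failure really is specific to the custom kernel; this is a suggestion, not a confirmed fix) is to disable vLLM's custom all-reduce so tensor parallelism falls back to NCCL. A sketch of the server invocation, with `<model>` standing in for your model name:

```shell
# Fall back to NCCL for the all-reduce collective; the equivalent Python
# option is disable_custom_all_reduce=True on LLM()/EngineArgs.
vllm serve <model> --tensor-parallel-size 4 --disable-custom-all-reduce
```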

yichiche commented 2 weeks ago

I encountered a similar problem and worked around it by downgrading Torch from 2.5 to 2.4 in pyproject.toml and requirements-cuda.txt.
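As a sketch, the version pin described above would look roughly like this in requirements-cuda.txt (the exact pin syntax in the real files may differ):

```
torch==2.4.0
```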