vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Multi GPU setup for VLLM in Openshift still does not work #5360

Open jayteaftw opened 4 weeks ago

jayteaftw commented 4 weeks ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.14.0-284.66.1.el9_2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA L40S
GPU 1: NVIDIA L40S
GPU 2: NVIDIA L40S
GPU 3: NVIDIA L40S

Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             128
On-line CPU(s) list:                0-127
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 9334 32-Core Processor
CPU family:                         25
Model:                              17
Thread(s) per core:                 2
Core(s) per socket:                 32
Socket(s):                          2
Stepping:                           1
Frequency boost:                    enabled
CPU max MHz:                        3910.2529
CPU min MHz:                        1500.0000
BogoMIPS:                           5400.11
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                     AMD-V
L1d cache:                          2 MiB (64 instances)
L1i cache:                          2 MiB (64 instances)
L2 cache:                           64 MiB (64 instances)
L3 cache:                           256 MiB (8 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-31,64-95
NUMA node1 CPU(s):                  32-63,96-127
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] transformers==4.41.2
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU1    SYS      X      SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU2    SYS     SYS      X      SYS     SYS     SYS     32-63,96-127    1               N/A
GPU3    SYS     SYS     SYS      X      SYS     SYS     32-63,96-127    1               N/A
NIC0    SYS     SYS     SYS     SYS      X      PIX
NIC1    SYS     SYS     SYS     SYS     PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

🐛 Describe the bug

Reposting #4462 as it is still an ongoing issue.

vLLM is, at best, inconsistent about whether it can start a multi-GPU instance within an OpenShift/k8s environment, and 99% of the time it fails to start. Ideally, if the NVIDIA operator is installed and working correctly, then when the GPUs are passed into the pod, vLLM should be able to identify the given GPUs and start; however, it tends to freeze up on startup at this point:

INFO 06-09 00:02:22 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=mistralai/Mixtral-8x7B-Instruct-v0.1)
(VllmWorkerProcess pid=76) INFO 06-09 00:02:25 multiproc_worker_utils.py:214] Worker ready; awaiting tasks
(VllmWorkerProcess pid=77) INFO 06-09 00:02:25 multiproc_worker_utils.py:214] Worker ready; awaiting tasks
(VllmWorkerProcess pid=75) INFO 06-09 00:02:25 multiproc_worker_utils.py:214] Worker ready; awaiting tasks

Here is the YAML file:

apiVersion: apps/v1 
kind: Deployment
metadata:
  name: mixtral-8x7b-instruct-tgi-deploy
  labels:
    app: mixtral-8x7b-instruct-tgi-deploy
spec:
  replicas: 1
  revisionHistoryLimit: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: mixtral-8x7b-instruct-tgi-pod
  template:
    metadata:
      labels:
        app: mixtral-8x7b-instruct-tgi-pod
    spec:
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: hub-pv-filesystem
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "15Gi"
      containers:
      - name: mixtral-8x7b-instruct-tgi-pod
        image: vllm/vllm-openai:v0.4.3
        args: ["--model mistralai/Mixtral-8x7B-Instruct-v0.1 --gpu-memory-utilization 0.95 --tensor-parallel-size 4 --distributed-executor-backend mp"]
        ports:
        - containerPort: 8000

        readinessProbe:
          tcpSocket:
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 30
        resources:
          limits:
            nvidia.com/gpu: "4"
          requests:
            cpu: 4
            memory: 8Gi
            nvidia.com/gpu: "4"
        volumeMounts:
        - mountPath: /root/.cache/huggingface/hub
          name: model
        - name: dshm
          mountPath: /dev/shm
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          value: xxxxxx

Compared to the previous post, I switched the distributed-executor-backend to mp and it still has the same problem. I am running on L40S GPUs.
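
Worth noting about the manifest above (it may or may not be related to the hang): Kubernetes passes each entry of `args` as a single argv element, so the one space-separated string above reaches the vLLM entrypoint as a single argument. Splitting the flags into separate list items looks roughly like this (a sketch of the same flags, nothing else changed):

```yaml
# sketch: same container args, one list item per argv element
args:
  - "--model"
  - "mistralai/Mixtral-8x7B-Instruct-v0.1"
  - "--gpu-memory-utilization"
  - "0.95"
  - "--tensor-parallel-size"
  - "4"
  - "--distributed-executor-backend"
  - "mp"
```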

thobicex commented 4 weeks ago

Did you go through any trial this year?

Yes / No

youkaichao commented 4 weeks ago

Quote from issue templates:

Please set the environment variable export VLLM_LOGGING_LEVEL=DEBUG to turn on more logging to help debugging potential issues.

If you experienced crashes or hangs, it would be helpful to run vllm with export VLLM_TRACE_FUNCTION=1 . All the function calls in vllm will be recorded. Inspect these log files, and tell which function crashes or hangs.
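
Since the server in this report runs inside a Deployment rather than a shell, these would typically be set through the container's `env` block; a minimal sketch against the manifest above (variable names and values are from this thread, the placement is an assumption):

```yaml
# appended to the existing env: block of the container spec
- name: VLLM_LOGGING_LEVEL
  value: "DEBUG"
- name: VLLM_TRACE_FUNCTION
  value: "1"
```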

thobicex commented 4 weeks ago

Okay, I'll look into it by next week.

This storm will come to an end once I get my work address fixed on my profile.

jayteaftw commented 4 weeks ago

Quote from issue templates:

Please set the environment variable export VLLM_LOGGING_LEVEL=DEBUG to turn on more logging to help debugging potential issues. If you experienced crashes or hangs, it would be helpful to run vllm with export VLLM_TRACE_FUNCTION=1 . All the function calls in vllm will be recorded. Inspect these log files, and tell which function crashes or hangs.

Thanks, I updated the env variables.

I see 4 different log files:

VLLM_TRACE_FUNCTION_for_process_1_thread_140370429153728_at_2024-06-09_00:57:22.494711.log
VLLM_TRACE_FUNCTION_for_process_74_thread_140356664910272_at_2024-06-09_00:57:24.580614.log
VLLM_TRACE_FUNCTION_for_process_75_thread_139899367563712_at_2024-06-09_00:57:24.545374.log
VLLM_TRACE_FUNCTION_for_process_76_thread_140102125220288_at_2024-06-09_00:57:24.507486.log

This is the output towards the end of VLLM_TRACE_FUNCTION_for_process_76_thread_140102125220288_at_2024-06-09_00:57:24.507486.log:

2024-06-09 00:57:25.047238 Return from is_initialized in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:975 to _get_default_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1007
2024-06-09 00:57:25.047261 Call to WORLD in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:588 from _get_default_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1012
2024-06-09 00:57:25.047279 Call to default_pg in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:460 from WORLD in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:590
2024-06-09 00:57:25.047296 Return from default_pg in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:468 to WORLD in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:590
2024-06-09 00:57:25.047315 Return from WORLD in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:590 to _get_default_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1012
2024-06-09 00:57:25.047332 Call to not_none in /usr/local/lib/python3.10/dist-packages/torch/utils/_typing_utils.py:10 from _get_default_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1012
2024-06-09 00:57:25.047350 Return from not_none in /usr/local/lib/python3.10/dist-packages/torch/utils/_typing_utils.py:13 to _get_default_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1012
2024-06-09 00:57:25.047367 Return from _get_default_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1012 to get_rank in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1746
2024-06-09 00:57:25.047385 Return from get_rank in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1748 to _get_msg_dict in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:56
2024-06-09 00:57:25.047412 Call to version in /usr/local/lib/python3.10/dist-packages/torch/cuda/nccl.py:34 from _get_msg_dict in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:59
2024-06-09 00:57:25.047440 Return from version in /usr/local/lib/python3.10/dist-packages/torch/cuda/nccl.py:41 to _get_msg_dict in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:59
2024-06-09 00:57:25.047460 Call to <genexpr> in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60 from _get_msg_dict in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60
2024-06-09 00:57:25.047478 Return from <genexpr> in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60 to _get_msg_dict in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60
2024-06-09 00:57:25.047495 Call to <genexpr> in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60 from _get_msg_dict in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60
2024-06-09 00:57:25.047512 Return from <genexpr> in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60 to _get_msg_dict in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60
2024-06-09 00:57:25.047529 Call to <genexpr> in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60 from _get_msg_dict in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60
2024-06-09 00:57:25.047546 Return from <genexpr> in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60 to _get_msg_dict in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60
2024-06-09 00:57:25.047564 Call to <genexpr> in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60 from _get_msg_dict in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60
2024-06-09 00:57:25.047582 Return from <genexpr> in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60 to _get_msg_dict in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:60
2024-06-09 00:57:25.047601 Return from _get_msg_dict in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:66 to wrapper in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:92
2024-06-09 00:57:25.047719 Return from wrapper in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:96 to init_distributed_environment in /usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py:104
2024-06-09 00:57:25.047884 Call to is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:105 from init_distributed_environment in /usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py:120
2024-06-09 00:57:25.047911 Call to _is_compiled in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:96 from is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:107
2024-06-09 00:57:25.047931 Return from _is_compiled in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:98 to is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:107
2024-06-09 00:57:25.047950 Call to _nvml_based_avail in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:101 from is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:109
2024-06-09 00:57:25.047974 Return from _nvml_based_avail in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:102 to is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:109
2024-06-09 00:57:25.047993 Return from is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:118 to init_distributed_environment in /usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py:120
2024-06-09 00:57:25.048472 Call to wrapper in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:72 from init_distributed_environment in /usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py:122
2024-06-09 00:57:25.048510 Call to all_reduce in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2146 from wrapper in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:75
2024-06-09 00:57:25.048537 Call to _check_single_tensor in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:860 from all_reduce in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2195
2024-06-09 00:57:25.048558 Return from _check_single_tensor in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:862 to all_reduce in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2195
2024-06-09 00:57:25.048577 Call to _rank_not_in_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:752 from all_reduce in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2196
2024-06-09 00:57:25.048595 Return from _rank_not_in_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:755 to all_reduce in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2196
2024-06-09 00:57:25.048704 Call to _get_default_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1005 from all_reduce in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2208
2024-06-09 00:57:25.048723 Call to is_initialized in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:973 from _get_default_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1007
2024-06-09 00:57:25.048742 Call to WORLD in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:588 from is_initialized in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:975
2024-06-09 00:57:25.048762 Call to default_pg in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:460 from WORLD in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:590
2024-06-09 00:57:25.048781 Return from default_pg in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:468 to WORLD in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:590
2024-06-09 00:57:25.048797 Return from WORLD in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:590 to is_initialized in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:975
2024-06-09 00:57:25.048815 Return from is_initialized in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:975 to _get_default_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1007
2024-06-09 00:57:25.048835 Call to WORLD in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:588 from _get_default_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1012
2024-06-09 00:57:25.048852 Call to default_pg in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:460 from WORLD in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:590
2024-06-09 00:57:25.048869 Return from default_pg in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:468 to WORLD in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:590
2024-06-09 00:57:25.048886 Return from WORLD in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:590 to _get_default_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1012
2024-06-09 00:57:25.048903 Call to not_none in /usr/local/lib/python3.10/dist-packages/torch/utils/_typing_utils.py:10 from _get_default_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1012
2024-06-09 00:57:25.048921 Return from not_none in /usr/local/lib/python3.10/dist-packages/torch/utils/_typing_utils.py:13 to _get_default_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1012
2024-06-09 00:57:25.048937 Return from _get_default_group in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1012 to all_reduce in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2208
2024-06-09 00:57:25.048956 Call to pg_coalesce_state in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:543 from all_reduce in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2210
2024-06-09 00:57:25.048975 Return from pg_coalesce_state in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:545 to all_reduce in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2210
2024-06-09 00:57:25.182103 Return from all_reduce in /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:2224 to wrapper in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:75
2024-06-09 00:57:25.182185 Return from wrapper in /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py:75 to init_distributed_environment in /usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py:122
2024-06-09 00:57:25.182215 Call to is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:105 from init_distributed_environment in /usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py:123
2024-06-09 00:57:25.182237 Call to _is_compiled in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:96 from is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:107
2024-06-09 00:57:25.182267 Return from _is_compiled in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:98 to is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:107
2024-06-09 00:57:25.182288 Call to _nvml_based_avail in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:101 from is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:109
2024-06-09 00:57:25.182328 Return from _nvml_based_avail in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:102 to is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:109
2024-06-09 00:57:25.182351 Return from is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:118 to init_distributed_environment in /usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py:123
2024-06-09 00:57:25.182372 Call to synchronize in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:782 from init_distributed_environment in /usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py:124
2024-06-09 00:57:25.182396 Call to _lazy_init in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:263 from synchronize in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:790
2024-06-09 00:57:25.182416 Call to is_initialized in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:216 from _lazy_init in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:265
2024-06-09 00:57:25.182435 Return from is_initialized in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:218 to _lazy_init in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:265
2024-06-09 00:57:25.182454 Return from _lazy_init in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:266 to synchronize in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:790
2024-06-09 00:57:25.182476 Call to __init__ in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:360 from synchronize in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:791
2024-06-09 00:57:25.182498 Call to _get_device_index in /usr/local/lib/python3.10/dist-packages/torch/cuda/_utils.py:9 from __init__ in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:361
2024-06-09 00:57:25.182522 Call to is_scripting in /usr/local/lib/python3.10/dist-packages/torch/_jit_internal.py:1120 from _get_device_index in /usr/local/lib/python3.10/dist-packages/torch/cuda/_utils.py:35
2024-06-09 00:57:25.182540 Return from is_scripting in /usr/local/lib/python3.10/dist-packages/torch/_jit_internal.py:1139 to _get_device_index in /usr/local/lib/python3.10/dist-packages/torch/cuda/_utils.py:35
2024-06-09 00:57:25.182560 Call to _get_device_index in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:759 from _get_device_index in /usr/local/lib/python3.10/dist-packages/torch/cuda/_utils.py:38
2024-06-09 00:57:25.182581 Call to is_scripting in /usr/local/lib/python3.10/dist-packages/torch/_jit_internal.py:1120 from _get_device_index in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:793
2024-06-09 00:57:25.182599 Return from is_scripting in /usr/local/lib/python3.10/dist-packages/torch/_jit_internal.py:1139 to _get_device_index in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:793
2024-06-09 00:57:25.182618 Call to _get_current_device_index in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:733 from _get_device_index in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:796
2024-06-09 00:57:25.182638 Call to _get_device_attr in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:721 from _get_current_device_index in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:735
2024-06-09 00:57:25.182657 Call to _get_available_device_type in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:708 from _get_device_attr in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:722
2024-06-09 00:57:25.182675 Call to is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:105 from _get_available_device_type in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:709
2024-06-09 00:57:25.182692 Call to _is_compiled in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:96 from is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:107
2024-06-09 00:57:25.182710 Return from _is_compiled in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:98 to is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:107
2024-06-09 00:57:25.182728 Call to _nvml_based_avail in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:101 from is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:109
2024-06-09 00:57:25.182751 Return from _nvml_based_avail in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:102 to is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:109
2024-06-09 00:57:25.182771 Return from is_available in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:118 to _get_available_device_type in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:709
2024-06-09 00:57:25.182789 Return from _get_available_device_type in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:710 to _get_device_attr in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:722
2024-06-09 00:57:25.182808 Call to <lambda> in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:735 from _get_device_attr in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:724
2024-06-09 00:57:25.182827 Call to current_device in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:776 from <lambda> in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:735
2024-06-09 00:57:25.182845 Call to _lazy_init in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:263 from current_device in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:778
2024-06-09 00:57:25.182863 Call to is_initialized in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:216 from _lazy_init in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:265
2024-06-09 00:57:25.182881 Return from is_initialized in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:218 to _lazy_init in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:265
2024-06-09 00:57:25.182900 Return from _lazy_init in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:266 to current_device in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:778
2024-06-09 00:57:25.182922 Return from current_device in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:779 to <lambda> in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:735
2024-06-09 00:57:25.182939 Return from <lambda> in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:735 to _get_device_attr in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:724
2024-06-09 00:57:25.182956 Return from _get_device_attr in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:724 to _get_current_device_index in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:735
2024-06-09 00:57:25.182974 Return from _get_current_device_index in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:735 to _get_device_index in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:796
2024-06-09 00:57:25.182991 Return from _get_device_index in /usr/local/lib/python3.10/dist-packages/torch/_utils.py:801 to _get_device_index in /usr/local/lib/python3.10/dist-packages/torch/cuda/_utils.py:38
2024-06-09 00:57:25.183009 Return from _get_device_index in /usr/local/lib/python3.10/dist-packages/torch/cuda/_utils.py:38 to __init__ in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:361
2024-06-09 00:57:25.183030 Return from __init__ in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:362 to synchronize in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:791
2024-06-09 00:57:25.183050 Call to __enter__ in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:364 from synchronize in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:791
2024-06-09 00:57:25.183071 Return from __enter__ in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:365 to synchronize in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:791

I'm unsure where the error would be.

thobicex commented 4 weeks ago

I will figure it out in two business days, thank you for your advice.

I will let you know once I figure it out by Monday.

youkaichao commented 4 weeks ago

Seems to be a PyTorch/CUDA initialization problem. You can try to run:

# test.py
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
data = torch.ByteTensor([1,] * 128).to(f"cuda:{dist.get_rank()}")
dist.all_reduce(data, op=dist.ReduceOp.SUM)

with

export NCCL_DEBUG=TRACE
torchrun --nproc-per-node=4 test.py

to see if it works.
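
If it does work, pinning each rank to its GPU and adding an explicit synchronize plus a per-rank print makes success (or the exact hang point) easier to see; a small optional extension of the script above, under the same torchrun invocation:

```python
# test.py (extended sketch): pin each rank to its GPU and confirm the reduction
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)  # one GPU per rank under torchrun --nproc-per-node=4
data = torch.ByteTensor([1] * 128).to(f"cuda:{rank}")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()  # the vLLM trace above stops inside this same call
print(f"rank {rank}: all_reduce done, data[0]={data[0].item()} (expect {dist.get_world_size()})")
dist.destroy_process_group()
```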

jayteaftw commented 4 weeks ago

Okay, I ran the test.py in the container and got this output:

W0609 01:26:10.258000 140697551208896 torch/distributed/run.py:757] 
W0609 01:26:10.258000 140697551208896 torch/distributed/run.py:757] *****************************************
W0609 01:26:10.258000 140697551208896 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0609 01:26:10.258000 140697551208896 torch/distributed/run.py:757] *****************************************
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:428 [0] NCCL INFO Bootstrap : Using eth0:10.128.0.68<0>
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:428 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:428 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.20.5+cuda12.4
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:431 [3] NCCL INFO cudaDriverVersion 12040
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:431 [3] NCCL INFO Bootstrap : Using eth0:10.128.0.68<0>
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:431 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:430 [2] NCCL INFO cudaDriverVersion 12040
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:430 [2] NCCL INFO Bootstrap : Using eth0:10.128.0.68<0>
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:430 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:429 [1] NCCL INFO cudaDriverVersion 12040
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:429 [1] NCCL INFO Bootstrap : Using eth0:10.128.0.68<0>
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:429 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO Failed to open libibverbs.so[.1]
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO NET/Socket : Using [0]eth0:10.128.0.68<0>
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO Using non-device net plugin version 0
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO Using network Socket
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO Failed to open libibverbs.so[.1]
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO NET/Socket : Using [0]eth0:10.128.0.68<0>
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO Using non-device net plugin version 0
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO Using network Socket
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO Failed to open libibverbs.so[.1]
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO NET/Socket : Using [0]eth0:10.128.0.68<0>
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO Using non-device net plugin version 0
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO Using network Socket
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO Failed to open libibverbs.so[.1]
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO NET/Socket : Using [0]eth0:10.128.0.68<0>
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO Using non-device net plugin version 0
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO Using network Socket
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO comm 0x5563a3113b20 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 61000 commId 0x81e31ac233832a8e - Init START
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO comm 0x55eeb2ddb740 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 41000 commId 0x81e31ac233832a8e - Init START
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO comm 0x55738de34760 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 21000 commId 0x81e31ac233832a8e - Init START
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO comm 0x5592b925a920 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 81000 commId 0x81e31ac233832a8e - Init START
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO NVLS multicast support is not available on dev 1
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO NVLS multicast support is not available on dev 2
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO NVLS multicast support is not available on dev 0
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff,00000000
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO NVLS multicast support is not available on dev 3
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO comm 0x5563a3113b20 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO comm 0x5592b925a920 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO comm 0x55eeb2ddb740 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO P2P Chunksize set to 131072
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO P2P Chunksize set to 131072
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO comm 0x55738de34760 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO P2P Chunksize set to 131072
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO Channel 00/02 :    0   1   2   3
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO Channel 01/02 :    0   1   2   3
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO P2P Chunksize set to 131072
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/CUMEM
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/CUMEM
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO Connected all rings
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO Connected all rings
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO Connected all rings
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO Connected all rings
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO Connected all trees
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO Connected all trees
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO Connected all trees
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO Connected all trees
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:454 [2] NCCL INFO comm 0x5563a3113b20 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 61000 commId 0x81e31ac233832a8e - Init COMPLETE
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:452 [0] NCCL INFO comm 0x55738de34760 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 21000 commId 0x81e31ac233832a8e - Init COMPLETE
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:455 [1] NCCL INFO comm 0x55eeb2ddb740 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 41000 commId 0x81e31ac233832a8e - Init COMPLETE
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:453 [3] NCCL INFO comm 0x5592b925a920 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 81000 commId 0x81e31ac233832a8e - Init COMPLETE
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:457 [2] NCCL INFO [Service thread] Connection closed by localRank 2
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:456 [1] NCCL INFO [Service thread] Connection closed by localRank 1
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:458 [3] NCCL INFO [Service thread] Connection closed by localRank 3
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:459 [0] NCCL INFO [Service thread] Connection closed by localRank 0
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:430:470 [0] NCCL INFO comm 0x5563a3113b20 rank 2 nranks 4 cudaDev 2 busId 61000 - Abort COMPLETE
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:429:468 [0] NCCL INFO comm 0x55eeb2ddb740 rank 1 nranks 4 cudaDev 1 busId 41000 - Abort COMPLETE
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:431:469 [0] NCCL INFO comm 0x5592b925a920 rank 3 nranks 4 cudaDev 3 busId 81000 - Abort COMPLETE
mixtral-8x7b-instruct-tgi-deploy-76b99c95f9-nb7tf:428:471 [0] NCCL INFO comm 0x55738de34760 rank 0 nranks 4 cudaDev 0 busId 21000 - Abort COMPLETE

thobicex commented 4 weeks ago

The issue is a minority report, for which I tried to fill out the needed information.

Thank you all for your time, I'm going offline now until the next business day. Have a great weekend, everyone.

youkaichao commented 4 weeks ago

Your log file only shows 1 second. How long does it hang?

jayteaftw commented 4 weeks ago

Your log file only shows 1 second. How long does it hang?

If you are talking about the vLLM instance, it is still hanging (so about 34 minutes since creation). The test.py went through.

youkaichao commented 4 weeks ago

this is the start line:

2024-06-09 00:57:24.508092 Call to <module> in /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py:1 from _call_with_frames_removed in <frozen importlib._bootstrap>:241

this is the end line:

2024-06-09 00:57:25.183071 Return from __enter__ in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:365 to synchronize in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:791

only one second elapsed.

if it is still hanging, it is hanging inside pytorch code.

jayteaftw commented 4 weeks ago

this is the start line:

2024-06-09 00:57:24.508092 Call to <module> in /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py:1 from _call_with_frames_removed in <frozen importlib._bootstrap>:241

this is the end line:

2024-06-09 00:57:25.183071 Return from __enter__ in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:365 to synchronize in /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:791

only one second elapsed.

if it is still hanging, it is hanging inside pytorch code.

Okay, diagnosis-wise, what does that mean? Also, it has actually only been alive 10 minutes; I think the pod restarts after it hangs for that long.

youkaichao commented 4 weeks ago

I don't think this is vLLM's problem. From the vLLM side, I can tell there is something wrong with the environment, so the distributed environment cannot be set up.

You can try varying different factors, e.g. changing the GPU model you use or the GPU driver version, and try using a physical machine rather than a k8s container. I don't know anything about OpenShift, so I cannot help here.
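
One quick way to probe the environment from inside the pod is to check device visibility and peer-to-peer access with plain PyTorch calls (a sketch using standard torch.cuda APIs, not something prescribed in this thread):

```python
# sketch: basic GPU / P2P sanity check inside the container
import torch

print("torch", torch.__version__, "cuda", torch.version.cuda)
print("cuda available:", torch.cuda.is_available(), "device count:", torch.cuda.device_count())
n = torch.cuda.device_count()
for i in range(n):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
for i in range(n):
    for j in range(n):
        if i != j:
            # False means direct peer access is not available between these GPUs
            print(f"  P2P {i} -> {j}: {torch.cuda.can_device_access_peer(i, j)}")
```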

jayteaftw commented 4 weeks ago

Is vLLM unable to run in a container with 4 L40S GPUs hooked up to it through the NVIDIA runtime?

dtrifiro commented 3 weeks ago

I'm seeing a similar issue, although not on OpenShift (AWS g5 instance with 4x A10G).

When running with --tensor-parallel-size=4, vLLM hangs and eventually times out and crashes.

Logs below, with original command being:

env NCCL_DEBUG=TRACE \
  VLLM_NCCL_SO_PATH=~/.config/vllm/nccl/cu12/libnccl.so.2.18.1 \
  VLLM_WORKER_MULTIPROC_METHOD=fork \
  python -m vllm.entrypoints.openai.api_server \
    --tensor-parallel-size=4 \
    --max-model-len=4096 \
    --model mistralai/Mixtral-8x7B-v0.1 \
    --distributed-executor-backend=mp

Resulting in a crash after 10 minutes with:

[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600091 milliseconds before timing out.

Full logs below:

```console $ NCCL_DEBUG=TRACE VLLM_NCCL_SO_PATH=~/.config/vllm/nccl/cu12/libnccl.so.2.18.1 VLLM_WORKER_MULTIPROC_METHOD=fork python -m vllm.entrypoints.openai.api_server --tensor-parallel-size=4 --max-model-len=4096 --model mistralai/Mixtral-8x7B-v0.1 --distributed-executor-backend=mp INFO 06-10 15:44:38 api_server.py:177] vLLM API server version 0.4.3 INFO 06-10 15:44:38 api_server.py:178] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='mistralai/Mixtral-8x7B-v0.1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, rope_scaling=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None) /mnt/data/vllm/.venv/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. 
warnings.warn( INFO 06-10 15:44:38 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='mistralai/Mixtral-8x7B-v0.1', speculative_config=None, tokenizer='mistralai/Mixtral-8x7B-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=mistralai/Mixtral-8x7B-v0.1) (VllmWorkerProcess pid=150131) INFO 06-10 15:44:40 multiproc_worker_utils.py:214] Worker ready; awaiting tasks (VllmWorkerProcess pid=150132) INFO 06-10 15:44:40 multiproc_worker_utils.py:214] Worker ready; awaiting tasks (VllmWorkerProcess pid=150133) INFO 06-10 15:44:40 multiproc_worker_utils.py:214] Worker ready; awaiting tasks dtrifiro-gpu:150030:150030 [0] NCCL INFO Bootstrap : Using ens5:10.0.48.206<0> dtrifiro-gpu:150030:150030 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation dtrifiro-gpu:150030:150030 [0] NCCL INFO cudaDriverVersion 12050 NCCL version 2.20.5+cuda12.4 dtrifiro-gpu:150132:150132 [2] NCCL INFO cudaDriverVersion 12050 dtrifiro-gpu:150132:150132 [2] NCCL INFO Bootstrap : Using ens5:10.0.48.206<0> dtrifiro-gpu:150132:150132 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation dtrifiro-gpu:150133:150133 [3] NCCL INFO cudaDriverVersion 12050 dtrifiro-gpu:150133:150133 [3] NCCL INFO Bootstrap : Using ens5:10.0.48.206<0> dtrifiro-gpu:150133:150133 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation dtrifiro-gpu:150131:150131 [1] NCCL INFO cudaDriverVersion 12050 dtrifiro-gpu:150131:150131 [1] NCCL INFO Bootstrap : Using ens5:10.0.48.206<0> dtrifiro-gpu:150131:150131 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation dtrifiro-gpu:150030:150215 [0] NCCL INFO Failed to open libibverbs.so[.1] dtrifiro-gpu:150030:150215 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.48.206<0> dtrifiro-gpu:150030:150215 [0] NCCL INFO Using non-device net plugin version 0 dtrifiro-gpu:150030:150215 [0] NCCL INFO Using network Socket dtrifiro-gpu:150132:150216 [2] NCCL INFO Failed to open libibverbs.so[.1] dtrifiro-gpu:150132:150216 [2] NCCL INFO NET/Socket : Using [0]ens5:10.0.48.206<0> dtrifiro-gpu:150132:150216 [2] NCCL INFO Using non-device net plugin version 0 dtrifiro-gpu:150132:150216 [2] NCCL INFO Using network Socket dtrifiro-gpu:150131:150218 [1] NCCL INFO Failed to open libibverbs.so[.1] dtrifiro-gpu:150133:150217 [3] NCCL INFO Failed to open libibverbs.so[.1] dtrifiro-gpu:150131:150218 [1] NCCL INFO NET/Socket : Using [0]ens5:10.0.48.206<0> dtrifiro-gpu:150131:150218 [1] NCCL INFO Using non-device net plugin version 0 dtrifiro-gpu:150131:150218 [1] NCCL INFO Using network Socket dtrifiro-gpu:150133:150217 [3] NCCL INFO NET/Socket : Using [0]ens5:10.0.48.206<0> dtrifiro-gpu:150133:150217 [3] NCCL INFO Using 
non-device net plugin version 0 dtrifiro-gpu:150133:150217 [3] NCCL INFO Using network Socket dtrifiro-gpu:150133:150217 [3] NCCL INFO comm 0xaea9c50 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 1e0 commId 0x78852b713d2d90c5 - Init START dtrifiro-gpu:150030:150215 [0] NCCL INFO comm 0xaeaafe0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1b0 commId 0x78852b713d2d90c5 - Init START dtrifiro-gpu:150131:150218 [1] NCCL INFO comm 0xaea9020 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 1c0 commId 0x78852b713d2d90c5 - Init START dtrifiro-gpu:150132:150216 [2] NCCL INFO comm 0xaea8b10 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 1d0 commId 0x78852b713d2d90c5 - Init START dtrifiro-gpu:150133:150217 [3] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. dtrifiro-gpu:150030:150215 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. dtrifiro-gpu:150133:150217 [3] NCCL INFO NVLS multicast support is not available on dev 3 dtrifiro-gpu:150131:150218 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. dtrifiro-gpu:150132:150216 [2] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. dtrifiro-gpu:150030:150215 [0] NCCL INFO NVLS multicast support is not available on dev 0 dtrifiro-gpu:150131:150218 [1] NCCL INFO NVLS multicast support is not available on dev 1 dtrifiro-gpu:150132:150216 [2] NCCL INFO NVLS multicast support is not available on dev 2 dtrifiro-gpu:150131:150218 [1] NCCL INFO comm 0xaea9020 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0 dtrifiro-gpu:150132:150216 [2] NCCL INFO comm 0xaea8b10 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0 dtrifiro-gpu:150030:150215 [0] NCCL INFO comm 0xaeaafe0 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0 dtrifiro-gpu:150133:150217 [3] NCCL INFO comm 0xaea9c50 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0 dtrifiro-gpu:150131:150218 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 dtrifiro-gpu:150030:150215 [0] NCCL INFO Channel 00/04 : 0 1 2 3 dtrifiro-gpu:150132:150216 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 dtrifiro-gpu:150133:150217 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2 dtrifiro-gpu:150131:150218 [1] NCCL INFO P2P Chunksize set to 131072 dtrifiro-gpu:150030:150215 [0] NCCL INFO Channel 01/04 : 0 1 2 3 dtrifiro-gpu:150132:150216 [2] NCCL INFO P2P Chunksize set to 131072 dtrifiro-gpu:150133:150217 [3] NCCL INFO P2P Chunksize set to 131072 dtrifiro-gpu:150030:150215 [0] NCCL INFO Channel 02/04 : 0 1 2 3 dtrifiro-gpu:150030:150215 [0] NCCL INFO Channel 03/04 : 0 1 2 3 dtrifiro-gpu:150030:150215 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 dtrifiro-gpu:150030:150215 [0] NCCL INFO P2P Chunksize set to 131072 dtrifiro-gpu:150131:150218 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC dtrifiro-gpu:150030:150215 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC dtrifiro-gpu:150132:150216 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC dtrifiro-gpu:150133:150217 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/IPC dtrifiro-gpu:150131:150218 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC dtrifiro-gpu:150030:150215 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC dtrifiro-gpu:150132:150216 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC dtrifiro-gpu:150133:150217 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/IPC dtrifiro-gpu:150131:150218 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via 
P2P/IPC dtrifiro-gpu:150030:150215 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC dtrifiro-gpu:150132:150216 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC dtrifiro-gpu:150133:150217 [3] NCCL INFO Channel 02/0 : 3[3] -> 0[0] via P2P/IPC dtrifiro-gpu:150131:150218 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/IPC dtrifiro-gpu:150030:150215 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC dtrifiro-gpu:150132:150216 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/IPC dtrifiro-gpu:150133:150217 [3] NCCL INFO Channel 03/0 : 3[3] -> 0[0] via P2P/IPC dtrifiro-gpu:150132:150216 [2] NCCL INFO Connected all rings dtrifiro-gpu:150131:150218 [1] NCCL INFO Connected all rings dtrifiro-gpu:150133:150217 [3] NCCL INFO Connected all rings dtrifiro-gpu:150133:150217 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC dtrifiro-gpu:150030:150215 [0] NCCL INFO Connected all rings dtrifiro-gpu:150133:150217 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC dtrifiro-gpu:150133:150217 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/IPC dtrifiro-gpu:150133:150217 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/IPC dtrifiro-gpu:150132:150216 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC dtrifiro-gpu:150131:150218 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC dtrifiro-gpu:150132:150216 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC dtrifiro-gpu:150131:150218 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC dtrifiro-gpu:150132:150216 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/IPC dtrifiro-gpu:150131:150218 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC dtrifiro-gpu:150132:150216 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/IPC dtrifiro-gpu:150131:150218 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC dtrifiro-gpu:150133:150217 [3] NCCL INFO Connected all trees dtrifiro-gpu:150133:150217 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 dtrifiro-gpu:150133:150217 [3] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer dtrifiro-gpu:150132:150216 [2] NCCL INFO Connected all trees dtrifiro-gpu:150132:150216 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 dtrifiro-gpu:150132:150216 [2] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer dtrifiro-gpu:150131:150218 [1] NCCL INFO Connected all trees dtrifiro-gpu:150030:150215 [0] NCCL INFO Connected all trees dtrifiro-gpu:150131:150218 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 dtrifiro-gpu:150131:150218 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer dtrifiro-gpu:150030:150215 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 dtrifiro-gpu:150030:150215 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer dtrifiro-gpu:150133:150217 [3] NCCL INFO comm 0xaea9c50 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 1e0 commId 0x78852b713d2d90c5 - Init COMPLETE dtrifiro-gpu:150131:150218 [1] NCCL INFO comm 0xaea9020 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 1c0 commId 0x78852b713d2d90c5 - Init COMPLETE dtrifiro-gpu:150132:150216 [2] NCCL INFO comm 0xaea8b10 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 1d0 commId 0x78852b713d2d90c5 - Init COMPLETE dtrifiro-gpu:150030:150215 [0] NCCL INFO comm 0xaeaafe0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1b0 commId 0x78852b713d2d90c5 - Init COMPLETE [rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog 
caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600017 milliseconds before timing out. [rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: 26408258722611. [rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. [rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600017 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2c0877a897 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2bbbfaa1b2 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2bbbfaefd0 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2bbbfb031c in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xd44a3 (0x7f2c07cd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #5: + 0x89134 (0x7f2c09461134 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x1097dc (0x7f2c094e17dc in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600017 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2c0877a897 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2bbbfaa1b2 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2bbbfaefd0 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2bbbfb031c in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xd44a3 (0x7f2c07cd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #5: + 0x89134 (0x7f2c09461134 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x1097dc (0x7f2c094e17dc in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2c0877a897 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libc10.so) frame #1: + 0xe32e33 (0x7f2bbbc32e33 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0xd44a3 (0x7f2c07cd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: + 0x89134 (0x7f2c09461134 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x1097dc (0x7f2c094e17dc in /lib/x86_64-linux-gnu/libc.so.6) [rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. [rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: 26408958397760. [rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. [rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2c0877a897 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2bbbfaa1b2 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2bbbfaefd0 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2bbbfb031c in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xd44a3 (0x7f2c07cd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #5: + 0x89134 (0x7f2c09461134 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x1097dc (0x7f2c094e17dc in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2c0877a897 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2bbbfaa1b2 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2bbbfaefd0 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2bbbfb031c in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xd44a3 (0x7f2c07cd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #5: + 0x89134 (0x7f2c09461134 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x1097dc (0x7f2c094e17dc in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2c0877a897 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libc10.so) frame #1: + 0xe32e33 (0x7f2bbbc32e33 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0xd44a3 (0x7f2c07cd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: + 0x89134 (0x7f2c09461134 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x1097dc (0x7f2c094e17dc in /lib/x86_64-linux-gnu/libc.so.6) [rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600055 milliseconds before timing out. [rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: 26407437985505. [rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. [rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600055 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2c0877a897 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2bbbfaa1b2 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2bbbfaefd0 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2bbbfb031c in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xd44a3 (0x7f2c07cd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #5: + 0x89134 (0x7f2c09461134 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x1097dc (0x7f2c094e17dc in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600055 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2c0877a897 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2bbbfaa1b2 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2bbbfaefd0 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2bbbfb031c in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xd44a3 (0x7f2c07cd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #5: + 0x89134 (0x7f2c09461134 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x1097dc (0x7f2c094e17dc in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2c0877a897 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libc10.so) frame #1: + 0xe32e33 (0x7f2bbbc32e33 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0xd44a3 (0x7f2c07cd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: + 0x89134 (0x7f2c09461134 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x1097dc (0x7f2c094e17dc in /lib/x86_64-linux-gnu/libc.so.6) [rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600091 milliseconds 
before timing out. [rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: 26406255182832. [rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down. [rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600091 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2c0877a897 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2bbbfaa1b2 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2bbbfaefd0 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2bbbfb031c in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xd44a3 (0x7f2c07cd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #5: + 0x89134 (0x7f2c09461134 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x1097dc (0x7f2c094e17dc in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600091 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2c0877a897 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7f2bbbfaa1b2 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2bbbfaefd0 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2bbbfb031c in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xd44a3 (0x7f2c07cd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #5: + 0x89134 (0x7f2c09461134 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x1097dc (0x7f2c094e17dc in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2c0877a897 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libc10.so) frame #1: + 0xe32e33 (0x7f2bbbc32e33 in /mnt/data/vllm/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0xd44a3 (0x7f2c07cd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: + 0x89134 (0x7f2c09461134 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x1097dc (0x7f2c094e17dc in /lib/x86_64-linux-gnu/libc.so.6) [1] 150030 IOT instruction NCCL_DEBUG=TRACE VLLM_NCCL_SO_PATH=~/.config/vllm/nccl/cu12/libnccl.so.2.18.1 ```

Output of collect_env.py

Output:

```console
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 12 (bookworm) (x86_64)
GCC version: (Debian 12.2.0-14) 12.2.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.36

Python version: 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] (64-bit runtime)
Python platform: Linux-6.1.0-21-cloud-amd64-x86_64-with-glibc2.36
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G

Nvidia driver version: 555.42.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R32
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 0
BogoMIPS: 5600.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 768 KiB (24 instances)
L1i cache: 768 KiB (24 instances)
L2 cache: 12 MiB (24 instances)
L3 cache: 96 MiB (6 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-47
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] mypy-protobuf==3.5.0
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] transformers==4.41.2
[pip3] triton==2.3.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     PHB     PHB     0-47            0               N/A
GPU1    PHB      X      PHB     PHB     0-47            0               N/A
GPU2    PHB     PHB      X      PHB     0-47            0               N/A
GPU3    PHB     PHB     PHB      X      0-47            0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
```
as-herr commented 3 weeks ago

Running into the same problem as the OP on k8s 1.24. I have tried 0.4.2 and 0.4.3 with no success and would greatly appreciate a fix or guidance.

dtrifiro commented 2 weeks ago

I'm seeing a similar issue, although not on OpenShift (AWS g5 instance with 4x A10G).

It seems that the problem is that the instance does not support NCCL P2P (direct peer-to-peer transfers between the GPUs).
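One way to confirm that on the node before reaching for the workaround (a quick sketch, nothing vLLM-specific) is to look at the interconnect topology and ask PyTorch directly; a matrix that is all PHB links, like the one posted earlier in this thread, means every GPU-to-GPU path goes through the PCIe host bridge, where peer access is often unavailable on virtualized instances:

```
# Show the GPU interconnect topology; "PHB" means GPU-to-GPU traffic crosses
# the PCIe host bridge, where peer-to-peer access is often not supported on
# virtualized/cloud instances.
nvidia-smi topo -m

# Ask PyTorch whether GPU 0 can access GPU 1 peer-to-peer (prints True/False).
python -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"
```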

To disable it:

```
export NCCL_P2P_DISABLE=1
```
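
As a fuller sketch (not the only way to do it): the variable just needs to be set in the environment of the process that launches the workers, e.g. on the command line below, or in the container's `env` section when deploying on Kubernetes/OpenShift. The model and tensor-parallel size here are copied from the logs above; adjust them for your setup.

```
# Sketch: disable NCCL peer-to-peer for this run and start the OpenAI-compatible
# server with tensor parallelism across the 4 GPUs.
NCCL_P2P_DISABLE=1 python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-v0.1 \
    --tensor-parallel-size 4
```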

Also see: https://github.com/vllm-project/vllm/issues/5458#issuecomment-2173270848