vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: GPU can only load the model once, it gets stuck when loaded again #8444

Open hz20091942 opened 1 month ago

hz20091942 commented 1 month ago

Your current environment

The output of `python collect_env.py` ```text PyTorch version: 2.4.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Ubuntu 24.04.1 LTS (x86_64) GCC version: (Ubuntu 13.2.0-23ubuntu4) 13.2.0 Clang version: Could not collect CMake version: Could not collect Libc version: glibc-2.39 Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-6.8.0-44-generic-x86_64-with-glibc2.39 Is CUDA available: True CUDA runtime version: 12.4.99 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA A10 GPU 1: NVIDIA A10 GPU 2: NVIDIA A10 GPU 3: NVIDIA A10 GPU 4: NVIDIA A10 GPU 5: NVIDIA A10 GPU 6: NVIDIA A10 Nvidia driver version: 550.107.02 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 52 bits physical, 57 bits virtual Byte Order: Little Endian CPU(s): 104 On-line CPU(s) list: 0-103 Vendor ID: GenuineIntel BIOS Vendor ID: Intel(R) Corporation Model name: Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz BIOS Model name: Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz CPU @ 2.2GHz BIOS CPU family: 179 CPU family: 6 Model: 106 Thread(s) per core: 2 Core(s) per socket: 26 Socket(s): 2 Stepping: 6 CPU(s) scaling MHz: 24% CPU max MHz: 3400.0000 CPU min MHz: 800.0000 BogoMIPS: 4400.00 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities Virtualization: VT-x L1d cache: 2.4 MiB (52 instances) L1i cache: 1.6 MiB (52 instances) L2 cache: 65 MiB (52 instances) L3 cache: 78 MiB (2 instances) NUMA node(s): 2 NUMA node0 CPU(s): 0-25,52-77 NUMA node1 CPU(s): 26-51,78-103 Vulnerability Gather data sampling: Mitigation; Microcode Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Versions 
of relevant libraries: [pip3] numpy==1.26.4 [pip3] nvidia-cublas-cu12==12.1.3.1 [pip3] nvidia-cuda-cupti-cu12==12.1.105 [pip3] nvidia-cuda-nvrtc-cu12==12.1.105 [pip3] nvidia-cuda-runtime-cu12==12.1.105 [pip3] nvidia-cudnn-cu12==9.1.0.70 [pip3] nvidia-cufft-cu12==11.0.2.54 [pip3] nvidia-curand-cu12==10.3.2.106 [pip3] nvidia-cusolver-cu12==11.4.5.107 [pip3] nvidia-cusparse-cu12==12.1.0.106 [pip3] nvidia-ml-py==12.560.30 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] nvidia-nvjitlink-cu12==12.6.68 [pip3] nvidia-nvtx-cu12==12.1.105 [pip3] pyzmq==26.2.0 [pip3] torch==2.4.0 [pip3] torchvision==0.19.0 [pip3] transformers==4.44.2 [pip3] triton==3.0.0 [conda] numpy 1.26.4 pypi_0 pypi [conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi [conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi [conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi [conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi [conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi [conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi [conda] nvidia-ml-py 12.560.30 pypi_0 pypi [conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi [conda] nvidia-nvjitlink-cu12 12.6.68 pypi_0 pypi [conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi [conda] pyzmq 26.2.0 pypi_0 pypi [conda] torch 2.4.0 pypi_0 pypi [conda] torchvision 0.19.0 pypi_0 pypi [conda] transformers 4.44.2 pypi_0 pypi [conda] triton 3.0.0 pypi_0 pypi ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.6.0@32e7db25365415841ebc7c4215851743fbb1bad1 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X PIX PXB PXB PXB PXB PXB 0-25,52-77 0 N/A GPU1 PIX X PXB PXB PXB PXB PXB 0-25,52-77 0 N/A GPU2 PXB PXB X PXB PXB PXB PXB 0-25,52-77 0 N/A GPU3 PXB PXB PXB X PXB PXB PXB 0-25,52-77 0 N/A GPU4 PXB PXB PXB PXB X PXB PXB 0-25,52-77 0 N/A GPU5 PXB PXB PXB PXB PXB X PXB 0-25,52-77 0 N/A GPU6 PXB PXB PXB PXB PXB PXB X 0-25,52-77 0 N/A Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks ```

Model Input Dumps

No response

🐛 Describe the bug

After every reboot of my GPU machine,

  1. Running the NCCL test script from the official documentation [Getting Started / Debugging Tips] succeeds (a reconstruction of that sanity-check script is sketched at the end of this report).

  2. Running the server in the conda virtual environment:

    python -m vllm.entrypoints.openai.api_server --model /data1/xinference/modelscope/hub/qwen/Qwen2-72B-Instruct-GPTQ-Int4 --served-model-name qwen2-72b --tensor-parallel-size 4

    Or running the container with Docker:

    docker run --gpus all --name vllm \
    -v /data1/xinference/modelscope:/root/model \
    -v /data1/vllm:/root/vllm \
    -p 8880:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model /root/model/hub/qwen/Qwen2-72B-Instruct-GPTQ-Int4 \
    --gpu-memory-utilization 0.8 \
    --tensor-parallel-size 4 \
    --max-model-len 8129 \
    --served-model-name Qwen2-72b-instruct

    Both of these start successfully.

  3. After exiting the running model (exiting the Docker container, stopping or removing it, or even shutting down the Docker service), rerunning either command from step 2, whether in the conda virtual environment or in Docker, fails: the model cannot load and the program gets stuck. The console output from the Docker run is as follows:

    (base) root@llmgpu01:/home/hz# docker run --gpus all --name vllm-qwen \
    -v /data1/xinference/modelscope:/root/model \
    -v /data1/vllm:/root/vllm \
    -p 8880:8000 \
    -e VLLM_LOGGING_LEVEL=DEBUG \
    -e CUDA_LAUNCH_BLOCKING=1 \
    -e NCCL_DEBUG=TRACE \
    -e VLLM_TRACE_FUNCTION=1 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model /root/model/hub/qwen/Qwen2-72B-Instruct-GPTQ-Int4 \
    --gpu-memory-utilization 0.8 \
    --tensor-parallel-size 4 \
    --max-model-len 8129 \
    --served-model-name Qwen2-72b-instruct
    INFO 09-12 22:35:53 api_server.py:459] vLLM API server version 0.6.0
    INFO 09-12 22:35:53 api_server.py:460] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/root/model/hub/qwen/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8129, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen2-72b-instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
    INFO 09-12 22:35:53 gptq_marlin.py:102] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    INFO 09-12 22:35:53 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/43b2554b-d0e7-455b-b354-1227e7e84bea for RPC Path.
    INFO 09-12 22:35:53 api_server.py:176] Started engine process with PID 78
    INFO 09-12 22:35:56 gptq_marlin.py:102] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    INFO 09-12 22:35:56 config.py:890] Defaulting to use mp for distributed inference
    INFO 09-12 22:35:56 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='/root/model/hub/qwen/Qwen2-72B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/root/model/hub/qwen/Qwen2-72B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8129, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2-72b-instruct, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True)
    WARNING 09-12 22:35:57 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 52 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
    INFO 09-12 22:35:57 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
    (VllmWorkerProcess pid=209) WARNING 09-12 22:35:57 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
    (VllmWorkerProcess pid=209) INFO 09-12 22:35:57 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-bc059fd3d0224c9fb3f46c6f373c79f9/VLLM_TRACE_FUNCTION_for_process_209_thread_123791027767104_at_2024-09-12_22:35:57.114681.log
    (VllmWorkerProcess pid=210) WARNING 09-12 22:35:57 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
    (VllmWorkerProcess pid=210) INFO 09-12 22:35:57 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-bc059fd3d0224c9fb3f46c6f373c79f9/VLLM_TRACE_FUNCTION_for_process_210_thread_123791027767104_at_2024-09-12_22:35:57.122202.log
    WARNING 09-12 22:35:57 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
    INFO 09-12 22:35:57 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-bc059fd3d0224c9fb3f46c6f373c79f9/VLLM_TRACE_FUNCTION_for_process_78_thread_123791027767104_at_2024-09-12_22:35:57.131361.log
    (VllmWorkerProcess pid=211) WARNING 09-12 22:35:57 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
    (VllmWorkerProcess pid=211) INFO 09-12 22:35:57 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-bc059fd3d0224c9fb3f46c6f373c79f9/VLLM_TRACE_FUNCTION_for_process_211_thread_123791027767104_at_2024-09-12_22:35:57.132131.log
    (VllmWorkerProcess pid=209) INFO 09-12 22:35:57 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
    (VllmWorkerProcess pid=210) INFO 09-12 22:35:57 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
    (VllmWorkerProcess pid=211) INFO 09-12 22:35:57 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
    DEBUG 09-12 22:35:59 parallel_state.py:845] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:48341 backend=nccl
    (VllmWorkerProcess pid=210) DEBUG 09-12 22:35:59 parallel_state.py:845] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:48341 backend=nccl
    (VllmWorkerProcess pid=209) DEBUG 09-12 22:35:59 parallel_state.py:845] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:48341 backend=nccl
    (VllmWorkerProcess pid=211) DEBUG 09-12 22:35:59 parallel_state.py:845] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:48341 backend=nccl
    (VllmWorkerProcess pid=210) INFO 09-12 22:35:59 utils.py:977] Found nccl from library libnccl.so.2
    INFO 09-12 22:35:59 utils.py:977] Found nccl from library libnccl.so.2
    (VllmWorkerProcess pid=209) INFO 09-12 22:35:59 utils.py:977] Found nccl from library libnccl.so.2
    (VllmWorkerProcess pid=211) INFO 09-12 22:35:59 utils.py:977] Found nccl from library libnccl.so.2
    (VllmWorkerProcess pid=210) INFO 09-12 22:35:59 pynccl.py:63] vLLM is using nccl==2.20.5
    INFO 09-12 22:35:59 pynccl.py:63] vLLM is using nccl==2.20.5
    (VllmWorkerProcess pid=211) INFO 09-12 22:35:59 pynccl.py:63] vLLM is using nccl==2.20.5
    (VllmWorkerProcess pid=209) INFO 09-12 22:35:59 pynccl.py:63] vLLM is using nccl==2.20.5
    11aaa61a175c:78:78 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
    11aaa61a175c:78:78 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
    11aaa61a175c:78:78 [0] NCCL INFO cudaDriverVersion 12040
    NCCL version 2.20.5+cuda12.4
  4. At this point, running the NCCL test script from the official documentation [Getting Started / Debugging Tips] again also gets stuck; the console output is as follows:

    
    (vllmenv) root@llmgpu01:/home/hz/vllm_code# CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_DEBUG=TRACE torchrun --nproc-per-node=4 test.py
    W0913 05:21:44.715000 125815160129344 torch/distributed/run.py:779]
    W0913 05:21:44.715000 125815160129344 torch/distributed/run.py:779] *****************************************
    W0913 05:21:44.715000 125815160129344 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    W0913 05:21:44.715000 125815160129344 torch/distributed/run.py:779] *****************************************
    llmgpu01:57120:57120 [0] NCCL INFO Bootstrap : Using ens1f3:10.141.69.40<0>
    llmgpu01:57120:57120 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
    llmgpu01:57120:57120 [0] NCCL INFO cudaDriverVersion 12040
    NCCL version 2.20.5+cuda12.4
    llmgpu01:57121:57121 [1] NCCL INFO cudaDriverVersion 12040
    llmgpu01:57121:57121 [1] NCCL INFO Bootstrap : Using ens1f3:10.141.69.40<0>
    llmgpu01:57121:57121 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
    llmgpu01:57122:57122 [2] NCCL INFO cudaDriverVersion 12040
    llmgpu01:57122:57122 [2] NCCL INFO Bootstrap : Using ens1f3:10.141.69.40<0>
    llmgpu01:57122:57122 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
    llmgpu01:57123:57123 [3] NCCL INFO cudaDriverVersion 12040
    llmgpu01:57123:57123 [3] NCCL INFO Bootstrap : Using ens1f3:10.141.69.40<0>
    llmgpu01:57123:57123 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
    llmgpu01:57120:57171 [0] NCCL INFO NET/IB : No device found.
    llmgpu01:57120:57171 [0] NCCL INFO NET/Socket : Using [0]ens1f3:10.141.69.40<0> [1]bond0:fe80::9476:cff:fe76:1232%bond0<0>
    llmgpu01:57120:57171 [0] NCCL INFO Using non-device net plugin version 0
    llmgpu01:57120:57171 [0] NCCL INFO Using network Socket
    llmgpu01:57123:57174 [3] NCCL INFO NET/IB : No device found.
    llmgpu01:57121:57172 [1] NCCL INFO NET/IB : No device found.
    llmgpu01:57123:57174 [3] NCCL INFO NET/Socket : Using [0]ens1f3:10.141.69.40<0> [1]bond0:fe80::9476:cff:fe76:1232%bond0<0>
    llmgpu01:57121:57172 [1] NCCL INFO NET/Socket : Using [0]ens1f3:10.141.69.40<0> [1]bond0:fe80::9476:cff:fe76:1232%bond0<0>
    llmgpu01:57123:57174 [3] NCCL INFO Using non-device net plugin version 0
    llmgpu01:57121:57172 [1] NCCL INFO Using non-device net plugin version 0
    llmgpu01:57123:57174 [3] NCCL INFO Using network Socket
    llmgpu01:57121:57172 [1] NCCL INFO Using network Socket
    llmgpu01:57122:57173 [2] NCCL INFO NET/IB : No device found.
    llmgpu01:57122:57173 [2] NCCL INFO NET/Socket : Using [0]ens1f3:10.141.69.40<0> [1]bond0:fe80::9476:cff:fe76:1232%bond0<0>
    llmgpu01:57122:57173 [2] NCCL INFO Using non-device net plugin version 0
    llmgpu01:57122:57173 [2] NCCL INFO Using network Socket
    llmgpu01:57120:57171 [0] NCCL INFO comm 0x80450d0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 14000 commId 0x96d601d529f775a4 - Init START
    llmgpu01:57123:57174 [3] NCCL INFO comm 0x7a8cd10 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 1e000 commId 0x96d601d529f775a4 - Init START
    llmgpu01:57122:57173 [2] NCCL INFO comm 0x8216dd0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 18000 commId 0x96d601d529f775a4 - Init START
    llmgpu01:57121:57172 [1] NCCL INFO comm 0x7a156f0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 15000 commId 0x96d601d529f775a4 - Init START
    llmgpu01:57123:57174 [3] NCCL INFO Setting affinity for GPU 3 to 3fff,fff00000,03ffffff
    llmgpu01:57123:57174 [3] NCCL INFO NVLS multicast support is not available on dev 3
    llmgpu01:57122:57173 [2] NCCL INFO Setting affinity for GPU 2 to 3fff,fff00000,03ffffff
    llmgpu01:57122:57173 [2] NCCL INFO NVLS multicast support is not available on dev 2
    llmgpu01:57120:57171 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff
    llmgpu01:57120:57171 [0] NCCL INFO NVLS multicast support is not available on dev 0
    llmgpu01:57121:57172 [1] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff
    llmgpu01:57121:57172 [1] NCCL INFO NVLS multicast support is not available on dev 1
    llmgpu01:57123:57174 [3] NCCL INFO comm 0x7a8cd10 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
    llmgpu01:57122:57173 [2] NCCL INFO comm 0x8216dd0 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
    llmgpu01:57121:57172 [1] NCCL INFO comm 0x7a156f0 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
    llmgpu01:57122:57173 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1
    llmgpu01:57120:57171 [0] NCCL INFO comm 0x80450d0 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
    llmgpu01:57122:57173 [2] NCCL INFO P2P Chunksize set to 131072
    llmgpu01:57121:57172 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
    llmgpu01:57123:57174 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2
    llmgpu01:57121:57172 [1] NCCL INFO P2P Chunksize set to 131072
    llmgpu01:57120:57171 [0] NCCL INFO Channel 00/04 :    0   1   2   3
    llmgpu01:57123:57174 [3] NCCL INFO P2P Chunksize set to 131072
    llmgpu01:57120:57171 [0] NCCL INFO Channel 01/04 :    0   1   2   3
    llmgpu01:57120:57171 [0] NCCL INFO Channel 02/04 :    0   1   2   3
    llmgpu01:57120:57171 [0] NCCL INFO Channel 03/04 :    0   1   2   3
    llmgpu01:57120:57171 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
    llmgpu01:57120:57171 [0] NCCL INFO P2P Chunksize set to 131072
    llmgpu01:57121:57172 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
    llmgpu01:57122:57173 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM
    llmgpu01:57121:57172 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
    llmgpu01:57122:57173 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM
    llmgpu01:57121:57172 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM
    llmgpu01:57122:57173 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM
    llmgpu01:57123:57174 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/CUMEM
    llmgpu01:57121:57172 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM
    llmgpu01:57122:57173 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM
    llmgpu01:57123:57174 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/CUMEM
    llmgpu01:57120:57171 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
    llmgpu01:57123:57174 [3] NCCL INFO Channel 02/0 : 3[3] -> 0[0] via P2P/CUMEM
    llmgpu01:57120:57171 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
    llmgpu01:57123:57174 [3] NCCL INFO Channel 03/0 : 3[3] -> 0[0] via P2P/CUMEM
    llmgpu01:57120:57171 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
    llmgpu01:57120:57171 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
    llmgpu01:57122:57173 [2] NCCL INFO Connected all rings
    llmgpu01:57120:57171 [0] NCCL INFO Connected all rings
    llmgpu01:57121:57172 [1] NCCL INFO Connected all rings
    llmgpu01:57123:57174 [3] NCCL INFO Connected all rings
    llmgpu01:57123:57174 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM
    llmgpu01:57123:57174 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM
    llmgpu01:57123:57174 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM
    llmgpu01:57123:57174 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/CUMEM
    llmgpu01:57122:57173 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
    llmgpu01:57121:57172 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
    llmgpu01:57122:57173 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
    llmgpu01:57121:57172 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
    llmgpu01:57122:57173 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM
    llmgpu01:57121:57172 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
    llmgpu01:57122:57173 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM
    llmgpu01:57121:57172 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
    llmgpu01:57120:57171 [0] NCCL INFO Connected all trees
    llmgpu01:57120:57171 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
    llmgpu01:57120:57171 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
    llmgpu01:57121:57172 [1] NCCL INFO Connected all trees
    llmgpu01:57121:57172 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
    llmgpu01:57121:57172 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
    llmgpu01:57123:57174 [3] NCCL INFO Connected all trees
    llmgpu01:57122:57173 [2] NCCL INFO Connected all trees
    llmgpu01:57122:57173 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
    llmgpu01:57123:57174 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
    llmgpu01:57122:57173 [2] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
    llmgpu01:57123:57174 [3] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
    llmgpu01:57122:57173 [2] NCCL INFO comm 0x8216dd0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 18000 commId 0x96d601d529f775a4 - Init COMPLETE
    llmgpu01:57120:57171 [0] NCCL INFO comm 0x80450d0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 14000 commId 0x96d601d529f775a4 - Init COMPLETE
    llmgpu01:57121:57172 [1] NCCL INFO comm 0x7a156f0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 15000 commId 0x96d601d529f775a4 - Init COMPLETE
    llmgpu01:57123:57174 [3] NCCL INFO comm 0x7a8cd10 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 1e000 commId 0x96d601d529f775a4 - Init COMPLETE
    [rank3]:[E913 05:31:48.671656405 ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600038 milliseconds before timing out.
    [rank3]:[E913 05:31:48.673878020 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
    [rank2]:[E913 05:31:48.676577933 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600046 milliseconds before timing out.
    [rank2]:[E913 05:31:48.677396512 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
    [rank1]:[E913 05:31:48.698410832 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
    [rank1]:[E913 05:31:48.699549516 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
    [rank0]:[E913 05:31:48.702702155 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
    [rank0]:[E913 05:31:48.703782978 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
    llmgpu01:57120:57180 [0] NCCL INFO [Service thread] Connection closed by localRank 0
    [rank0]: Traceback (most recent call last):
    [rank0]:   File "/home/hz/vllm_code/test.py", line 16, in <module>
    [rank0]:     assert value == world_size, f"Expected {world_size}, got {value}"
    [rank0]: AssertionError: Expected 4, got 1.0
    [rank0]:[W913 05:31:48.018948929 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
    llmgpu01:57123:57178 [3] NCCL INFO [Service thread] Connection closed by localRank 3
    llmgpu01:57122:57175 [2] NCCL INFO [Service thread] Connection closed by localRank 2
    [rank3]: Traceback (most recent call last):
    [rank3]:   File "/home/hz/vllm_code/test.py", line 16, in <module>
    [rank3]:     assert value == world_size, f"Expected {world_size}, got {value}"
    [rank3]: AssertionError: Expected 4, got 0.0
    llmgpu01:57121:57176 [1] NCCL INFO [Service thread] Connection closed by localRank 1
    [rank2]: Traceback (most recent call last):
    [rank2]:   File "/home/hz/vllm_code/test.py", line 16, in <module>
    [rank2]:     assert value == world_size, f"Expected {world_size}, got {value}"
    [rank2]: AssertionError: Expected 4, got 0.0
    llmgpu01:57123:57152 [0] NCCL INFO comm 0x7a8cd10 rank 3 nranks 4 cudaDev 3 busId 1e000 - Abort COMPLETE
    [rank1]: Traceback (most recent call last):
    [rank1]:   File "/home/hz/vllm_code/test.py", line 16, in <module>
    [rank1]:     assert value == world_size, f"Expected {world_size}, got {value}"
    [rank1]: AssertionError: Expected 4, got 0.0
    llmgpu01:57121:57156 [0] NCCL INFO comm 0x7a156f0 rank 1 nranks 4 cudaDev 1 busId 15000 - Abort COMPLETE
    llmgpu01:57122:57154 [0] NCCL INFO comm 0x8216dd0 rank 2 nranks 4 cudaDev 2 busId 18000 - Abort COMPLETE
    [rank3]:[E913 05:31:49.221511238 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
    [rank3]:[E913 05:31:49.221547176 ProcessGroupNCCL.cpp:621] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
    [rank3]:[E913 05:31:49.221557609 ProcessGroupNCCL.cpp:627] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
    [rank3]:[E913 05:31:49.224349198 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600038 milliseconds before timing out.
    Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b60b092ef86 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libc10.so)
    frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7b6061bc88d2 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b6061bcf313 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7b6061bd16fc in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #4: <unknown function> + 0xdbbf4 (0x7b60af6dbbf4 in /usr/local/miniconda3/envs/vllmenv/bin/../lib/libstdc++.so.6)
    frame #5: <unknown function> + 0x9ca94 (0x7b60b149ca94 in /usr/lib/x86_64-linux-gnu/libc.so.6)
    frame #6: <unknown function> + 0x129c3c (0x7b60b1529c3c in /usr/lib/x86_64-linux-gnu/libc.so.6)

    [rank1]:[E913 05:31:49.230512045 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
    [rank1]:[E913 05:31:49.230536304 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
    [rank1]:[E913 05:31:49.230544762 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
    [rank2]:[E913 05:31:49.230718430 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
    [rank2]:[E913 05:31:49.230739714 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
    [rank2]:[E913 05:31:49.230746779 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
    [rank1]:[E913 05:31:49.232083952 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
    Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78789b199f86 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libc10.so)
    frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x78784c3c88d2 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x78784c3cf313 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x78784c3d16fc in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #4: <unknown function> + 0xdbbf4 (0x787899edbbf4 in /usr/local/miniconda3/envs/vllmenv/bin/../lib/libstdc++.so.6)
    frame #5: <unknown function> + 0x9ca94 (0x78789bc9ca94 in /usr/lib/x86_64-linux-gnu/libc.so.6)
    frame #6: <unknown function> + 0x129c3c (0x78789bd29c3c in /usr/lib/x86_64-linux-gnu/libc.so.6)

    [rank2]:[E913 05:31:49.232247737 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600046 milliseconds before timing out.
    Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78d59a167f86 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libc10.so)
    frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x78d54b3c88d2 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x78d54b3cf313 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x78d54b3d16fc in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #4: <unknown function> + 0xdbbf4 (0x78d598edbbf4 in /usr/local/miniconda3/envs/vllmenv/bin/../lib/libstdc++.so.6)
    frame #5: <unknown function> + 0x9ca94 (0x78d59ac9ca94 in /usr/lib/x86_64-linux-gnu/libc.so.6)
    frame #6: <unknown function> + 0x129c3c (0x78d59ad29c3c in /usr/lib/x86_64-linux-gnu/libc.so.6)

    llmgpu01:57120:57158 [0] NCCL INFO comm 0x80450d0 rank 0 nranks 4 cudaDev 0 busId 14000 - Abort COMPLETE
    [rank0]:[E913 05:31:49.400242426 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
    [rank0]:[E913 05:31:49.400271008 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
    [rank0]:[E913 05:31:49.400283907 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
    [rank0]:[E913 05:31:49.403822047 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
    Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f1f883ddf86 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libc10.so)
    frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f1f395c88d2 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f1f395cf313 in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f1f395d16fc in /usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #4: <unknown function> + 0xdbbf4 (0x7f1f870dbbf4 in /usr/local/miniconda3/envs/vllmenv/bin/../lib/libstdc++.so.6)
    frame #5: <unknown function> + 0x9ca94 (0x7f1f8909ca94 in /usr/lib/x86_64-linux-gnu/libc.so.6)
    frame #6: <unknown function> + 0x129c3c (0x7f1f89129c3c in /usr/lib/x86_64-linux-gnu/libc.so.6)

    W0913 05:31:49.562000 125815160129344 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 57120 closing signal SIGTERM
    W0913 05:31:49.563000 125815160129344 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 57121 closing signal SIGTERM
    W0913 05:31:49.563000 125815160129344 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 57122 closing signal SIGTERM
    E0913 05:31:49.878000 125815160129344 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 3 (pid: 57123) of binary: /usr/local/miniconda3/envs/vllmenv/bin/python
    Traceback (most recent call last):
      File "/usr/local/miniconda3/envs/vllmenv/bin/torchrun", line 8, in <module>
        sys.exit(main())
      File "/usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
        return f(*args, **kwargs)
      File "/usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
        run(args)
      File "/usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
        elastic_launch(
      File "/usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/usr/local/miniconda3/envs/vllmenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

test.py FAILED

Failures:

    ------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2024-09-13_05:31:49
      host      : llmgpu01
      rank      : 3 (local_rank: 3)
      exitcode  : -6 (pid: 57123)
      error_file:
      traceback : Signal 6 (SIGABRT) received by PID 57123
    ======================================================
    (vllmenv) root@llmgpu01:/home/hz/vllm_code#

  5. Before each model load, I have confirmed that no other processes are occupying the GPUs:

    (base) root@llmgpu01:/home/hz# nvidia-smi
    Fri Sep 13 05:54:12 2024
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA A10                     Off |   00000000:14:00.0 Off |                    0 |
    |  0%   55C    P8             17W /  150W |       1MiB /  23028MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   1  NVIDIA A10                     Off |   00000000:15:00.0 Off |                    0 |
    |  0%   62C    P8             18W /  150W |       1MiB /  23028MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   2  NVIDIA A10                     Off |   00000000:18:00.0 Off |                    0 |
    |  0%   60C    P8             18W /  150W |       1MiB /  23028MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   3  NVIDIA A10                     Off |   00000000:1E:00.0 Off |                    0 |
    |  0%   64C    P8             18W /  150W |       1MiB /  23028MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   4  NVIDIA A10                     Off |   00000000:21:00.0 Off |                    0 |
    |  0%   35C    P8             10W /  150W |       1MiB /  23028MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   5  NVIDIA A10                     Off |   00000000:25:00.0 Off |                    0 |
    |  0%   36C    P8             16W /  150W |       1MiB /  23028MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   6  NVIDIA A10                     Off |   00000000:2D:00.0 Off |                    0 |
    |  0%   38C    P8             16W /  150W |       1MiB /  23028MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
    (base) root@llmgpu01:/home/hz#

Before submitting a new issue...

- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
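For reference, the `test.py` used in steps 1 and 4 is not included above. The assertion in the tracebacks ("Expected 4, got 1.0") matches the collective-communication sanity check from the vLLM debugging docs, so a reconstruction along those lines (an approximation, not necessarily the exact script used here) looks like this:

```python
# Reconstruction (assumed) of the NCCL sanity-check script: every rank
# all-reduces a tensor of 128 ones, so after the SUM reduction the mean
# must equal the world size; that is the assertion seen in the tracebacks.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

data = torch.ones(128, device="cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()

value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"

print(f"rank {dist.get_rank()}: NCCL all_reduce sanity check passed")
dist.destroy_process_group()
```

It is launched the same way as in step 4, e.g. `CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_DEBUG=TRACE torchrun --nproc-per-node=4 test.py`; on a healthy setup every rank prints the success message instead of tripping the assertion.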
DarkLight1337 commented 1 month ago

Is this on the latest patch version of vLLM?

youkaichao commented 1 month ago

Looks like a machine / hardware problem. Make sure you try it on different machines.
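One diagnostic that may help narrow down a hardware or platform cause (an added sketch, not something suggested in the thread): the NCCL init log shows every channel going over P2P/CUMEM on a PCIe-only (PXB) topology, and broken peer-to-peer access is a common reason for a collective to hang after the first run. The peer-access matrix CUDA reports is cheap to check:

```python
# Print the CUDA peer-to-peer access matrix for the visible GPUs.
# If pairs that NCCL routes over P2P report no peer access, or the report
# changes between the first and second run, the hang is likely below vLLM.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        ok = torch.cuda.can_device_access_peer(i, j)
        print(f"GPU {i} -> GPU {j}: peer access {'available' if ok else 'NOT available'}")
```

Rerunning the hanging case with `NCCL_P2P_DISABLE=1` set in the environment tests the same hypothesis from the other side: if the collectives then complete, the problem sits in the PCIe P2P path (IOMMU/ACS settings are a frequent culprit) rather than in vLLM itself.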

hz20091942 commented 1 month ago

Is this on the latest patch version of vLLM?

version = 0.6.0
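For completeness, the installed versions can be confirmed directly from the active environment; `vllm.__version__` and `torch.__version__` are standard attributes:

```python
# Quick check of the installed vLLM and PyTorch versions in the active env.
import torch
import vllm

print("vllm:", vllm.__version__)    # expected: 0.6.0 per the report
print("torch:", torch.__version__)
```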

hz20091942 commented 1 month ago

I have two GPU machines. I installed CUDA 12.2 on the other machine, but the issue still persists.

hz20091942 commented 1 month ago

Looks like a machine / hardware problem. Make sure you try it on different machines.

I used OpenMMLab's MMDetection library for multi-GPU pre-training on the COCO dataset and this exception did not occur; the training can be restarted many times. But once I use vLLM to load a model and then exit, MMDetection training can no longer start either. So I still suspect there may be a bug in the vLLM library (see the isolation sketch after the training logs below). Hope to receive an answer. Thanks.

(mmdet) root@llmgpu02:/home/hz/mmdetection# CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh ./configs/rtmdet/rtmdet_l_8xb32-300e_coco_.py 4
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/torch/distributed/launch.py:208: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  main()
W0913 13:26:23.831864 137595158689600 torch/distributed/run.py:779]
W0913 13:26:23.831864 137595158689600 torch/distributed/run.py:779] *****************************************
W0913 13:26:23.831864 137595158689600 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0913 13:26:23.831864 137595158689600 torch/distributed/run.py:779] *****************************************
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/optim/optimizer/zero_optimizer.py:11: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import \
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/optim/optimizer/zero_optimizer.py:11: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import \
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/optim/optimizer/zero_optimizer.py:11: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import \
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/optim/optimizer/zero_optimizer.py:11: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import \
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
09/13 13:26:29 - mmengine - INFO -
------------------------------------------------------------
System environment:
    sys.platform: linux
    Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 1733243605
    GPU 0,1,2,3: NVIDIA A10
    CUDA_HOME: /usr/local/cuda-12.2
    NVCC: Cuda compilation tools, release 12.2, V12.2.91
    GCC: gcc (Ubuntu 12.3.0-17ubuntu1) 12.3.0
    PyTorch: 2.4.1
    PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.4
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 90.1
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.4, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

    TorchVision: 0.19.1
    OpenCV: 4.10.0
    MMEngine: 0.10.4

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 1733243605
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 4
------------------------------------------------------------
…………

creating index...
index created!
09/13 13:27:07 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
09/13 13:27:07 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
09/13 13:27:07 - mmengine - INFO - Checkpoints will be saved to /home/hz/mmdetection/work_dirs/rtmdet_l_8xb32-300e_coco_.
/home/hz/mmdetection/mmdet/models/layers/se_layer.py:158: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/home/hz/mmdetection/mmdet/models/layers/se_layer.py:158: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/home/hz/mmdetection/mmdet/models/layers/se_layer.py:158: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/home/hz/mmdetection/mmdet/models/layers/se_layer.py:158: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/home/hz/mmdetection/mmdet/models/backbones/csp_darknet.py:118: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/home/hz/mmdetection/mmdet/models/backbones/csp_darknet.py:118: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/home/hz/mmdetection/mmdet/models/backbones/csp_darknet.py:118: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/home/hz/mmdetection/mmdet/models/backbones/csp_darknet.py:118: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/torch/functional.py:513: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1724789116784/work/aten/src/ATen/native/TensorShape.cpp:3609.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/torch/functional.py:513: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1724789116784/work/aten/src/ATen/native/TensorShape.cpp:3609.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/torch/functional.py:513: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1724789116784/work/aten/src/ATen/native/TensorShape.cpp:3609.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/usr/local/miniconda3/envs/mmdet/lib/python3.8/site-packages/torch/functional.py:513: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1724789116784/work/aten/src/ATen/native/TensorShape.cpp:3609.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
09/13 13:27:39 - mmengine - INFO - Epoch(train)   [1][  50/2566]  base_lr: 1.9623e-04 lr: 1.9623e-04  eta: 5 days, 19:40:31  time: 0.6532  data_time: 0.0189  memory: 12354  loss: 2.3154  loss_cls: 0.9985  loss_bbox: 1.3169
09/13 13:28:11 - mmengine - INFO - Epoch(train)   [1][ 100/2566]  base_lr: 3.9643e-04 lr: 3.9643e-04  eta: 5 days, 16:43:34  time: 0.6257  data_time: 0.0032  memory: 12397  loss: 2.1314  loss_cls: 1.0023  loss_bbox: 1.1291
09/13 13:28:42 - mmengine - INFO - Epoch(train)   [1][ 150/2566]  base_lr: 5.9663e-04 lr: 5.9663e-04  eta: 5 days, 16:05:45  time: 0.6308  data_time: 0.0032  memory: 12633  loss: 1.9041  loss_cls: 0.9908  loss_bbox: 0.9133
09/13 13:29:14 - mmengine - INFO - Epoch(train)   [1][ 200/2566]  base_lr: 7.9683e-04 lr: 7.9683e-04  eta: 5 days, 15:58:32  time: 0.6345  data_time: 0.0032  memory: 12540  loss: 2.1094  loss_cls: 1.1339  loss_bbox: 0.9755
09/13 13:29:46 - mmengine - INFO - Epoch(train)   [1][ 250/2566]  base_lr: 9.9703e-04 lr: 9.9703e-04  eta: 5 days, 15:54:48  time: 0.6348  data_time: 0.0033  memory: 12980  loss: 1.9955  loss_cls: 1.1089  loss_bbox: 0.8866
09/13 13:30:18 - mmengine - INFO - Epoch(train)   [1][ 300/2566]  base_lr: 1.1972e-03 lr: 1.1972e-03  eta: 5 days, 15:55:26  time: 0.6364  data_time: 0.0032  memory: 12692  loss: 2.0213  loss_cls: 1.1336  loss_bbox: 0.8877
09/13 13:30:49 - mmengine - INFO - Epoch(train)   [1][ 350/2566]  base_lr: 1.3974e-03 lr: 1.3974e-03  eta: 5 days, 16:01:42  time: 0.6396  data_time: 0.0032  memory: 12683  loss: 2.0868  loss_cls: 1.1701  loss_bbox: 0.9166
09/13 13:31:21 - mmengine - INFO - Epoch(train)   [1][ 400/2566]  base_lr: 1.5976e-03 lr: 1.5976e-03  eta: 5 days, 15:55:55  time: 0.6332  data_time: 0.0032  memory: 12370  loss: 2.0334  loss_cls: 1.1523  loss_bbox: 0.8811

GPU info while MMDetection training is running:

(base) root@llmgpu02:/data1/coco# nvidia-smi
Fri Sep 13 13:39:50 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10                     Off |   00000000:14:00.0 Off |                    0 |
|  0%   67C    P0            131W /  150W |   14739MiB /  23028MiB |     64%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10                     Off |   00000000:15:00.0 Off |                    0 |
|  0%   67C    P0            128W /  150W |   13793MiB /  23028MiB |     76%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10                     Off |   00000000:18:00.0 Off |                    0 |
|  0%   67C    P0            127W /  150W |   13447MiB /  23028MiB |     72%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A10                     Off |   00000000:1E:00.0 Off |                    0 |
|  0%   67C    P0            132W /  150W |   13919MiB /  23028MiB |     69%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A10                     Off |   00000000:21:00.0 Off |                    0 |
|  0%   35C    P8             11W /  150W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A10                     Off |   00000000:25:00.0 Off |                    0 |
|  0%   34C    P8              9W /  150W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A10                     Off |   00000000:2D:00.0 Off |                    0 |
|  0%   35C    P8             10W /  150W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     55440      C   ...al/miniconda3/envs/mmdet/bin/python      14730MiB |
|    1   N/A  N/A     55441      C   ...al/miniconda3/envs/mmdet/bin/python      13784MiB |
|    2   N/A  N/A     55442      C   ...al/miniconda3/envs/mmdet/bin/python      13438MiB |
|    3   N/A  N/A     55443      C   ...al/miniconda3/envs/mmdet/bin/python      13910MiB |
+-----------------------------------------------------------------------------------------+
(base) root@llmgpu02:/data1/coco#
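One way to narrow this down (a suggested sketch, not something posted in the thread) is to reproduce the load/exit cycle with the offline LLM API alone, outside Docker and the OpenAI-compatible server, reusing the model path and settings from this report:

```python
# repro_once.py: minimal reproduction of one load/generate/exit cycle with the
# offline LLM API. Running this script twice in a row (a fresh process each
# time) mimics the failing pattern without Docker or the API server, which
# helps tell a vLLM teardown problem apart from leftover driver/NCCL state.
from vllm import LLM, SamplingParams

MODEL = "/data1/xinference/modelscope/hub/qwen/Qwen2-72B-Instruct-GPTQ-Int4"

if __name__ == "__main__":
    llm = LLM(
        model=MODEL,
        tensor_parallel_size=4,
        gpu_memory_utilization=0.8,
        max_model_len=8129,  # value taken from the report's command line
    )
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=8))
    print(outputs[0].outputs[0].text)
```

If the first run works and the second hangs at engine initialization, the leftover state is between processes; checking `nvidia-smi` and `ps aux | grep vllm` between runs shows whether any worker process from the first run is still alive.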
DarkLight1337 commented 1 month ago

I used OpenMMLab's MMDetection library for multi-GPU pre-training on the COCO dataset and this exception did not occur; the training can be restarted many times. But once I use vLLM to load a model and then exit, MMDetection training can no longer start either.

Which GPUs are you using for each process?

hz20091942 commented 1 month ago

I used OpenMMLab's MMDetection library for multi-GPU pre-training on the COCO dataset and this exception did not occur; the training can be restarted many times. But once I use vLLM to load a model and then exit, MMDetection training can no longer start either.

Which GPUs are you using for each process?

GPU 0-3

DarkLight1337 commented 1 month ago

Are you using both on the same GPUs simultaneously? In the log that you showed earlier, MMDetection already uses 14 GB / 23 GB memory. Given that you've set --gpu-memory-utilization 0.8 for vLLM, there is not enough memory left in the GPUs to run both.

hz20091942 commented 1 month ago

Are you using both on the same GPUs simultaneously? In the log that you showed earlier, MMDetection already uses 14 GB / 23 GB memory. Given that you've set --gpu-memory-utilization 0.8 for vLLM, there is not enough memory left in the GPUs to run both.

No. Starting MMDetection training and using vLLM to load large models are done independently; they are not run at the same time.

youkaichao commented 1 month ago

Did you try it on another machine?