Open rchen19 opened 1 day ago
Please go through the troubleshooting guide first and see if it can solve your problem.
I have produced some debugging logs, but I cannot figure out where it went wrong. And btw, I was able to serve a different model minicpm-llama3-v-2-5
using the same environment without a problem. Any help is appreciated. Thanks.
Adding debug flags and re-running the server command produced the following log
(vllm) user@cuda0003:/export/home/dev/huggingface$ CUDA_VISIBLE_DEVICES=2,3,4,5 VLLM_LOGGING_LEVEL=DEBUG CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=TRACE VLLM_TRACE_FUNCTION=1 vl
lm serve llama-3-2-11b-vision-instruct --host 0.0.0.0 --port 8006 --served-model-name llama-3-2-11b-vision-instruct --tensor-parallel-size 4 --enforce_eager
INFO 10-01 13:21:43 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-01 13:21:43 api_server.py:527] args: Namespace(model_tag='llama-3-2-11b-vision-instruct', config='', host='0.0.0.0', port=8006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='llama-3-2-11b-vision-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama-3-2-11b-vision-instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=<function serve at 0x7f8ce606c5e0>)
INFO 10-01 13:21:43 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/2b6b90ba-21bd-48aa-857a-fe9e3d33af36 for IPC Path.
INFO 10-01 13:21:43 api_server.py:177] Started engine process with PID 2752520
INFO 10-01 13:21:44 config.py:899] Defaulting to use mp for distributed inference
WARNING 10-01 13:21:44 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 10-01 13:21:44 config.py:389] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-01 13:21:46 config.py:899] Defaulting to use mp for distributed inference
WARNING 10-01 13:21:46 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 10-01 13:21:46 config.py:389] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-01 13:21:46 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='llama-3-2-11b-vision-instruct', speculative_config=None, tokenizer='llama-3-2-11b-vision-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama-3-2-11b-vision-instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 10-01 13:21:46 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 10-01 13:21:47 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=2752869) WARNING 10-01 13:21:47 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:47 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b74bec7046e243f5a4c285d2c6cce1d9/VLLM_TRACE_FUNCTION_for_process_2752869_thread_139763790509888_at_2024-10-01_13:21:47.037250.log
(VllmWorkerProcess pid=2752870) WARNING 10-01 13:21:47 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:47 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b74bec7046e243f5a4c285d2c6cce1d9/VLLM_TRACE_FUNCTION_for_process_2752870_thread_139763790509888_at_2024-10-01_13:21:47.043849.log
WARNING 10-01 13:21:47 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 10-01 13:21:47 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b74bec7046e243f5a4c285d2c6cce1d9/VLLM_TRACE_FUNCTION_for_process_2752520_thread_139763790509888_at_2024-10-01_13:21:47.046676.log
(VllmWorkerProcess pid=2752871) WARNING 10-01 13:21:47 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:47 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b74bec7046e243f5a4c285d2c6cce1d9/VLLM_TRACE_FUNCTION_for_process_2752871_thread_139763790509888_at_2024-10-01_13:21:47.047104.log
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:47 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 10-01 13:21:47 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:47 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:47 selector.py:116] Using XFormers backend.
INFO 10-01 13:21:47 selector.py:116] Using XFormers backend.
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:47 selector.py:116] Using XFormers backend.
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:47 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:47 selector.py:116] Using XFormers backend.
/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2752870) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752870) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2752869) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752869) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2752871) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752871) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2752870) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752870) @torch.library.impl_abstract("xformers_flash::flash_bwd")
/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=2752869) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752869) @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=2752871) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752871) @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:49 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:49 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:49 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
DEBUG 10-01 13:21:50 parallel_state.py:937] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:34617 backend=nccl
[W1001 13:21:50.261356939 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:34617 (errno: 97 - Address family not supported by protocol).
(VllmWorkerProcess pid=2752869) DEBUG 10-01 13:21:50 parallel_state.py:937] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:34617 backend=nccl
[W1001 13:21:50.365314920 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.xxx.com]:34617 (errno: 97 - Address family not supported by protocol).
(VllmWorkerProcess pid=2752871) DEBUG 10-01 13:21:50 parallel_state.py:937] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:34617 backend=nccl
(VllmWorkerProcess pid=2752870) DEBUG 10-01 13:21:50 parallel_state.py:937] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:34617 backend=nccl
[W1001 13:21:50.390135801 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:34617 (errno: 97 - Address family not supported by protocol).
[W1001 13:21:50.393598072 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.xxx.com]:34617 (errno: 97 - Address family not supported by protocol).
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:50 utils.py:992] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:50 utils.py:992] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:50 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:50 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 10-01 13:21:50 utils.py:992] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:50 utils.py:992] Found nccl from library libnccl.so.2
INFO 10-01 13:21:50 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:50 pynccl.py:63] vLLM is using nccl==2.20.5
cuda0003:2752520:2752520 [0] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2752520:2752520 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2752520:2752520 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda12.4
cuda0003:2752871:2752871 [3] NCCL INFO cudaDriverVersion 12020
cuda0003:2752869:2752869 [1] NCCL INFO cudaDriverVersion 12020
cuda0003:2752870:2752870 [2] NCCL INFO cudaDriverVersion 12020
cuda0003:2752871:2752871 [3] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2752869:2752869 [1] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2752870:2752870 [2] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2752871:2752871 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2752869:2752869 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2752870:2752870 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2752869:2752869 [1] NCCL INFO NET/IB : No device found.
cuda0003:2752869:2752869 [1] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2752869:2752869 [1] NCCL INFO Using non-device net plugin version 0
cuda0003:2752869:2752869 [1] NCCL INFO Using network Socket
cuda0003:2752871:2752871 [3] NCCL INFO NET/IB : No device found.
cuda0003:2752871:2752871 [3] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2752871:2752871 [3] NCCL INFO Using non-device net plugin version 0
cuda0003:2752871:2752871 [3] NCCL INFO Using network Socket
cuda0003:2752520:2752520 [0] NCCL INFO NET/IB : No device found.
cuda0003:2752520:2752520 [0] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2752520:2752520 [0] NCCL INFO Using non-device net plugin version 0
cuda0003:2752520:2752520 [0] NCCL INFO Using network Socket
cuda0003:2752870:2752870 [2] NCCL INFO NET/IB : No device found.
cuda0003:2752870:2752870 [2] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2752870:2752870 [2] NCCL INFO Using non-device net plugin version 0
cuda0003:2752870:2752870 [2] NCCL INFO Using network Socket
cuda0003:2752870:2752870 [2] NCCL INFO comm 0x5599765d84d0 rank 2 nranks 4 cudaDev 2 nvmlDev 4 busId 81000 commId 0xd493450290593fa - Init START
cuda0003:2752869:2752869 [1] NCCL INFO comm 0x5599765a7070 rank 1 nranks 4 cudaDev 1 nvmlDev 3 busId 61000 commId 0xd493450290593fa - Init START
cuda0003:2752520:2752520 [0] NCCL INFO comm 0x5599765a56d0 rank 0 nranks 4 cudaDev 0 nvmlDev 2 busId 41000 commId 0xd493450290593fa - Init START
cuda0003:2752871:2752871 [3] NCCL INFO comm 0x5599765dd3b0 rank 3 nranks 4 cudaDev 3 nvmlDev 5 busId a1000 commId 0xd493450290593fa - Init START
cuda0003:2752870:2752870 [2] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
cuda0003:2752869:2752869 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
cuda0003:2752871:2752871 [3] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
cuda0003:2752520:2752520 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
cuda0003:2752870:2752870 [2] NCCL INFO Setting affinity for GPU 4 to ffff0000,ffff0000
cuda0003:2752869:2752869 [1] NCCL INFO Setting affinity for GPU 3 to ffff,0000ffff
cuda0003:2752869:2752869 [1] NCCL INFO NVLS multicast support is not available on dev 1
cuda0003:2752871:2752871 [3] NCCL INFO Setting affinity for GPU 5 to ffff0000,ffff0000
cuda0003:2752520:2752520 [0] NCCL INFO Setting affinity for GPU 2 to ffff,0000ffff
cuda0003:2752520:2752520 [0] NCCL INFO NVLS multicast support is not available on dev 0
cuda0003:2752871:2752871 [3] NCCL INFO NVLS multicast support is not available on dev 3
cuda0003:2752870:2752870 [2] NCCL INFO NVLS multicast support is not available on dev 2
cuda0003:2752869:2752869 [1] NCCL INFO comm 0x5599765a7070 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
cuda0003:2752870:2752870 [2] NCCL INFO comm 0x5599765d84d0 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
cuda0003:2752869:2752869 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
cuda0003:2752520:2752520 [0] NCCL INFO comm 0x5599765a56d0 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
cuda0003:2752869:2752869 [1] NCCL INFO P2P Chunksize set to 524288
cuda0003:2752871:2752871 [3] NCCL INFO comm 0x5599765dd3b0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
cuda0003:2752520:2752520 [0] NCCL INFO Channel 00/02 : 0 1 2 3
cuda0003:2752870:2752870 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
cuda0003:2752870:2752870 [2] NCCL INFO P2P Chunksize set to 524288
cuda0003:2752520:2752520 [0] NCCL INFO Channel 01/02 : 0 1 2 3
cuda0003:2752871:2752871 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
cuda0003:2752520:2752520 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
cuda0003:2752871:2752871 [3] NCCL INFO P2P Chunksize set to 524288
cuda0003:2752520:2752520 [0] NCCL INFO P2P Chunksize set to 524288
cuda0003:2752869:2752869 [1] NCCL INFO Channel 00/0 : 1[3] -> 2[4] via P2P/IPC
cuda0003:2752520:2752520 [0] NCCL INFO Channel 00/0 : 0[2] -> 1[3] via P2P/IPC
cuda0003:2752870:2752870 [2] NCCL INFO Channel 00/0 : 2[4] -> 3[5] via P2P/IPC
cuda0003:2752871:2752871 [3] NCCL INFO Channel 00/0 : 3[5] -> 0[2] via P2P/IPC
cuda0003:2752869:2752869 [1] NCCL INFO Channel 01/0 : 1[3] -> 2[4] via P2P/IPC
cuda0003:2752520:2752520 [0] NCCL INFO Channel 01/0 : 0[2] -> 1[3] via P2P/IPC
cuda0003:2752870:2752870 [2] NCCL INFO Channel 01/0 : 2[4] -> 3[5] via P2P/IPC
cuda0003:2752871:2752871 [3] NCCL INFO Channel 01/0 : 3[5] -> 0[2] via P2P/IPC
cuda0003:2752869:2752869 [1] NCCL INFO Connected all rings
cuda0003:2752520:2752520 [0] NCCL INFO Connected all rings
cuda0003:2752870:2752870 [2] NCCL INFO Connected all rings
cuda0003:2752871:2752871 [3] NCCL INFO Connected all rings
cuda0003:2752871:2752871 [3] NCCL INFO Channel 00/0 : 3[5] -> 2[4] via P2P/IPC
cuda0003:2752869:2752869 [1] NCCL INFO Channel 00/0 : 1[3] -> 0[2] via P2P/IPC
cuda0003:2752871:2752871 [3] NCCL INFO Channel 01/0 : 3[5] -> 2[4] via P2P/IPC
cuda0003:2752869:2752869 [1] NCCL INFO Channel 01/0 : 1[3] -> 0[2] via P2P/IPC
cuda0003:2752870:2752870 [2] NCCL INFO Channel 00/0 : 2[4] -> 1[3] via P2P/IPC
cuda0003:2752870:2752870 [2] NCCL INFO Channel 01/0 : 2[4] -> 1[3] via P2P/IPC
cuda0003:2752520:2752520 [0] NCCL INFO Connected all trees
cuda0003:2752520:2752520 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2752520:2752520 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2752869:2752869 [1] NCCL INFO Connected all trees
cuda0003:2752869:2752869 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2752869:2752869 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2752871:2752871 [3] NCCL INFO Connected all trees
cuda0003:2752871:2752871 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2752870:2752870 [2] NCCL INFO Connected all trees
cuda0003:2752871:2752871 [3] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2752870:2752870 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2752870:2752870 [2] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2752870:2752870 [2] NCCL INFO comm 0x5599765d84d0 rank 2 nranks 4 cudaDev 2 nvmlDev 4 busId 81000 commId 0xd493450290593fa - Init COMPLETE
cuda0003:2752520:2752520 [0] NCCL INFO comm 0x5599765a56d0 rank 0 nranks 4 cudaDev 0 nvmlDev 2 busId 41000 commId 0xd493450290593fa - Init COMPLETE
cuda0003:2752871:2752871 [3] NCCL INFO comm 0x5599765dd3b0 rank 3 nranks 4 cudaDev 3 nvmlDev 5 busId a1000 commId 0xd493450290593fa - Init COMPLETE
cuda0003:2752869:2752869 [1] NCCL INFO comm 0x5599765a7070 rank 1 nranks 4 cudaDev 1 nvmlDev 3 busId 61000 commId 0xd493450290593fa - Init COMPLETE
DEBUG 10-01 13:21:54 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-01 13:22:04 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-01 13:22:14 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-01 13:22:24 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-01 13:22:34 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-01 13:22:44 client.py:164] Waiting for output from MQLLMEngine.
And running the test script (from the trouble shooting guide) generates the following output:
(vllm) user@cuda0003:/export/home/scratch/user/dev/huggingface$ CUDA_VISIBLE_DEVICES=2,3,4,5 NCCL_DEBUG=TRACE torchrun --nproc-per-node 4 test_vllm.py
W1001 15:56:02.450000 140287812335424 torch/distributed/run.py:779]
W1001 15:56:02.450000 140287812335424 torch/distributed/run.py:779] *****************************************
W1001 15:56:02.450000 140287812335424 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1001 15:56:02.450000 140287812335424 torch/distributed/run.py:779] *****************************************
[W1001 15:56:02.572902975 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.xxx.com]:29500 (errno: 97 - Address family not supported by protocol).
[W1001 15:56:03.655999094 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W1001 15:56:03.664883302 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.xxx.com]:29500 (errno: 97 - Address family not supported by protocol).
[W1001 15:56:03.699935605 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W1001 15:56:03.701742213 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.xxx.com]:29500 (errno: 97 - Address family not supported by protocol).
cuda0003:2949601:2949601 [0] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2949601:2949601 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2949601:2949601 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda12.4
cuda0003:2949602:2949602 [1] NCCL INFO cudaDriverVersion 12020
cuda0003:2949603:2949603 [2] NCCL INFO cudaDriverVersion 12020
cuda0003:2949602:2949602 [1] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2949603:2949603 [2] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2949602:2949602 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2949603:2949603 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2949604:2949604 [3] NCCL INFO cudaDriverVersion 12020
cuda0003:2949604:2949604 [3] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2949604:2949604 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2949601:2949642 [0] NCCL INFO NET/IB : No device found.
cuda0003:2949601:2949642 [0] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2949601:2949642 [0] NCCL INFO Using non-device net plugin version 0
cuda0003:2949601:2949642 [0] NCCL INFO Using network Socket
cuda0003:2949602:2949643 [1] NCCL INFO NET/IB : No device found.
cuda0003:2949602:2949643 [1] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2949602:2949643 [1] NCCL INFO Using non-device net plugin version 0
cuda0003:2949602:2949643 [1] NCCL INFO Using network Socket
cuda0003:2949603:2949644 [2] NCCL INFO NET/IB : No device found.
cuda0003:2949603:2949644 [2] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2949603:2949644 [2] NCCL INFO Using non-device net plugin version 0
cuda0003:2949603:2949644 [2] NCCL INFO Using network Socket
cuda0003:2949604:2949645 [3] NCCL INFO NET/IB : No device found.
cuda0003:2949604:2949645 [3] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2949604:2949645 [3] NCCL INFO Using non-device net plugin version 0
cuda0003:2949604:2949645 [3] NCCL INFO Using network Socket
cuda0003:2949602:2949643 [1] NCCL INFO comm 0x5649b30cf960 rank 1 nranks 4 cudaDev 1 nvmlDev 3 busId 61000 commId 0x795520212f433144 - Init START
cuda0003:2949601:2949642 [0] NCCL INFO comm 0x55fb6f3ce680 rank 0 nranks 4 cudaDev 0 nvmlDev 2 busId 41000 commId 0x795520212f433144 - Init START
cuda0003:2949603:2949644 [2] NCCL INFO comm 0x55703115d1f0 rank 2 nranks 4 cudaDev 2 nvmlDev 4 busId 81000 commId 0x795520212f433144 - Init START
cuda0003:2949604:2949645 [3] NCCL INFO comm 0x55991793c100 rank 3 nranks 4 cudaDev 3 nvmlDev 5 busId a1000 commId 0x795520212f433144 - Init START
cuda0003:2949603:2949644 [2] NCCL INFO Setting affinity for GPU 4 to ffff0000,ffff0000
cuda0003:2949603:2949644 [2] NCCL INFO NVLS multicast support is not available on dev 2
cuda0003:2949604:2949645 [3] NCCL INFO Setting affinity for GPU 5 to ffff0000,ffff0000
cuda0003:2949604:2949645 [3] NCCL INFO NVLS multicast support is not available on dev 3
cuda0003:2949601:2949642 [0] NCCL INFO Setting affinity for GPU 2 to ffff,0000ffff
cuda0003:2949601:2949642 [0] NCCL INFO NVLS multicast support is not available on dev 0
cuda0003:2949602:2949643 [1] NCCL INFO Setting affinity for GPU 3 to ffff,0000ffff
cuda0003:2949602:2949643 [1] NCCL INFO NVLS multicast support is not available on dev 1
cuda0003:2949601:2949642 [0] NCCL INFO comm 0x55fb6f3ce680 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
cuda0003:2949601:2949642 [0] NCCL INFO Channel 00/02 : 0 1 2 3
cuda0003:2949601:2949642 [0] NCCL INFO Channel 01/02 : 0 1 2 3
cuda0003:2949604:2949645 [3] NCCL INFO comm 0x55991793c100 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
cuda0003:2949601:2949642 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
cuda0003:2949602:2949643 [1] NCCL INFO comm 0x5649b30cf960 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
cuda0003:2949603:2949644 [2] NCCL INFO comm 0x55703115d1f0 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
cuda0003:2949601:2949642 [0] NCCL INFO P2P Chunksize set to 524288
cuda0003:2949604:2949645 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
cuda0003:2949602:2949643 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
cuda0003:2949603:2949644 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
cuda0003:2949604:2949645 [3] NCCL INFO P2P Chunksize set to 524288
cuda0003:2949602:2949643 [1] NCCL INFO P2P Chunksize set to 524288
cuda0003:2949603:2949644 [2] NCCL INFO P2P Chunksize set to 524288
cuda0003:2949602:2949643 [1] NCCL INFO Channel 00/0 : 1[3] -> 2[4] via P2P/CUMEM
cuda0003:2949604:2949645 [3] NCCL INFO Channel 00/0 : 3[5] -> 0[2] via P2P/CUMEM
cuda0003:2949601:2949642 [0] NCCL INFO Channel 00/0 : 0[2] -> 1[3] via P2P/CUMEM
cuda0003:2949602:2949643 [1] NCCL INFO Channel 01/0 : 1[3] -> 2[4] via P2P/CUMEM
cuda0003:2949604:2949645 [3] NCCL INFO Channel 01/0 : 3[5] -> 0[2] via P2P/CUMEM
cuda0003:2949601:2949642 [0] NCCL INFO Channel 01/0 : 0[2] -> 1[3] via P2P/CUMEM
cuda0003:2949603:2949644 [2] NCCL INFO Channel 00/0 : 2[4] -> 3[5] via P2P/CUMEM
cuda0003:2949603:2949644 [2] NCCL INFO Channel 01/0 : 2[4] -> 3[5] via P2P/CUMEM
cuda0003:2949604:2949645 [3] NCCL INFO Connected all rings
cuda0003:2949604:2949645 [3] NCCL INFO Channel 00/0 : 3[5] -> 2[4] via P2P/CUMEM
cuda0003:2949604:2949645 [3] NCCL INFO Channel 01/0 : 3[5] -> 2[4] via P2P/CUMEM
cuda0003:2949603:2949644 [2] NCCL INFO Connected all rings
cuda0003:2949601:2949642 [0] NCCL INFO Connected all rings
cuda0003:2949603:2949644 [2] NCCL INFO Channel 00/0 : 2[4] -> 1[3] via P2P/CUMEM
cuda0003:2949603:2949644 [2] NCCL INFO Channel 01/0 : 2[4] -> 1[3] via P2P/CUMEM
cuda0003:2949602:2949643 [1] NCCL INFO Connected all rings
cuda0003:2949602:2949643 [1] NCCL INFO Channel 00/0 : 1[3] -> 0[2] via P2P/CUMEM
cuda0003:2949602:2949643 [1] NCCL INFO Channel 01/0 : 1[3] -> 0[2] via P2P/CUMEM
cuda0003:2949604:2949645 [3] NCCL INFO Connected all trees
cuda0003:2949604:2949645 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2949604:2949645 [3] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2949601:2949642 [0] NCCL INFO Connected all trees
cuda0003:2949603:2949644 [2] NCCL INFO Connected all trees
cuda0003:2949601:2949642 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2949601:2949642 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2949603:2949644 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2949602:2949643 [1] NCCL INFO Connected all trees
cuda0003:2949603:2949644 [2] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2949602:2949643 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2949602:2949643 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2949603:2949644 [2] NCCL INFO comm 0x55703115d1f0 rank 2 nranks 4 cudaDev 2 nvmlDev 4 busId 81000 commId 0x795520212f433144 - Init COMPLETE
cuda0003:2949602:2949643 [1] NCCL INFO comm 0x5649b30cf960 rank 1 nranks 4 cudaDev 1 nvmlDev 3 busId 61000 commId 0x795520212f433144 - Init COMPLETE
cuda0003:2949604:2949645 [3] NCCL INFO comm 0x55991793c100 rank 3 nranks 4 cudaDev 3 nvmlDev 5 busId a1000 commId 0x795520212f433144 - Init COMPLETE
cuda0003:2949601:2949642 [0] NCCL INFO comm 0x55fb6f3ce680 rank 0 nranks 4 cudaDev 0 nvmlDev 2 busId 41000 commId 0x795520212f433144 - Init COMPLETE
[rank2]:[E1001 16:06:04.537131842 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
[rank2]:[E1001 16:06:04.537356909 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E1001 16:06:04.539355394 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600012 milliseconds before timing out.
[rank0]:[E1001 16:06:04.539557761 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1001 16:06:04.547113185 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600019 milliseconds before timing out.
[rank1]:[E1001 16:06:04.547325453 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1001 16:06:04.556502147 ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600028 milliseconds before timing out.
[rank3]:[E1001 16:06:04.556720024 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]: Traceback (most recent call last):
[rank2]: File "/export/home/scratch/user/dev/huggingface/test_vllm.py", line 12, in <module>
[rank2]: assert value == world_size, f"Expected {world_size}, got {value}"
[rank2]: AssertionError: Expected 4, got 0.0
cuda0003:2949601:2949647 [0] NCCL INFO [Service thread] Connection closed by localRank 0
[rank3]: Traceback (most recent call last):
[rank3]: File "/export/home/scratch/user/dev/huggingface/test_vllm.py", line 12, in <module>
[rank3]: assert value == world_size, f"Expected {world_size}, got {value}"
[rank3]: AssertionError: Expected 4, got 0.0
[rank1]: Traceback (most recent call last):
[rank1]: File "/export/home/scratch/user/dev/huggingface/test_vllm.py", line 12, in <module>
[rank1]: assert value == world_size, f"Expected {world_size}, got {value}"
[rank1]: AssertionError: Expected 4, got 0.0
[rank0]: Traceback (most recent call last):
[rank0]: File "/export/home/scratch/user/dev/huggingface/test_vllm.py", line 12, in <module>
[rank0]: assert value == world_size, f"Expected {world_size}, got {value}"
[rank0]: AssertionError: Expected 4, got 1.0
cuda0003:2949601:2949631 [0] NCCL INFO comm 0x55fb6f3ce680 rank 0 nranks 4 cudaDev 0 busId 41000 - Abort COMPLETE
[rank0]:[E1001 16:06:04.787342035 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E1001 16:06:04.787360155 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1001 16:06:04.787367634 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1001 16:06:04.788294613 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600012 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6155dcff86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f61570cc8d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f61570d3313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f61570d56fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f61a487fdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f61a5be9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f61a59aa353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600012 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6155dcff86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f61570cc8d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f61570d3313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f61570d56fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f61a487fdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f61a5be9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f61a59aa353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6155dcff86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f6156d5ea84 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f61a487fdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f61a5be9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f61a59aa353 in /lib/x86_64-linux-gnu/libc.so.6)
cuda0003:2949603:2949650 [2] NCCL INFO [Service thread] Connection closed by localRank 2
cuda0003:2949603:2949625 [0] NCCL INFO comm 0x55703115d1f0 rank 2 nranks 4 cudaDev 2 busId 81000 - Abort COMPLETE
[rank2]:[E1001 16:06:04.857583158 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1001 16:06:04.857606318 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1001 16:06:04.857614068 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1001 16:06:04.858765584 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb77fbfcf86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb780ef98d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fb780f00313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb780f026fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7fb7ce6acdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fb7cfa16609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fb7cf7d7353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb77fbfcf86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb780ef98d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fb780f00313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb780f026fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7fb7ce6acdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fb7cfa16609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fb7cf7d7353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb77fbfcf86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7fb780b8ba84 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7fb7ce6acdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7fb7cfa16609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fb7cf7d7353 in /lib/x86_64-linux-gnu/libc.so.6)
cuda0003:2949604:2949648 [3] NCCL INFO [Service thread] Connection closed by localRank 3
cuda0003:2949602:2949649 [1] NCCL INFO [Service thread] Connection closed by localRank 1
cuda0003:2949604:2949623 [0] NCCL INFO comm 0x55991793c100 rank 3 nranks 4 cudaDev 3 busId a1000 - Abort COMPLETE
cuda0003:2949602:2949629 [0] NCCL INFO comm 0x5649b30cf960 rank 1 nranks 4 cudaDev 1 busId 61000 - Abort COMPLETE
[rank3]:[E1001 16:06:04.889919061 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1001 16:06:04.889949961 ProcessGroupNCCL.cpp:621] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1001 16:06:04.889963761 ProcessGroupNCCL.cpp:627] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E1001 16:06:04.891761908 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600028 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f41ced77f86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f41d00748d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f41d007b313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f41d007d6fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f421d827df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f421eb91609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f421e952353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
[rank1]:[E1001 16:06:04.892371650 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1001 16:06:04.892395890 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1001 16:06:04.892420760 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
what(): [PG 0 (default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600028 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f41ced77f86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f41d00748d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f41d007b313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f41d007d6fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f421d827df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f421eb91609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f421e952353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f41ced77f86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f41cfd06a84 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f421d827df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f421eb91609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f421e952353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E1001 16:06:04.893556365 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600019 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff248c6af86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7ff249f678d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ff249f6e313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff249f706fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7ff29771adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7ff298a88609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7ff298849353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600019 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff248c6af86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7ff249f678d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ff249f6e313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff249f706fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7ff29771adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7ff298a88609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7ff298849353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff248c6af86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7ff249bf9a84 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7ff29771adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7ff298a88609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7ff298849353 in /lib/x86_64-linux-gnu/libc.so.6)
W1001 16:06:04.960000 140287812335424 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2949602 closing signal SIGTERM
W1001 16:06:04.960000 140287812335424 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2949603 closing signal SIGTERM
W1001 16:06:04.960000 140287812335424 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2949604 closing signal SIGTERM
E1001 16:06:05.125000 140287812335424 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 2949601) of binary: /export/home/cuda00032/user/miniforge3/envs/vllm/bin/python3.11
Traceback (most recent call last):
File "/export/home/cuda00032/user/miniforge3/envs/vllm/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
test_vllm.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-01_16:06:04
host : Cuda0003.XX.XXX.COM
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 2949601)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 2949601
========================================================
Can you show the full output of collect_env.py
? It should show the vLLM version and your GPU topology towards the end. It seems there is something wrong with the latter.
Your current environment
How would you like to use vllm
I am trying to serve
llama-3-2-11b-vision-instruct
with 2 out of 8 NVIDIA RTX A6000 GPUs by running:and it hangs without loading model weights, below is the terminal output. Also, I was able to serve a different model
minicpm-llama3-v-2-5
using the same environment without a problem.Before submitting a new issue...