rchen19 opened this issue 2 months ago
I have produced some debugging logs, but I cannot figure out where things went wrong. For what it's worth, I was able to serve a different model, minicpm-llama3-v-2-5, in the same environment without any problems. Any help is appreciated. Thanks.
Adding debug flags and re-running the server command produced the following log:
(vllm) user@cuda0003:/export/home/dev/huggingface$ CUDA_VISIBLE_DEVICES=2,3,4,5 VLLM_LOGGING_LEVEL=DEBUG CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=TRACE VLLM_TRACE_FUNCTION=1 vllm serve llama-3-2-11b-vision-instruct --host 0.0.0.0 --port 8006 --served-model-name llama-3-2-11b-vision-instruct --tensor-parallel-size 4 --enforce_eager
INFO 10-01 13:21:43 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-01 13:21:43 api_server.py:527] args: Namespace(model_tag='llama-3-2-11b-vision-instruct', config='', host='0.0.0.0', port=8006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='llama-3-2-11b-vision-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama-3-2-11b-vision-instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=<function serve at 0x7f8ce606c5e0>)
INFO 10-01 13:21:43 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/2b6b90ba-21bd-48aa-857a-fe9e3d33af36 for IPC Path.
INFO 10-01 13:21:43 api_server.py:177] Started engine process with PID 2752520
INFO 10-01 13:21:44 config.py:899] Defaulting to use mp for distributed inference
WARNING 10-01 13:21:44 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 10-01 13:21:44 config.py:389] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-01 13:21:46 config.py:899] Defaulting to use mp for distributed inference
WARNING 10-01 13:21:46 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 10-01 13:21:46 config.py:389] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-01 13:21:46 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='llama-3-2-11b-vision-instruct', speculative_config=None, tokenizer='llama-3-2-11b-vision-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama-3-2-11b-vision-instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 10-01 13:21:46 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 10-01 13:21:47 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=2752869) WARNING 10-01 13:21:47 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:47 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b74bec7046e243f5a4c285d2c6cce1d9/VLLM_TRACE_FUNCTION_for_process_2752869_thread_139763790509888_at_2024-10-01_13:21:47.037250.log
(VllmWorkerProcess pid=2752870) WARNING 10-01 13:21:47 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:47 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b74bec7046e243f5a4c285d2c6cce1d9/VLLM_TRACE_FUNCTION_for_process_2752870_thread_139763790509888_at_2024-10-01_13:21:47.043849.log
WARNING 10-01 13:21:47 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 10-01 13:21:47 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b74bec7046e243f5a4c285d2c6cce1d9/VLLM_TRACE_FUNCTION_for_process_2752520_thread_139763790509888_at_2024-10-01_13:21:47.046676.log
(VllmWorkerProcess pid=2752871) WARNING 10-01 13:21:47 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:47 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b74bec7046e243f5a4c285d2c6cce1d9/VLLM_TRACE_FUNCTION_for_process_2752871_thread_139763790509888_at_2024-10-01_13:21:47.047104.log
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:47 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 10-01 13:21:47 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:47 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:47 selector.py:116] Using XFormers backend.
INFO 10-01 13:21:47 selector.py:116] Using XFormers backend.
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:47 selector.py:116] Using XFormers backend.
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:47 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:47 selector.py:116] Using XFormers backend.
/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2752870) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752870) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2752869) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752869) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2752871) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752871) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2752870) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752870) @torch.library.impl_abstract("xformers_flash::flash_bwd")
/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=2752869) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752869) @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=2752871) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752871) @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:49 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:49 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:49 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
DEBUG 10-01 13:21:50 parallel_state.py:937] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:34617 backend=nccl
[W1001 13:21:50.261356939 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:34617 (errno: 97 - Address family not supported by protocol).
(VllmWorkerProcess pid=2752869) DEBUG 10-01 13:21:50 parallel_state.py:937] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:34617 backend=nccl
[W1001 13:21:50.365314920 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.xxx.com]:34617 (errno: 97 - Address family not supported by protocol).
(VllmWorkerProcess pid=2752871) DEBUG 10-01 13:21:50 parallel_state.py:937] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:34617 backend=nccl
(VllmWorkerProcess pid=2752870) DEBUG 10-01 13:21:50 parallel_state.py:937] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:34617 backend=nccl
[W1001 13:21:50.390135801 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:34617 (errno: 97 - Address family not supported by protocol).
[W1001 13:21:50.393598072 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.xxx.com]:34617 (errno: 97 - Address family not supported by protocol).
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:50 utils.py:992] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:50 utils.py:992] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:50 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:50 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 10-01 13:21:50 utils.py:992] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:50 utils.py:992] Found nccl from library libnccl.so.2
INFO 10-01 13:21:50 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:50 pynccl.py:63] vLLM is using nccl==2.20.5
cuda0003:2752520:2752520 [0] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2752520:2752520 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2752520:2752520 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda12.4
cuda0003:2752871:2752871 [3] NCCL INFO cudaDriverVersion 12020
cuda0003:2752869:2752869 [1] NCCL INFO cudaDriverVersion 12020
cuda0003:2752870:2752870 [2] NCCL INFO cudaDriverVersion 12020
cuda0003:2752871:2752871 [3] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2752869:2752869 [1] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2752870:2752870 [2] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2752871:2752871 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2752869:2752869 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2752870:2752870 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2752869:2752869 [1] NCCL INFO NET/IB : No device found.
cuda0003:2752869:2752869 [1] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2752869:2752869 [1] NCCL INFO Using non-device net plugin version 0
cuda0003:2752869:2752869 [1] NCCL INFO Using network Socket
cuda0003:2752871:2752871 [3] NCCL INFO NET/IB : No device found.
cuda0003:2752871:2752871 [3] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2752871:2752871 [3] NCCL INFO Using non-device net plugin version 0
cuda0003:2752871:2752871 [3] NCCL INFO Using network Socket
cuda0003:2752520:2752520 [0] NCCL INFO NET/IB : No device found.
cuda0003:2752520:2752520 [0] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2752520:2752520 [0] NCCL INFO Using non-device net plugin version 0
cuda0003:2752520:2752520 [0] NCCL INFO Using network Socket
cuda0003:2752870:2752870 [2] NCCL INFO NET/IB : No device found.
cuda0003:2752870:2752870 [2] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2752870:2752870 [2] NCCL INFO Using non-device net plugin version 0
cuda0003:2752870:2752870 [2] NCCL INFO Using network Socket
cuda0003:2752870:2752870 [2] NCCL INFO comm 0x5599765d84d0 rank 2 nranks 4 cudaDev 2 nvmlDev 4 busId 81000 commId 0xd493450290593fa - Init START
cuda0003:2752869:2752869 [1] NCCL INFO comm 0x5599765a7070 rank 1 nranks 4 cudaDev 1 nvmlDev 3 busId 61000 commId 0xd493450290593fa - Init START
cuda0003:2752520:2752520 [0] NCCL INFO comm 0x5599765a56d0 rank 0 nranks 4 cudaDev 0 nvmlDev 2 busId 41000 commId 0xd493450290593fa - Init START
cuda0003:2752871:2752871 [3] NCCL INFO comm 0x5599765dd3b0 rank 3 nranks 4 cudaDev 3 nvmlDev 5 busId a1000 commId 0xd493450290593fa - Init START
cuda0003:2752870:2752870 [2] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
cuda0003:2752869:2752869 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
cuda0003:2752871:2752871 [3] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
cuda0003:2752520:2752520 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
cuda0003:2752870:2752870 [2] NCCL INFO Setting affinity for GPU 4 to ffff0000,ffff0000
cuda0003:2752869:2752869 [1] NCCL INFO Setting affinity for GPU 3 to ffff,0000ffff
cuda0003:2752869:2752869 [1] NCCL INFO NVLS multicast support is not available on dev 1
cuda0003:2752871:2752871 [3] NCCL INFO Setting affinity for GPU 5 to ffff0000,ffff0000
cuda0003:2752520:2752520 [0] NCCL INFO Setting affinity for GPU 2 to ffff,0000ffff
cuda0003:2752520:2752520 [0] NCCL INFO NVLS multicast support is not available on dev 0
cuda0003:2752871:2752871 [3] NCCL INFO NVLS multicast support is not available on dev 3
cuda0003:2752870:2752870 [2] NCCL INFO NVLS multicast support is not available on dev 2
cuda0003:2752869:2752869 [1] NCCL INFO comm 0x5599765a7070 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
cuda0003:2752870:2752870 [2] NCCL INFO comm 0x5599765d84d0 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
cuda0003:2752869:2752869 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
cuda0003:2752520:2752520 [0] NCCL INFO comm 0x5599765a56d0 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
cuda0003:2752869:2752869 [1] NCCL INFO P2P Chunksize set to 524288
cuda0003:2752871:2752871 [3] NCCL INFO comm 0x5599765dd3b0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
cuda0003:2752520:2752520 [0] NCCL INFO Channel 00/02 : 0 1 2 3
cuda0003:2752870:2752870 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
cuda0003:2752870:2752870 [2] NCCL INFO P2P Chunksize set to 524288
cuda0003:2752520:2752520 [0] NCCL INFO Channel 01/02 : 0 1 2 3
cuda0003:2752871:2752871 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
cuda0003:2752520:2752520 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
cuda0003:2752871:2752871 [3] NCCL INFO P2P Chunksize set to 524288
cuda0003:2752520:2752520 [0] NCCL INFO P2P Chunksize set to 524288
cuda0003:2752869:2752869 [1] NCCL INFO Channel 00/0 : 1[3] -> 2[4] via P2P/IPC
cuda0003:2752520:2752520 [0] NCCL INFO Channel 00/0 : 0[2] -> 1[3] via P2P/IPC
cuda0003:2752870:2752870 [2] NCCL INFO Channel 00/0 : 2[4] -> 3[5] via P2P/IPC
cuda0003:2752871:2752871 [3] NCCL INFO Channel 00/0 : 3[5] -> 0[2] via P2P/IPC
cuda0003:2752869:2752869 [1] NCCL INFO Channel 01/0 : 1[3] -> 2[4] via P2P/IPC
cuda0003:2752520:2752520 [0] NCCL INFO Channel 01/0 : 0[2] -> 1[3] via P2P/IPC
cuda0003:2752870:2752870 [2] NCCL INFO Channel 01/0 : 2[4] -> 3[5] via P2P/IPC
cuda0003:2752871:2752871 [3] NCCL INFO Channel 01/0 : 3[5] -> 0[2] via P2P/IPC
cuda0003:2752869:2752869 [1] NCCL INFO Connected all rings
cuda0003:2752520:2752520 [0] NCCL INFO Connected all rings
cuda0003:2752870:2752870 [2] NCCL INFO Connected all rings
cuda0003:2752871:2752871 [3] NCCL INFO Connected all rings
cuda0003:2752871:2752871 [3] NCCL INFO Channel 00/0 : 3[5] -> 2[4] via P2P/IPC
cuda0003:2752869:2752869 [1] NCCL INFO Channel 00/0 : 1[3] -> 0[2] via P2P/IPC
cuda0003:2752871:2752871 [3] NCCL INFO Channel 01/0 : 3[5] -> 2[4] via P2P/IPC
cuda0003:2752869:2752869 [1] NCCL INFO Channel 01/0 : 1[3] -> 0[2] via P2P/IPC
cuda0003:2752870:2752870 [2] NCCL INFO Channel 00/0 : 2[4] -> 1[3] via P2P/IPC
cuda0003:2752870:2752870 [2] NCCL INFO Channel 01/0 : 2[4] -> 1[3] via P2P/IPC
cuda0003:2752520:2752520 [0] NCCL INFO Connected all trees
cuda0003:2752520:2752520 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2752520:2752520 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2752869:2752869 [1] NCCL INFO Connected all trees
cuda0003:2752869:2752869 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2752869:2752869 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2752871:2752871 [3] NCCL INFO Connected all trees
cuda0003:2752871:2752871 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2752870:2752870 [2] NCCL INFO Connected all trees
cuda0003:2752871:2752871 [3] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2752870:2752870 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2752870:2752870 [2] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2752870:2752870 [2] NCCL INFO comm 0x5599765d84d0 rank 2 nranks 4 cudaDev 2 nvmlDev 4 busId 81000 commId 0xd493450290593fa - Init COMPLETE
cuda0003:2752520:2752520 [0] NCCL INFO comm 0x5599765a56d0 rank 0 nranks 4 cudaDev 0 nvmlDev 2 busId 41000 commId 0xd493450290593fa - Init COMPLETE
cuda0003:2752871:2752871 [3] NCCL INFO comm 0x5599765dd3b0 rank 3 nranks 4 cudaDev 3 nvmlDev 5 busId a1000 commId 0xd493450290593fa - Init COMPLETE
cuda0003:2752869:2752869 [1] NCCL INFO comm 0x5599765a7070 rank 1 nranks 4 cudaDev 1 nvmlDev 3 busId 61000 commId 0xd493450290593fa - Init COMPLETE
DEBUG 10-01 13:21:54 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-01 13:22:04 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-01 13:22:14 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-01 13:22:24 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-01 13:22:34 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-01 13:22:44 client.py:164] Waiting for output from MQLLMEngine.
And running the test script from the troubleshooting guide generates the following output (a minimal sketch of the script itself is included after the output):
(vllm) user@cuda0003:/export/home/scratch/user/dev/huggingface$ CUDA_VISIBLE_DEVICES=2,3,4,5 NCCL_DEBUG=TRACE torchrun --nproc-per-node 4 test_vllm.py
W1001 15:56:02.450000 140287812335424 torch/distributed/run.py:779]
W1001 15:56:02.450000 140287812335424 torch/distributed/run.py:779] *****************************************
W1001 15:56:02.450000 140287812335424 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1001 15:56:02.450000 140287812335424 torch/distributed/run.py:779] *****************************************
[W1001 15:56:02.572902975 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.xxx.com]:29500 (errno: 97 - Address family not supported by protocol).
[W1001 15:56:03.655999094 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W1001 15:56:03.664883302 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.xxx.com]:29500 (errno: 97 - Address family not supported by protocol).
[W1001 15:56:03.699935605 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W1001 15:56:03.701742213 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.xxx.com]:29500 (errno: 97 - Address family not supported by protocol).
cuda0003:2949601:2949601 [0] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2949601:2949601 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2949601:2949601 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda12.4
cuda0003:2949602:2949602 [1] NCCL INFO cudaDriverVersion 12020
cuda0003:2949603:2949603 [2] NCCL INFO cudaDriverVersion 12020
cuda0003:2949602:2949602 [1] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2949603:2949603 [2] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2949602:2949602 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2949603:2949603 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2949604:2949604 [3] NCCL INFO cudaDriverVersion 12020
cuda0003:2949604:2949604 [3] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2949604:2949604 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2949601:2949642 [0] NCCL INFO NET/IB : No device found.
cuda0003:2949601:2949642 [0] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2949601:2949642 [0] NCCL INFO Using non-device net plugin version 0
cuda0003:2949601:2949642 [0] NCCL INFO Using network Socket
cuda0003:2949602:2949643 [1] NCCL INFO NET/IB : No device found.
cuda0003:2949602:2949643 [1] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2949602:2949643 [1] NCCL INFO Using non-device net plugin version 0
cuda0003:2949602:2949643 [1] NCCL INFO Using network Socket
cuda0003:2949603:2949644 [2] NCCL INFO NET/IB : No device found.
cuda0003:2949603:2949644 [2] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2949603:2949644 [2] NCCL INFO Using non-device net plugin version 0
cuda0003:2949603:2949644 [2] NCCL INFO Using network Socket
cuda0003:2949604:2949645 [3] NCCL INFO NET/IB : No device found.
cuda0003:2949604:2949645 [3] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2949604:2949645 [3] NCCL INFO Using non-device net plugin version 0
cuda0003:2949604:2949645 [3] NCCL INFO Using network Socket
cuda0003:2949602:2949643 [1] NCCL INFO comm 0x5649b30cf960 rank 1 nranks 4 cudaDev 1 nvmlDev 3 busId 61000 commId 0x795520212f433144 - Init START
cuda0003:2949601:2949642 [0] NCCL INFO comm 0x55fb6f3ce680 rank 0 nranks 4 cudaDev 0 nvmlDev 2 busId 41000 commId 0x795520212f433144 - Init START
cuda0003:2949603:2949644 [2] NCCL INFO comm 0x55703115d1f0 rank 2 nranks 4 cudaDev 2 nvmlDev 4 busId 81000 commId 0x795520212f433144 - Init START
cuda0003:2949604:2949645 [3] NCCL INFO comm 0x55991793c100 rank 3 nranks 4 cudaDev 3 nvmlDev 5 busId a1000 commId 0x795520212f433144 - Init START
cuda0003:2949603:2949644 [2] NCCL INFO Setting affinity for GPU 4 to ffff0000,ffff0000
cuda0003:2949603:2949644 [2] NCCL INFO NVLS multicast support is not available on dev 2
cuda0003:2949604:2949645 [3] NCCL INFO Setting affinity for GPU 5 to ffff0000,ffff0000
cuda0003:2949604:2949645 [3] NCCL INFO NVLS multicast support is not available on dev 3
cuda0003:2949601:2949642 [0] NCCL INFO Setting affinity for GPU 2 to ffff,0000ffff
cuda0003:2949601:2949642 [0] NCCL INFO NVLS multicast support is not available on dev 0
cuda0003:2949602:2949643 [1] NCCL INFO Setting affinity for GPU 3 to ffff,0000ffff
cuda0003:2949602:2949643 [1] NCCL INFO NVLS multicast support is not available on dev 1
cuda0003:2949601:2949642 [0] NCCL INFO comm 0x55fb6f3ce680 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
cuda0003:2949601:2949642 [0] NCCL INFO Channel 00/02 : 0 1 2 3
cuda0003:2949601:2949642 [0] NCCL INFO Channel 01/02 : 0 1 2 3
cuda0003:2949604:2949645 [3] NCCL INFO comm 0x55991793c100 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
cuda0003:2949601:2949642 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
cuda0003:2949602:2949643 [1] NCCL INFO comm 0x5649b30cf960 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
cuda0003:2949603:2949644 [2] NCCL INFO comm 0x55703115d1f0 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
cuda0003:2949601:2949642 [0] NCCL INFO P2P Chunksize set to 524288
cuda0003:2949604:2949645 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
cuda0003:2949602:2949643 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
cuda0003:2949603:2949644 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
cuda0003:2949604:2949645 [3] NCCL INFO P2P Chunksize set to 524288
cuda0003:2949602:2949643 [1] NCCL INFO P2P Chunksize set to 524288
cuda0003:2949603:2949644 [2] NCCL INFO P2P Chunksize set to 524288
cuda0003:2949602:2949643 [1] NCCL INFO Channel 00/0 : 1[3] -> 2[4] via P2P/CUMEM
cuda0003:2949604:2949645 [3] NCCL INFO Channel 00/0 : 3[5] -> 0[2] via P2P/CUMEM
cuda0003:2949601:2949642 [0] NCCL INFO Channel 00/0 : 0[2] -> 1[3] via P2P/CUMEM
cuda0003:2949602:2949643 [1] NCCL INFO Channel 01/0 : 1[3] -> 2[4] via P2P/CUMEM
cuda0003:2949604:2949645 [3] NCCL INFO Channel 01/0 : 3[5] -> 0[2] via P2P/CUMEM
cuda0003:2949601:2949642 [0] NCCL INFO Channel 01/0 : 0[2] -> 1[3] via P2P/CUMEM
cuda0003:2949603:2949644 [2] NCCL INFO Channel 00/0 : 2[4] -> 3[5] via P2P/CUMEM
cuda0003:2949603:2949644 [2] NCCL INFO Channel 01/0 : 2[4] -> 3[5] via P2P/CUMEM
cuda0003:2949604:2949645 [3] NCCL INFO Connected all rings
cuda0003:2949604:2949645 [3] NCCL INFO Channel 00/0 : 3[5] -> 2[4] via P2P/CUMEM
cuda0003:2949604:2949645 [3] NCCL INFO Channel 01/0 : 3[5] -> 2[4] via P2P/CUMEM
cuda0003:2949603:2949644 [2] NCCL INFO Connected all rings
cuda0003:2949601:2949642 [0] NCCL INFO Connected all rings
cuda0003:2949603:2949644 [2] NCCL INFO Channel 00/0 : 2[4] -> 1[3] via P2P/CUMEM
cuda0003:2949603:2949644 [2] NCCL INFO Channel 01/0 : 2[4] -> 1[3] via P2P/CUMEM
cuda0003:2949602:2949643 [1] NCCL INFO Connected all rings
cuda0003:2949602:2949643 [1] NCCL INFO Channel 00/0 : 1[3] -> 0[2] via P2P/CUMEM
cuda0003:2949602:2949643 [1] NCCL INFO Channel 01/0 : 1[3] -> 0[2] via P2P/CUMEM
cuda0003:2949604:2949645 [3] NCCL INFO Connected all trees
cuda0003:2949604:2949645 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2949604:2949645 [3] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2949601:2949642 [0] NCCL INFO Connected all trees
cuda0003:2949603:2949644 [2] NCCL INFO Connected all trees
cuda0003:2949601:2949642 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2949601:2949642 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2949603:2949644 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2949602:2949643 [1] NCCL INFO Connected all trees
cuda0003:2949603:2949644 [2] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2949602:2949643 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2949602:2949643 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2949603:2949644 [2] NCCL INFO comm 0x55703115d1f0 rank 2 nranks 4 cudaDev 2 nvmlDev 4 busId 81000 commId 0x795520212f433144 - Init COMPLETE
cuda0003:2949602:2949643 [1] NCCL INFO comm 0x5649b30cf960 rank 1 nranks 4 cudaDev 1 nvmlDev 3 busId 61000 commId 0x795520212f433144 - Init COMPLETE
cuda0003:2949604:2949645 [3] NCCL INFO comm 0x55991793c100 rank 3 nranks 4 cudaDev 3 nvmlDev 5 busId a1000 commId 0x795520212f433144 - Init COMPLETE
cuda0003:2949601:2949642 [0] NCCL INFO comm 0x55fb6f3ce680 rank 0 nranks 4 cudaDev 0 nvmlDev 2 busId 41000 commId 0x795520212f433144 - Init COMPLETE
[rank2]:[E1001 16:06:04.537131842 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
[rank2]:[E1001 16:06:04.537356909 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E1001 16:06:04.539355394 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600012 milliseconds before timing out.
[rank0]:[E1001 16:06:04.539557761 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1001 16:06:04.547113185 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600019 milliseconds before timing out.
[rank1]:[E1001 16:06:04.547325453 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1001 16:06:04.556502147 ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600028 milliseconds before timing out.
[rank3]:[E1001 16:06:04.556720024 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]: Traceback (most recent call last):
[rank2]: File "/export/home/scratch/user/dev/huggingface/test_vllm.py", line 12, in <module>
[rank2]: assert value == world_size, f"Expected {world_size}, got {value}"
[rank2]: AssertionError: Expected 4, got 0.0
cuda0003:2949601:2949647 [0] NCCL INFO [Service thread] Connection closed by localRank 0
[rank3]: Traceback (most recent call last):
[rank3]: File "/export/home/scratch/user/dev/huggingface/test_vllm.py", line 12, in <module>
[rank3]: assert value == world_size, f"Expected {world_size}, got {value}"
[rank3]: AssertionError: Expected 4, got 0.0
[rank1]: Traceback (most recent call last):
[rank1]: File "/export/home/scratch/user/dev/huggingface/test_vllm.py", line 12, in <module>
[rank1]: assert value == world_size, f"Expected {world_size}, got {value}"
[rank1]: AssertionError: Expected 4, got 0.0
[rank0]: Traceback (most recent call last):
[rank0]: File "/export/home/scratch/user/dev/huggingface/test_vllm.py", line 12, in <module>
[rank0]: assert value == world_size, f"Expected {world_size}, got {value}"
[rank0]: AssertionError: Expected 4, got 1.0
cuda0003:2949601:2949631 [0] NCCL INFO comm 0x55fb6f3ce680 rank 0 nranks 4 cudaDev 0 busId 41000 - Abort COMPLETE
[rank0]:[E1001 16:06:04.787342035 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E1001 16:06:04.787360155 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1001 16:06:04.787367634 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1001 16:06:04.788294613 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600012 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6155dcff86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f61570cc8d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f61570d3313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f61570d56fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f61a487fdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f61a5be9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f61a59aa353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600012 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6155dcff86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f61570cc8d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f61570d3313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f61570d56fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f61a487fdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f61a5be9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f61a59aa353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6155dcff86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f6156d5ea84 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f61a487fdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f61a5be9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f61a59aa353 in /lib/x86_64-linux-gnu/libc.so.6)
cuda0003:2949603:2949650 [2] NCCL INFO [Service thread] Connection closed by localRank 2
cuda0003:2949603:2949625 [0] NCCL INFO comm 0x55703115d1f0 rank 2 nranks 4 cudaDev 2 busId 81000 - Abort COMPLETE
[rank2]:[E1001 16:06:04.857583158 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1001 16:06:04.857606318 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1001 16:06:04.857614068 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1001 16:06:04.858765584 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb77fbfcf86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb780ef98d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fb780f00313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb780f026fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7fb7ce6acdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fb7cfa16609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fb7cf7d7353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb77fbfcf86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb780ef98d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fb780f00313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb780f026fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7fb7ce6acdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fb7cfa16609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fb7cf7d7353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb77fbfcf86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7fb780b8ba84 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7fb7ce6acdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7fb7cfa16609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fb7cf7d7353 in /lib/x86_64-linux-gnu/libc.so.6)
cuda0003:2949604:2949648 [3] NCCL INFO [Service thread] Connection closed by localRank 3
cuda0003:2949602:2949649 [1] NCCL INFO [Service thread] Connection closed by localRank 1
cuda0003:2949604:2949623 [0] NCCL INFO comm 0x55991793c100 rank 3 nranks 4 cudaDev 3 busId a1000 - Abort COMPLETE
cuda0003:2949602:2949629 [0] NCCL INFO comm 0x5649b30cf960 rank 1 nranks 4 cudaDev 1 busId 61000 - Abort COMPLETE
[rank3]:[E1001 16:06:04.889919061 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1001 16:06:04.889949961 ProcessGroupNCCL.cpp:621] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1001 16:06:04.889963761 ProcessGroupNCCL.cpp:627] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E1001 16:06:04.891761908 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600028 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f41ced77f86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f41d00748d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f41d007b313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f41d007d6fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f421d827df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f421eb91609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f421e952353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
[rank1]:[E1001 16:06:04.892371650 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1001 16:06:04.892395890 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1001 16:06:04.892420760 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
what(): [PG 0 (default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600028 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f41ced77f86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f41d00748d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f41d007b313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f41d007d6fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f421d827df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f421eb91609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f421e952353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f41ced77f86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f41cfd06a84 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f421d827df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f421eb91609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f421e952353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E1001 16:06:04.893556365 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600019 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff248c6af86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7ff249f678d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ff249f6e313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff249f706fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7ff29771adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7ff298a88609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7ff298849353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600019 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff248c6af86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7ff249f678d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ff249f6e313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff249f706fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7ff29771adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7ff298a88609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7ff298849353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff248c6af86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7ff249bf9a84 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7ff29771adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7ff298a88609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7ff298849353 in /lib/x86_64-linux-gnu/libc.so.6)
W1001 16:06:04.960000 140287812335424 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2949602 closing signal SIGTERM
W1001 16:06:04.960000 140287812335424 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2949603 closing signal SIGTERM
W1001 16:06:04.960000 140287812335424 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2949604 closing signal SIGTERM
E1001 16:06:05.125000 140287812335424 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 2949601) of binary: /export/home/cuda00032/user/miniforge3/envs/vllm/bin/python3.11
Traceback (most recent call last):
File "/export/home/cuda00032/user/miniforge3/envs/vllm/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
test_vllm.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-01_16:06:04
host : Cuda0003.XX.XXX.COM
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 2949601)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 2949601
========================================================
Can you show the full output of collect_env.py? It should show the vLLM version and your GPU topology towards the end. It seems there is something wrong with the latter.
You are right, I didn't realize the output was truncated; below is the full output. It says "vLLM Version: 0.6.1.dev238+ge2c6e0a82", but pip list | grep vllm gives vllm 0.6.2.
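(Side note on the version mismatch: a quick way to check which vLLM build the interpreter in this environment actually imports, independent of pip's metadata, is something like the snippet below. If the two disagree, pip's record and the package actually on the import path usually come from different installs or builds.)

    # print the version of the vllm package that this Python environment imports
    import vllm
    print(vllm.__version__)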
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-169-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
GPU 3: NVIDIA RTX A6000
GPU 4: NVIDIA RTX A6000
GPU 5: NVIDIA RTX A6000
GPU 6: NVIDIA RTX A6000
GPU 7: NVIDIA RTX A6000
Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7313 16-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 3355.300
CPU max MHz: 3000.0000
CPU min MHz: 1500.0000
BogoMIPS: 6000.12
Virtualization: AMD-V
L1d cache: 1 MiB
L1i cache: 1 MiB
L2 cache: 16 MiB
L3 cache: 256 MiB
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.1
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.68 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.45.1 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.1.dev238+ge2c6e0a82
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV4 SYS SYS SYS SYS SYS SYS 0-15,32-47 0 N/A
GPU1 NV4 X SYS SYS SYS SYS SYS SYS 0-15,32-47 0 N/A
GPU2 SYS SYS X SYS SYS SYS SYS SYS 0-15,32-47 0 N/A
GPU3 SYS SYS SYS X SYS SYS SYS SYS 0-15,32-47 0 N/A
GPU4 SYS SYS SYS SYS X NV4 SYS SYS 16-31,48-63 N/A N/A
GPU5 SYS SYS SYS SYS NV4 X SYS SYS 16-31,48-63 N/A N/A
GPU6 SYS SYS SYS SYS SYS SYS X NV4 16-31,48-63 N/A N/A
GPU7 SYS SYS SYS SYS SYS SYS NV4 X 16-31,48-63 N/A N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
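For what it's worth, in this topology most pairs among GPUs 2-5 (the devices mapped in by CUDA_VISIBLE_DEVICES=2,3,4,5) connect only via SYS, i.e. across the inter-socket link; only GPU4 and GPU5 share an NVLink. A quick way to probe CUDA peer-to-peer access between the visible devices is sketched below (run it with the same CUDA_VISIBLE_DEVICES so local indices 0-3 map to GPUs 2-5; this is a generic diagnostic, not something from the vLLM docs):

    # probe peer-to-peer access between all GPUs visible to this process
    import torch

    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f"cuda:{i} -> cuda:{j}: peer access {'available' if ok else 'unavailable'}")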
@youkaichao can you take a look at this?
AssertionError: Expected 4, got 1.0
It looks like some GPUs are broken: the allreduce correctness test fails. Can you try the rest of the GPUs, or try some other machines?
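For context, that assertion is the kind of check the multi-GPU sanity script in the troubleshooting guide performs; presumably test_vllm.py is something along these lines (a minimal sketch, not necessarily the exact script, meant to be launched with torchrun with one process per GPU):

    # each rank contributes a tensor of ones; after a SUM all-reduce the mean
    # should equal the world size, so "Expected 4, got 1.0" means the four
    # ranks never actually reduced with each other
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    data = torch.FloatTensor([1.0] * 128).to("cuda")
    dist.all_reduce(data, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    value = data.mean().item()
    world_size = dist.get_world_size()
    assert value == world_size, f"Expected {world_size}, got {value}"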
AssertionError: Expected 4, got 1.0
I am fairly certain there are no hardware issues with these GPUs; we have been running jobs on them constantly. We had no problem serving minicpm-llama3-v-2-5 using vllm on one of the GPUs.
I just tried running the test script on a single GPU individually by: CUDA_VISIBLE_DEVICES=4 NCCL_DEBUG=TRACE torchrun --nproc-per-node=1 test_vllm.py, and for each of the GPUs separately, I am getting the output below. Looks like an AttributeError: 'PyNcclCommunicator' object has no attribute 'device' caused the subprocess to fail.
OK, this is because when using a single GPU, the init method of PyNcclCommunicator will return before the device attribute is set, so this is expected behavior.
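In other words, roughly this pattern (an illustrative sketch only; the constructor arguments are made up, and this is not vLLM's actual source):

    # sketch of an early-return constructor that leaves self.device unset
    # when there is only one rank, matching the AttributeError above
    class PyNcclCommunicator:
        def __init__(self, world_size: int, device: str):
            self.world_size = world_size
            if self.world_size == 1:
                # a single rank has nothing to communicate with, so NCCL setup
                # is skipped and self.device is never assigned
                self.available = False
                return
            self.device = device
            # ... NCCL communicator initialization would go here ...

    comm = PyNcclCommunicator(world_size=1, device="cuda:0")
    print(hasattr(comm, "device"))  # False -> accessing comm.device raises AttributeError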
It is still unclear to me if there is something wrong with communication between the GPUs, since serving a model on a single GPU using vllm does not seem to have an issue. And I have tried serving a llava model (llava-onevision-qwen2-7b-si) with SGLang on multiple GPUs without a problem.
CUDA_VISIBLE_DEVICES=4 NCCL_DEBUG=TRACE torchrun --nproc-per-node=1 test_vllm.py
The script needs to be run with multiple GPUs, e.g. CUDA_VISIBLE_DEVICES=2,3,4,5 NCCL_DEBUG=TRACE torchrun --nproc-per-node=4 test_vllm.py; with a single process there is no inter-GPU communication to test.
Your current environment
How would you like to use vllm
I am trying to serve llama-3-2-11b-vision-instruct with 2 out of 8 NVIDIA RTX A6000 GPUs, and it hangs without loading the model weights; below is the terminal output. Also, I was able to serve a different model, minicpm-llama3-v-2-5, using the same environment without a problem.