vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Serving Llama 3.2 `llama-3-2-11b-vision-instruct` hangs #8978

Open rchen19 opened 2 months ago

rchen19 commented 2 months ago

Your current environment

The output of `python collect_env.py`
<frozen runpy>:128: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', but prior to execution of 'torch.utils.collect_env'; this may result in unpredictable behaviour
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-169-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
GPU 3: NVIDIA RTX A6000
GPU 4: NVIDIA RTX A6000
GPU 5: NVIDIA RTX A6000
GPU 6: NVIDIA RTX A6000
GPU 7: NVIDIA RTX A6000

Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      48 bits physical, 48 bits virtual
CPU(s):                             64
On-line CPU(s) list:                0-63
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          2
NUMA node(s):                       2
Vendor ID:                          AuthenticAMD
CPU family:                         25
Model:                              1
Model name:                         AMD EPYC 7313 16-Core Processor
Stepping:                           1
Frequency boost:                    enabled
CPU MHz:                            2974.488
CPU max MHz:                        3000.0000
CPU min MHz:                        1500.0000
BogoMIPS:                           6000.12
Virtualization:                     AMD-V
L1d cache:                          1 MiB
L1i cache:                          1 MiB
L2 cache:                           16 MiB
L3 cache:                           256 MiB
NUMA node0 CPU(s):                  0-15,32-47
NUMA node1 CPU(s):                  16-31,48-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] triton==3.0.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.4.0                    pypi_0    pypi
[conda] torchvision               0.19.0                   pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi

How would you like to use vllm

I am trying to serve llama-3-2-11b-vision-instruct with 2 out of 8 NVIDIA RTX A6000 GPUs by running:

CUDA_VISIBLE_DEVICES=4,6 vllm serve /huggingface/llama-3-2-11b-vision-instruct --port 8005 --served-model-name llama-3-2-11b-vision-instruct --tensor-parallel-size 2 --enforce_eager

and it hangs without ever loading the model weights; the terminal output is below, followed by a sketch of how I intend to call the server once it is up. Note that I was able to serve a different model, minicpm-llama3-v-2-5, in the same environment without any problem.

INFO 09-30 13:45:53 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 09-30 13:45:53 api_server.py:527] args: Namespace(model_tag='/huggingface/llama-3-2-11b-vision-instruct', config='', host='0.0.0.0', port=8005, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/huggingface/llama-3-2-11b-vision-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=16, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama-3-2-11b-vision-instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=<function serve at 0x7f176bf20540>)
INFO 09-30 13:45:53 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/f080339d-eab2-428e-8316-2921d359f70c for IPC Path.
INFO 09-30 13:45:53 api_server.py:177] Started engine process with PID 1140470
INFO 09-30 13:45:53 config.py:899] Defaulting to use mp for distributed inference
WARNING 09-30 13:45:53 config.py:389] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 09-30 13:45:57 config.py:899] Defaulting to use mp for distributed inference
WARNING 09-30 13:45:57 config.py:389] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 09-30 13:45:57 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/huggingface/llama-3-2-11b-vision-instruct', speculative_config=None, tokenizer='/huggingface/llama-3-2-11b-vision-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama-3-2-11b-vision-instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 09-30 13:45:57 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 09-30 13:45:57 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 09-30 13:45:57 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 09-30 13:45:57 selector.py:116] Using XFormers backend.
(VllmWorkerProcess pid=1140725) INFO 09-30 13:45:57 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=1140725) INFO 09-30 13:45:57 selector.py:116] Using XFormers backend.
/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=1140725) /miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=1140725)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=1140725) /miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=1140725)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=1140725) INFO 09-30 13:45:57 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
[W930 13:45:58.569779258 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:40013 (errno: 97 - Address family not supported by protocol).
[W930 13:45:58.804302588 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:40013 (errno: 97 - Address family not supported by protocol).
INFO 09-30 13:45:58 utils.py:992] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1140725) INFO 09-30 13:45:58 utils.py:992] Found nccl from library libnccl.so.2
INFO 09-30 13:45:58 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=1140725) INFO 09-30 13:45:58 pynccl.py:63] vLLM is using nccl==2.20.5
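
For reference, once the server is up I intend to call it through the OpenAI-compatible chat endpoint, roughly as sketched below (the port and served model name follow the command above; the API key is a dummy since none is configured, and the image URL is just a placeholder):

```python
# Minimal sketch of a client request against the server started above.
# Assumes the OpenAI-compatible endpoint on port 8005 and the name passed
# via --served-model-name; the image URL is only a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8005/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llama-3-2-11b-vision-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```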

Before submitting a new issue...

DarkLight1337 commented 2 months ago

Please go through the troubleshooting guide first and see if it can solve your problem.

rchen19 commented 1 month ago

I have produced some debugging logs, but I cannot figure out where it goes wrong. As mentioned above, I was able to serve a different model, minicpm-llama3-v-2-5, in the same environment without a problem. Any help is appreciated. Thanks.

Adding the debug flags and re-running the server command produced the following log:

(vllm) user@cuda0003:/export/home/dev/huggingface$ CUDA_VISIBLE_DEVICES=2,3,4,5 VLLM_LOGGING_LEVEL=DEBUG CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=TRACE VLLM_TRACE_FUNCTION=1 vllm serve llama-3-2-11b-vision-instruct --host 0.0.0.0 --port 8006 --served-model-name llama-3-2-11b-vision-instruct --tensor-parallel-size 4 --enforce_eager
INFO 10-01 13:21:43 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-01 13:21:43 api_server.py:527] args: Namespace(model_tag='llama-3-2-11b-vision-instruct', config='', host='0.0.0.0', port=8006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='llama-3-2-11b-vision-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama-3-2-11b-vision-instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=<function serve at 0x7f8ce606c5e0>)
INFO 10-01 13:21:43 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/2b6b90ba-21bd-48aa-857a-fe9e3d33af36 for IPC Path.
INFO 10-01 13:21:43 api_server.py:177] Started engine process with PID 2752520
INFO 10-01 13:21:44 config.py:899] Defaulting to use mp for distributed inference
WARNING 10-01 13:21:44 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 10-01 13:21:44 config.py:389] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-01 13:21:46 config.py:899] Defaulting to use mp for distributed inference
WARNING 10-01 13:21:46 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 10-01 13:21:46 config.py:389] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-01 13:21:46 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='llama-3-2-11b-vision-instruct', speculative_config=None, tokenizer='llama-3-2-11b-vision-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama-3-2-11b-vision-instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 10-01 13:21:46 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 10-01 13:21:47 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=2752869) WARNING 10-01 13:21:47 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:47 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b74bec7046e243f5a4c285d2c6cce1d9/VLLM_TRACE_FUNCTION_for_process_2752869_thread_139763790509888_at_2024-10-01_13:21:47.037250.log
(VllmWorkerProcess pid=2752870) WARNING 10-01 13:21:47 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:47 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b74bec7046e243f5a4c285d2c6cce1d9/VLLM_TRACE_FUNCTION_for_process_2752870_thread_139763790509888_at_2024-10-01_13:21:47.043849.log
WARNING 10-01 13:21:47 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 10-01 13:21:47 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b74bec7046e243f5a4c285d2c6cce1d9/VLLM_TRACE_FUNCTION_for_process_2752520_thread_139763790509888_at_2024-10-01_13:21:47.046676.log
(VllmWorkerProcess pid=2752871) WARNING 10-01 13:21:47 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:47 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b74bec7046e243f5a4c285d2c6cce1d9/VLLM_TRACE_FUNCTION_for_process_2752871_thread_139763790509888_at_2024-10-01_13:21:47.047104.log
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:47 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 10-01 13:21:47 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:47 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:47 selector.py:116] Using XFormers backend.
INFO 10-01 13:21:47 selector.py:116] Using XFormers backend.
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:47 selector.py:116] Using XFormers backend.
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:47 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:47 selector.py:116] Using XFormers backend.
/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2752870) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752870)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2752869) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752869)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2752871) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752871)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2752870) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752870)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=2752869) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752869)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=2752871) /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2752871)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:49 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:49 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:49 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
DEBUG 10-01 13:21:50 parallel_state.py:937] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:34617 backend=nccl
[W1001 13:21:50.261356939 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:34617 (errno: 97 - Address family not supported by protocol).
(VllmWorkerProcess pid=2752869) DEBUG 10-01 13:21:50 parallel_state.py:937] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:34617 backend=nccl
[W1001 13:21:50.365314920 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.xxx.com]:34617 (errno: 97 - Address family not supported by protocol).
(VllmWorkerProcess pid=2752871) DEBUG 10-01 13:21:50 parallel_state.py:937] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:34617 backend=nccl
(VllmWorkerProcess pid=2752870) DEBUG 10-01 13:21:50 parallel_state.py:937] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:34617 backend=nccl
[W1001 13:21:50.390135801 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:34617 (errno: 97 - Address family not supported by protocol).
[W1001 13:21:50.393598072 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.xxx.com]:34617 (errno: 97 - Address family not supported by protocol).
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:50 utils.py:992] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:50 utils.py:992] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2752871) INFO 10-01 13:21:50 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2752869) INFO 10-01 13:21:50 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 10-01 13:21:50 utils.py:992] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:50 utils.py:992] Found nccl from library libnccl.so.2
INFO 10-01 13:21:50 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2752870) INFO 10-01 13:21:50 pynccl.py:63] vLLM is using nccl==2.20.5
cuda0003:2752520:2752520 [0] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2752520:2752520 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2752520:2752520 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda12.4
cuda0003:2752871:2752871 [3] NCCL INFO cudaDriverVersion 12020
cuda0003:2752869:2752869 [1] NCCL INFO cudaDriverVersion 12020
cuda0003:2752870:2752870 [2] NCCL INFO cudaDriverVersion 12020
cuda0003:2752871:2752871 [3] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2752869:2752869 [1] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2752870:2752870 [2] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2752871:2752871 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2752869:2752869 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2752870:2752870 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2752869:2752869 [1] NCCL INFO NET/IB : No device found.
cuda0003:2752869:2752869 [1] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2752869:2752869 [1] NCCL INFO Using non-device net plugin version 0
cuda0003:2752869:2752869 [1] NCCL INFO Using network Socket
cuda0003:2752871:2752871 [3] NCCL INFO NET/IB : No device found.
cuda0003:2752871:2752871 [3] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2752871:2752871 [3] NCCL INFO Using non-device net plugin version 0
cuda0003:2752871:2752871 [3] NCCL INFO Using network Socket
cuda0003:2752520:2752520 [0] NCCL INFO NET/IB : No device found.
cuda0003:2752520:2752520 [0] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2752520:2752520 [0] NCCL INFO Using non-device net plugin version 0
cuda0003:2752520:2752520 [0] NCCL INFO Using network Socket
cuda0003:2752870:2752870 [2] NCCL INFO NET/IB : No device found.
cuda0003:2752870:2752870 [2] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2752870:2752870 [2] NCCL INFO Using non-device net plugin version 0
cuda0003:2752870:2752870 [2] NCCL INFO Using network Socket
cuda0003:2752870:2752870 [2] NCCL INFO comm 0x5599765d84d0 rank 2 nranks 4 cudaDev 2 nvmlDev 4 busId 81000 commId 0xd493450290593fa - Init START
cuda0003:2752869:2752869 [1] NCCL INFO comm 0x5599765a7070 rank 1 nranks 4 cudaDev 1 nvmlDev 3 busId 61000 commId 0xd493450290593fa - Init START
cuda0003:2752520:2752520 [0] NCCL INFO comm 0x5599765a56d0 rank 0 nranks 4 cudaDev 0 nvmlDev 2 busId 41000 commId 0xd493450290593fa - Init START
cuda0003:2752871:2752871 [3] NCCL INFO comm 0x5599765dd3b0 rank 3 nranks 4 cudaDev 3 nvmlDev 5 busId a1000 commId 0xd493450290593fa - Init START
cuda0003:2752870:2752870 [2] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
cuda0003:2752869:2752869 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
cuda0003:2752871:2752871 [3] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
cuda0003:2752520:2752520 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
cuda0003:2752870:2752870 [2] NCCL INFO Setting affinity for GPU 4 to ffff0000,ffff0000
cuda0003:2752869:2752869 [1] NCCL INFO Setting affinity for GPU 3 to ffff,0000ffff
cuda0003:2752869:2752869 [1] NCCL INFO NVLS multicast support is not available on dev 1
cuda0003:2752871:2752871 [3] NCCL INFO Setting affinity for GPU 5 to ffff0000,ffff0000
cuda0003:2752520:2752520 [0] NCCL INFO Setting affinity for GPU 2 to ffff,0000ffff
cuda0003:2752520:2752520 [0] NCCL INFO NVLS multicast support is not available on dev 0
cuda0003:2752871:2752871 [3] NCCL INFO NVLS multicast support is not available on dev 3
cuda0003:2752870:2752870 [2] NCCL INFO NVLS multicast support is not available on dev 2
cuda0003:2752869:2752869 [1] NCCL INFO comm 0x5599765a7070 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
cuda0003:2752870:2752870 [2] NCCL INFO comm 0x5599765d84d0 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
cuda0003:2752869:2752869 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
cuda0003:2752520:2752520 [0] NCCL INFO comm 0x5599765a56d0 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
cuda0003:2752869:2752869 [1] NCCL INFO P2P Chunksize set to 524288
cuda0003:2752871:2752871 [3] NCCL INFO comm 0x5599765dd3b0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
cuda0003:2752520:2752520 [0] NCCL INFO Channel 00/02 :    0   1   2   3
cuda0003:2752870:2752870 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
cuda0003:2752870:2752870 [2] NCCL INFO P2P Chunksize set to 524288
cuda0003:2752520:2752520 [0] NCCL INFO Channel 01/02 :    0   1   2   3
cuda0003:2752871:2752871 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
cuda0003:2752520:2752520 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
cuda0003:2752871:2752871 [3] NCCL INFO P2P Chunksize set to 524288
cuda0003:2752520:2752520 [0] NCCL INFO P2P Chunksize set to 524288
cuda0003:2752869:2752869 [1] NCCL INFO Channel 00/0 : 1[3] -> 2[4] via P2P/IPC
cuda0003:2752520:2752520 [0] NCCL INFO Channel 00/0 : 0[2] -> 1[3] via P2P/IPC
cuda0003:2752870:2752870 [2] NCCL INFO Channel 00/0 : 2[4] -> 3[5] via P2P/IPC
cuda0003:2752871:2752871 [3] NCCL INFO Channel 00/0 : 3[5] -> 0[2] via P2P/IPC
cuda0003:2752869:2752869 [1] NCCL INFO Channel 01/0 : 1[3] -> 2[4] via P2P/IPC
cuda0003:2752520:2752520 [0] NCCL INFO Channel 01/0 : 0[2] -> 1[3] via P2P/IPC
cuda0003:2752870:2752870 [2] NCCL INFO Channel 01/0 : 2[4] -> 3[5] via P2P/IPC
cuda0003:2752871:2752871 [3] NCCL INFO Channel 01/0 : 3[5] -> 0[2] via P2P/IPC
cuda0003:2752869:2752869 [1] NCCL INFO Connected all rings
cuda0003:2752520:2752520 [0] NCCL INFO Connected all rings
cuda0003:2752870:2752870 [2] NCCL INFO Connected all rings
cuda0003:2752871:2752871 [3] NCCL INFO Connected all rings
cuda0003:2752871:2752871 [3] NCCL INFO Channel 00/0 : 3[5] -> 2[4] via P2P/IPC
cuda0003:2752869:2752869 [1] NCCL INFO Channel 00/0 : 1[3] -> 0[2] via P2P/IPC
cuda0003:2752871:2752871 [3] NCCL INFO Channel 01/0 : 3[5] -> 2[4] via P2P/IPC
cuda0003:2752869:2752869 [1] NCCL INFO Channel 01/0 : 1[3] -> 0[2] via P2P/IPC
cuda0003:2752870:2752870 [2] NCCL INFO Channel 00/0 : 2[4] -> 1[3] via P2P/IPC
cuda0003:2752870:2752870 [2] NCCL INFO Channel 01/0 : 2[4] -> 1[3] via P2P/IPC
cuda0003:2752520:2752520 [0] NCCL INFO Connected all trees
cuda0003:2752520:2752520 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2752520:2752520 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2752869:2752869 [1] NCCL INFO Connected all trees
cuda0003:2752869:2752869 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2752869:2752869 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2752871:2752871 [3] NCCL INFO Connected all trees
cuda0003:2752871:2752871 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2752870:2752870 [2] NCCL INFO Connected all trees
cuda0003:2752871:2752871 [3] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2752870:2752870 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2752870:2752870 [2] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2752870:2752870 [2] NCCL INFO comm 0x5599765d84d0 rank 2 nranks 4 cudaDev 2 nvmlDev 4 busId 81000 commId 0xd493450290593fa - Init COMPLETE
cuda0003:2752520:2752520 [0] NCCL INFO comm 0x5599765a56d0 rank 0 nranks 4 cudaDev 0 nvmlDev 2 busId 41000 commId 0xd493450290593fa - Init COMPLETE
cuda0003:2752871:2752871 [3] NCCL INFO comm 0x5599765dd3b0 rank 3 nranks 4 cudaDev 3 nvmlDev 5 busId a1000 commId 0xd493450290593fa - Init COMPLETE
cuda0003:2752869:2752869 [1] NCCL INFO comm 0x5599765a7070 rank 1 nranks 4 cudaDev 1 nvmlDev 3 busId 61000 commId 0xd493450290593fa - Init COMPLETE
DEBUG 10-01 13:21:54 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-01 13:22:04 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-01 13:22:14 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-01 13:22:24 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-01 13:22:34 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-01 13:22:44 client.py:164] Waiting for output from MQLLMEngine.
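
For reference, test_vllm.py is essentially the PyTorch/NCCL sanity check from the troubleshooting guide. A reconstruction based on that guide and on the traceback further below is shown here; the actual file may contain additional vLLM-specific checks:

```python
# Reconstruction of the NCCL sanity check from the vLLM troubleshooting guide.
# Run with: torchrun --nproc-per-node <N> test_vllm.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# All-reduce a small tensor: with N ranks, each element sums to N,
# so the mean should equal the world size.
data = torch.FloatTensor([1.0] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"
print("PyTorch NCCL sanity check passed")
```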

And running the test script (from the troubleshooting guide) generates the following output:

(vllm) user@cuda0003:/export/home/scratch/user/dev/huggingface$ CUDA_VISIBLE_DEVICES=2,3,4,5 NCCL_DEBUG=TRACE torchrun --nproc-per-node 4 test_vllm.py 
W1001 15:56:02.450000 140287812335424 torch/distributed/run.py:779] 
W1001 15:56:02.450000 140287812335424 torch/distributed/run.py:779] *****************************************
W1001 15:56:02.450000 140287812335424 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1001 15:56:02.450000 140287812335424 torch/distributed/run.py:779] *****************************************
[W1001 15:56:02.572902975 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.xxx.com]:29500 (errno: 97 - Address family not supported by protocol).
[W1001 15:56:03.655999094 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W1001 15:56:03.664883302 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.xxx.com]:29500 (errno: 97 - Address family not supported by protocol).
[W1001 15:56:03.699935605 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W1001 15:56:03.701742213 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.xxx.com]:29500 (errno: 97 - Address family not supported by protocol).
cuda0003:2949601:2949601 [0] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2949601:2949601 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2949601:2949601 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda12.4
cuda0003:2949602:2949602 [1] NCCL INFO cudaDriverVersion 12020
cuda0003:2949603:2949603 [2] NCCL INFO cudaDriverVersion 12020
cuda0003:2949602:2949602 [1] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2949603:2949603 [2] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2949602:2949602 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2949603:2949603 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2949604:2949604 [3] NCCL INFO cudaDriverVersion 12020
cuda0003:2949604:2949604 [3] NCCL INFO Bootstrap : Using enp35s0f0:10.107.72.224<0>
cuda0003:2949604:2949604 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
cuda0003:2949601:2949642 [0] NCCL INFO NET/IB : No device found.
cuda0003:2949601:2949642 [0] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2949601:2949642 [0] NCCL INFO Using non-device net plugin version 0
cuda0003:2949601:2949642 [0] NCCL INFO Using network Socket
cuda0003:2949602:2949643 [1] NCCL INFO NET/IB : No device found.
cuda0003:2949602:2949643 [1] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2949602:2949643 [1] NCCL INFO Using non-device net plugin version 0
cuda0003:2949602:2949643 [1] NCCL INFO Using network Socket
cuda0003:2949603:2949644 [2] NCCL INFO NET/IB : No device found.
cuda0003:2949603:2949644 [2] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2949603:2949644 [2] NCCL INFO Using non-device net plugin version 0
cuda0003:2949603:2949644 [2] NCCL INFO Using network Socket
cuda0003:2949604:2949645 [3] NCCL INFO NET/IB : No device found.
cuda0003:2949604:2949645 [3] NCCL INFO NET/Socket : Using [0]enp35s0f0:10.107.72.224<0>
cuda0003:2949604:2949645 [3] NCCL INFO Using non-device net plugin version 0
cuda0003:2949604:2949645 [3] NCCL INFO Using network Socket
cuda0003:2949602:2949643 [1] NCCL INFO comm 0x5649b30cf960 rank 1 nranks 4 cudaDev 1 nvmlDev 3 busId 61000 commId 0x795520212f433144 - Init START
cuda0003:2949601:2949642 [0] NCCL INFO comm 0x55fb6f3ce680 rank 0 nranks 4 cudaDev 0 nvmlDev 2 busId 41000 commId 0x795520212f433144 - Init START
cuda0003:2949603:2949644 [2] NCCL INFO comm 0x55703115d1f0 rank 2 nranks 4 cudaDev 2 nvmlDev 4 busId 81000 commId 0x795520212f433144 - Init START
cuda0003:2949604:2949645 [3] NCCL INFO comm 0x55991793c100 rank 3 nranks 4 cudaDev 3 nvmlDev 5 busId a1000 commId 0x795520212f433144 - Init START
cuda0003:2949603:2949644 [2] NCCL INFO Setting affinity for GPU 4 to ffff0000,ffff0000
cuda0003:2949603:2949644 [2] NCCL INFO NVLS multicast support is not available on dev 2
cuda0003:2949604:2949645 [3] NCCL INFO Setting affinity for GPU 5 to ffff0000,ffff0000
cuda0003:2949604:2949645 [3] NCCL INFO NVLS multicast support is not available on dev 3
cuda0003:2949601:2949642 [0] NCCL INFO Setting affinity for GPU 2 to ffff,0000ffff
cuda0003:2949601:2949642 [0] NCCL INFO NVLS multicast support is not available on dev 0
cuda0003:2949602:2949643 [1] NCCL INFO Setting affinity for GPU 3 to ffff,0000ffff
cuda0003:2949602:2949643 [1] NCCL INFO NVLS multicast support is not available on dev 1
cuda0003:2949601:2949642 [0] NCCL INFO comm 0x55fb6f3ce680 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
cuda0003:2949601:2949642 [0] NCCL INFO Channel 00/02 :    0   1   2   3
cuda0003:2949601:2949642 [0] NCCL INFO Channel 01/02 :    0   1   2   3
cuda0003:2949604:2949645 [3] NCCL INFO comm 0x55991793c100 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
cuda0003:2949601:2949642 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
cuda0003:2949602:2949643 [1] NCCL INFO comm 0x5649b30cf960 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
cuda0003:2949603:2949644 [2] NCCL INFO comm 0x55703115d1f0 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
cuda0003:2949601:2949642 [0] NCCL INFO P2P Chunksize set to 524288
cuda0003:2949604:2949645 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
cuda0003:2949602:2949643 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
cuda0003:2949603:2949644 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
cuda0003:2949604:2949645 [3] NCCL INFO P2P Chunksize set to 524288
cuda0003:2949602:2949643 [1] NCCL INFO P2P Chunksize set to 524288
cuda0003:2949603:2949644 [2] NCCL INFO P2P Chunksize set to 524288
cuda0003:2949602:2949643 [1] NCCL INFO Channel 00/0 : 1[3] -> 2[4] via P2P/CUMEM
cuda0003:2949604:2949645 [3] NCCL INFO Channel 00/0 : 3[5] -> 0[2] via P2P/CUMEM
cuda0003:2949601:2949642 [0] NCCL INFO Channel 00/0 : 0[2] -> 1[3] via P2P/CUMEM
cuda0003:2949602:2949643 [1] NCCL INFO Channel 01/0 : 1[3] -> 2[4] via P2P/CUMEM
cuda0003:2949604:2949645 [3] NCCL INFO Channel 01/0 : 3[5] -> 0[2] via P2P/CUMEM
cuda0003:2949601:2949642 [0] NCCL INFO Channel 01/0 : 0[2] -> 1[3] via P2P/CUMEM
cuda0003:2949603:2949644 [2] NCCL INFO Channel 00/0 : 2[4] -> 3[5] via P2P/CUMEM
cuda0003:2949603:2949644 [2] NCCL INFO Channel 01/0 : 2[4] -> 3[5] via P2P/CUMEM
cuda0003:2949604:2949645 [3] NCCL INFO Connected all rings
cuda0003:2949604:2949645 [3] NCCL INFO Channel 00/0 : 3[5] -> 2[4] via P2P/CUMEM
cuda0003:2949604:2949645 [3] NCCL INFO Channel 01/0 : 3[5] -> 2[4] via P2P/CUMEM
cuda0003:2949603:2949644 [2] NCCL INFO Connected all rings
cuda0003:2949601:2949642 [0] NCCL INFO Connected all rings
cuda0003:2949603:2949644 [2] NCCL INFO Channel 00/0 : 2[4] -> 1[3] via P2P/CUMEM
cuda0003:2949603:2949644 [2] NCCL INFO Channel 01/0 : 2[4] -> 1[3] via P2P/CUMEM
cuda0003:2949602:2949643 [1] NCCL INFO Connected all rings
cuda0003:2949602:2949643 [1] NCCL INFO Channel 00/0 : 1[3] -> 0[2] via P2P/CUMEM
cuda0003:2949602:2949643 [1] NCCL INFO Channel 01/0 : 1[3] -> 0[2] via P2P/CUMEM
cuda0003:2949604:2949645 [3] NCCL INFO Connected all trees
cuda0003:2949604:2949645 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2949604:2949645 [3] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2949601:2949642 [0] NCCL INFO Connected all trees
cuda0003:2949603:2949644 [2] NCCL INFO Connected all trees
cuda0003:2949601:2949642 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2949601:2949642 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2949603:2949644 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2949602:2949643 [1] NCCL INFO Connected all trees
cuda0003:2949603:2949644 [2] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2949602:2949643 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
cuda0003:2949602:2949643 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
cuda0003:2949603:2949644 [2] NCCL INFO comm 0x55703115d1f0 rank 2 nranks 4 cudaDev 2 nvmlDev 4 busId 81000 commId 0x795520212f433144 - Init COMPLETE
cuda0003:2949602:2949643 [1] NCCL INFO comm 0x5649b30cf960 rank 1 nranks 4 cudaDev 1 nvmlDev 3 busId 61000 commId 0x795520212f433144 - Init COMPLETE
cuda0003:2949604:2949645 [3] NCCL INFO comm 0x55991793c100 rank 3 nranks 4 cudaDev 3 nvmlDev 5 busId a1000 commId 0x795520212f433144 - Init COMPLETE
cuda0003:2949601:2949642 [0] NCCL INFO comm 0x55fb6f3ce680 rank 0 nranks 4 cudaDev 0 nvmlDev 2 busId 41000 commId 0x795520212f433144 - Init COMPLETE
[rank2]:[E1001 16:06:04.537131842 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
[rank2]:[E1001 16:06:04.537356909 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E1001 16:06:04.539355394 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600012 milliseconds before timing out.
[rank0]:[E1001 16:06:04.539557761 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1001 16:06:04.547113185 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600019 milliseconds before timing out.
[rank1]:[E1001 16:06:04.547325453 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1001 16:06:04.556502147 ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600028 milliseconds before timing out.
[rank3]:[E1001 16:06:04.556720024 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]: Traceback (most recent call last):
[rank2]:   File "/export/home/scratch/user/dev/huggingface/test_vllm.py", line 12, in <module>
[rank2]:     assert value == world_size, f"Expected {world_size}, got {value}"
[rank2]: AssertionError: Expected 4, got 0.0
cuda0003:2949601:2949647 [0] NCCL INFO [Service thread] Connection closed by localRank 0
[rank3]: Traceback (most recent call last):
[rank3]:   File "/export/home/scratch/user/dev/huggingface/test_vllm.py", line 12, in <module>
[rank3]:     assert value == world_size, f"Expected {world_size}, got {value}"
[rank3]: AssertionError: Expected 4, got 0.0
[rank1]: Traceback (most recent call last):
[rank1]:   File "/export/home/scratch/user/dev/huggingface/test_vllm.py", line 12, in <module>
[rank1]:     assert value == world_size, f"Expected {world_size}, got {value}"
[rank1]: AssertionError: Expected 4, got 0.0
[rank0]: Traceback (most recent call last):
[rank0]:   File "/export/home/scratch/user/dev/huggingface/test_vllm.py", line 12, in <module>
[rank0]:     assert value == world_size, f"Expected {world_size}, got {value}"
[rank0]: AssertionError: Expected 4, got 1.0
cuda0003:2949601:2949631 [0] NCCL INFO comm 0x55fb6f3ce680 rank 0 nranks 4 cudaDev 0 busId 41000 - Abort COMPLETE
[rank0]:[E1001 16:06:04.787342035 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E1001 16:06:04.787360155 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1001 16:06:04.787367634 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1001 16:06:04.788294613 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600012 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6155dcff86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f61570cc8d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f61570d3313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f61570d56fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f61a487fdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f61a5be9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f61a59aa353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600012 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6155dcff86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f61570cc8d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f61570d3313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f61570d56fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f61a487fdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f61a5be9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f61a59aa353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6155dcff86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f6156d5ea84 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f61a487fdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f61a5be9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f61a59aa353 in /lib/x86_64-linux-gnu/libc.so.6)

cuda0003:2949603:2949650 [2] NCCL INFO [Service thread] Connection closed by localRank 2
cuda0003:2949603:2949625 [0] NCCL INFO comm 0x55703115d1f0 rank 2 nranks 4 cudaDev 2 busId 81000 - Abort COMPLETE
[rank2]:[E1001 16:06:04.857583158 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1001 16:06:04.857606318 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1001 16:06:04.857614068 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1001 16:06:04.858765584 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb77fbfcf86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb780ef98d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fb780f00313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb780f026fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7fb7ce6acdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fb7cfa16609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fb7cf7d7353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb77fbfcf86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb780ef98d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fb780f00313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb780f026fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7fb7ce6acdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fb7cfa16609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fb7cf7d7353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb77fbfcf86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7fb780b8ba84 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7fb7ce6acdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7fb7cfa16609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fb7cf7d7353 in /lib/x86_64-linux-gnu/libc.so.6)

cuda0003:2949604:2949648 [3] NCCL INFO [Service thread] Connection closed by localRank 3
cuda0003:2949602:2949649 [1] NCCL INFO [Service thread] Connection closed by localRank 1
cuda0003:2949604:2949623 [0] NCCL INFO comm 0x55991793c100 rank 3 nranks 4 cudaDev 3 busId a1000 - Abort COMPLETE
cuda0003:2949602:2949629 [0] NCCL INFO comm 0x5649b30cf960 rank 1 nranks 4 cudaDev 1 busId 61000 - Abort COMPLETE
[rank3]:[E1001 16:06:04.889919061 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1001 16:06:04.889949961 ProcessGroupNCCL.cpp:621] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1001 16:06:04.889963761 ProcessGroupNCCL.cpp:627] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E1001 16:06:04.891761908 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600028 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f41ced77f86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f41d00748d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f41d007b313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f41d007d6fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f421d827df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f421eb91609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f421e952353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
[rank1]:[E1001 16:06:04.892371650 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1001 16:06:04.892395890 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1001 16:06:04.892420760 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
  what():  [PG 0 (default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600028 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f41ced77f86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f41d00748d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f41d007b313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f41d007d6fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f421d827df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f421eb91609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f421e952353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f41ced77f86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f41cfd06a84 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f421d827df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f421eb91609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f421e952353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E1001 16:06:04.893556365 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600019 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff248c6af86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7ff249f678d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ff249f6e313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff249f706fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7ff29771adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7ff298a88609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7ff298849353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=128, NumelOut=128, Timeout(ms)=600000) ran for 600019 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff248c6af86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7ff249f678d2 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ff249f6e313 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff249f706fc in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7ff29771adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7ff298a88609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7ff298849353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff248c6af86 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7ff249bf9a84 in /export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7ff29771adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7ff298a88609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7ff298849353 in /lib/x86_64-linux-gnu/libc.so.6)

W1001 16:06:04.960000 140287812335424 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2949602 closing signal SIGTERM
W1001 16:06:04.960000 140287812335424 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2949603 closing signal SIGTERM
W1001 16:06:04.960000 140287812335424 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2949604 closing signal SIGTERM
E1001 16:06:05.125000 140287812335424 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 2949601) of binary: /export/home/cuda00032/user/miniforge3/envs/vllm/bin/python3.11
Traceback (most recent call last):
  File "/export/home/cuda00032/user/miniforge3/envs/vllm/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/cuda00032/user/miniforge3/envs/vllm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
test_vllm.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-01_16:06:04
  host      : Cuda0003.XX.XXX.COM
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 2949601)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2949601
========================================================
DarkLight1337 commented 1 month ago

Can you show the full output of collect_env.py? It should show the vLLM version and your GPU topology towards the end. It seems there is something wrong with the latter.

rchen19 commented 1 month ago

Can you show the full output of collect_env.py? It should show the vLLM version and your GPU topology towards the end. It seems there is something wrong with the latter.

You are right; I didn't realize the output was truncated. The full output is below. It reports "vLLM Version: 0.6.1.dev238+ge2c6e0a82", yet pip list | grep vllm gives vllm 0.6.2.

Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-169-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
GPU 3: NVIDIA RTX A6000
GPU 4: NVIDIA RTX A6000
GPU 5: NVIDIA RTX A6000
GPU 6: NVIDIA RTX A6000
GPU 7: NVIDIA RTX A6000

Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      48 bits physical, 48 bits virtual
CPU(s):                             64
On-line CPU(s) list:                0-63
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          2
NUMA node(s):                       2
Vendor ID:                          AuthenticAMD
CPU family:                         25
Model:                              1
Model name:                         AMD EPYC 7313 16-Core Processor
Stepping:                           1
Frequency boost:                    enabled
CPU MHz:                            3355.300
CPU max MHz:                        3000.0000
CPU min MHz:                        1500.0000
BogoMIPS:                           6000.12
Virtualization:                     AMD-V
L1d cache:                          1 MiB
L1i cache:                          1 MiB
L2 cache:                           16 MiB
L3 cache:                           256 MiB
NUMA node0 CPU(s):                  0-15,32-47
NUMA node1 CPU(s):                  16-31,48-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.1
[pip3] triton==3.0.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
[conda] nvidia-ml-py              12.560.30                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.6.68                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torch                     2.4.0                    pypi_0    pypi
[conda] torchvision               0.19.0                   pypi_0    pypi
[conda] transformers              4.45.1                   pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.1.dev238+ge2c6e0a82
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     SYS     SYS     SYS     SYS     SYS     SYS     0-15,32-47      0               N/A
GPU1    NV4      X      SYS     SYS     SYS     SYS     SYS     SYS     0-15,32-47      0               N/A
GPU2    SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     0-15,32-47      0               N/A
GPU3    SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     0-15,32-47      0               N/A
GPU4    SYS     SYS     SYS     SYS      X      NV4     SYS     SYS     16-31,48-63             N/A             N/A
GPU5    SYS     SYS     SYS     SYS     NV4      X      SYS     SYS     16-31,48-63             N/A             N/A
GPU6    SYS     SYS     SYS     SYS     SYS     SYS      X      NV4     16-31,48-63             N/A             N/A
GPU7    SYS     SYS     SYS     SYS     SYS     SYS     NV4      X      16-31,48-63             N/A             N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
DarkLight1337 commented 1 month ago

@youkaichao can you take a look at this?

youkaichao commented 1 month ago

AssertionError: Expected 4, got 1.0

It looks like some GPUs are broken: the all-reduce correctness test fails. Can you try the remaining GPUs, or some other machines?
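
For reference, that assertion is the kind of failure a minimal NCCL all-reduce correctness check produces. The sketch below is consistent with the numbers in the logs above (128 elements, 4 ranks); allreduce_check.py is a placeholder name, and the actual test_vllm.py may differ.

# Minimal NCCL all-reduce sanity check (a sketch, not necessarily the exact test_vllm.py).
# Launch with, e.g.: NCCL_DEBUG=TRACE torchrun --nproc-per-node=4 allreduce_check.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# 128 elements matches NumelIn/NumelOut=128 in the watchdog logs above.
data = torch.ones(128, device="cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()

world_size = dist.get_world_size()
value = data.mean().item()
# With 4 healthy ranks, summing ones gives 4.0 on every rank; a value of 1.0
# means the reduction never actually crossed the GPUs.
assert value == world_size, f"Expected {world_size}, got {value}"
dist.destroy_process_group()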

rchen19 commented 1 month ago

AssertionError: Expected 4, got 1.0

I am fairly certain there are no hardware issues with these GPUs; we have been running jobs on them constantly, and we had no problem serving minicpm-llama3-v-2-5 with vLLM on one of them.

I just tried running the test script on each GPU individually with CUDA_VISIBLE_DEVICES=4 NCCL_DEBUG=TRACE torchrun --nproc-per-node=1 test_vllm.py (and likewise for each of the other GPUs), and I get the output below. It looks like an AttributeError: 'PyNcclCommunicator' object has no attribute 'device' caused the subprocess to fail. That turns out to be expected behavior: with a single GPU, the init method of PyNcclCommunicator returns before the device attribute is set.
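
To illustrate that pattern (a condensed, hypothetical sketch; not a verbatim copy of vLLM's PyNcclCommunicator):

import torch

class PyNcclCommunicatorSketch:
    """Hypothetical stand-in for the early-return behavior described above."""
    def __init__(self, world_size: int, device_index: int):
        self.world_size = world_size
        if self.world_size == 1:
            # Single-rank case: NCCL is unnecessary, so __init__ returns
            # before self.device is ever assigned.
            self.disabled = True
            return
        # Only the multi-rank path reaches the point where device is set.
        self.disabled = False
        self.device = torch.device(f"cuda:{device_index}")

# With one rank, accessing .device raises AttributeError (the message seen
# with --nproc-per-node=1); this is harmless and expected.
comm = PyNcclCommunicatorSketch(world_size=1, device_index=0)
assert not hasattr(comm, "device")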

It is still unclear to me whether there is actually something wrong with communication between the GPUs: serving a model on a single GPU with vLLM works fine, and I have also served a LLaVA model (llava-onevision-qwen2-7b-si) with SGLang across multiple GPUs without a problem.

youkaichao commented 1 month ago

CUDA_VISIBLE_DEVICES=4 NCCL_DEBUG=TRACE torchrun --nproc-per-node=1 test_vllm.py

The script needs to be run with multiple GPUs; with --nproc-per-node=1 there is no inter-GPU communication to test.