vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: vLLM 0.5.1 tensor parallel 2 stuck with Mixtral-8x7B-Instruct-v0.1 #6201

Closed · zhyncs closed this issue 2 months ago

zhyncs commented 2 months ago

Your current environment

PyTorch version: 2.3.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11)
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.17

Python version: 3.9.16 (main, Aug 15 2023, 19:38:56)  [GCC 8.3.1 20190311 (Red Hat 8.3.1-3)] (64-bit runtime)
Python platform: Linux-4.18.0-147.mt20200626.413.el8_1.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB

Nvidia driver version: 470.103.01
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.9.2
/usr/lib64/libcudnn_adv_infer.so.8.9.2
/usr/lib64/libcudnn_adv_train.so.8.9.2
/usr/lib64/libcudnn_cnn_infer.so.8.9.2
/usr/lib64/libcudnn_cnn_train.so.8.9.2
/usr/lib64/libcudnn_ops_infer.so.8.9.2
/usr/lib64/libcudnn_ops_train.so.8.9.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          192
On-line CPU(s) list:             0-7
Off-line CPU(s) list:            8-191
Thread(s) per core:              0
Core(s) per socket:              48
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7642 48-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         3287.557
CPU max MHz:                     2300.0000
CPU min MHz:                     1500.0000
BogoMIPS:                        4571.13
Virtualization:                  AMD-V
L1d cache:                       1.5 MiB
L1i cache:                       1.5 MiB
L2 cache:                        24 MiB
L3 cache:                        256 MiB
NUMA node0 CPU(s):               0-47,96-143
NUMA node1 CPU(s):               48-95,144-191
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

Versions of relevant libraries:
[pip3] flake8==7.1.0
[pip3] mt-tritonclient==1.0.4
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.26.3
[pip3] nvidia-nccl-cu11==2.20.5
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] onnx==1.12.0
[pip3] onnx-graphsurgeon==0.3.12
[pip3] onnxruntime==1.15.1
[pip3] scalellm==0.1.5+cu118torch2.3
[pip3] torch==2.3.0+cu118
[pip3] torchvision==0.18.1
[pip3] transformers==4.42.3
[pip3] transformers-stream-generator==0.0.4
[pip3] triton==2.3.0
[pip3] tritonclient==2.42.0
[pip3] vllm-nccl-cu11==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: ; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    mlx5_0  mlx5_1  mlx5_2  mlx5_3  mlx5_4  mlx5_5  mlx5_6  mlx5_7  CPU Affinity    NUMA Affinity
GPU0     X  NV12    SYS SYS SYS SYS NODE    NODE    PXB PXB 48-95,144-191   1
GPU1    NV12     X  SYS SYS SYS SYS NODE    NODE    PXB PXB 48-95,144-191   1
mlx5_0  SYS SYS  X  PIX NODE    NODE    SYS SYS SYS SYS
mlx5_1  SYS SYS PIX  X  NODE    NODE    SYS SYS SYS SYS
mlx5_2  SYS SYS NODE    NODE     X  PIX SYS SYS SYS SYS
mlx5_3  SYS SYS NODE    NODE    PIX  X  SYS SYS SYS SYS
mlx5_4  NODE    NODE    SYS SYS SYS SYS  X  PIX NODE    NODE
mlx5_5  NODE    NODE    SYS SYS SYS SYS PIX  X  NODE    NODE
mlx5_6  PXB PXB SYS SYS SYS SYS NODE    NODE     X  PIX
mlx5_7  PXB PXB SYS SYS SYS SYS NODE    NODE    PIX  X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

ref https://github.com/vllm-project/vllm/issues/6187#issuecomment-2212874496

python3 -m vllm.entrypoints.openai.api_server --model /workdir/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2
INFO 07-08 10:51:48 api_server.py:206] vLLM API server version 0.5.1
INFO 07-08 10:51:48 api_server.py:207] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/workdir/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-08 10:51:48 config.py:698] Defaulting to use mp for distributed inference
INFO 07-08 10:51:48 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/workdir/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='/workdir/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/workdir/Mixtral-8x7B-Instruct-v0.1, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-08 10:51:49 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
INFO 07-08 10:51:49 selector.py:53] Using XFormers backend.
(VllmWorkerProcess pid=96623) INFO 07-08 10:51:49 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
(VllmWorkerProcess pid=96623) INFO 07-08 10:51:49 selector.py:53] Using XFormers backend.
(VllmWorkerProcess pid=96623) INFO 07-08 10:51:50 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 07-08 10:51:51 utils.py:741] Found nccl from library libnccl.so.2
INFO 07-08 10:51:51 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=96623) INFO 07-08 10:51:51 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=96623) INFO 07-08 10:51:51 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-08 10:51:51 custom_all_reduce_utils.py:202] generating GPU P2P access cache in /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json

Startup hangs at this point, while generating the GPU P2P access cache. cc @youkaichao
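For reference, a minimal sketch to probe raw CUDA peer access between the two GPUs outside of vLLM (assuming only PyTorch with CUDA and both GPUs visible):

```python
# Sketch: check raw CUDA P2P capability between GPU 0 and GPU 1,
# independent of vLLM's P2P access cache generation.
import torch

assert torch.cuda.device_count() >= 2, "need two visible GPUs"
print("0 -> 1 peer access:", torch.cuda.can_device_access_peer(0, 1))
print("1 -> 0 peer access:", torch.cuda.can_device_access_peer(1, 0))
```

If either direction reports False (or the call itself hangs), the problem sits below vLLM, in the driver/P2P layer.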

youkaichao commented 2 months ago

Can you follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to provide more information?
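A minimal two-GPU NCCL sanity check in the spirit of that page (a sketch, assuming PyTorch with CUDA and two local GPUs; if this hangs, the problem is in NCCL/driver/P2P rather than in vLLM itself):

```python
# Sketch: standalone two-GPU all_reduce over NCCL via torch.distributed.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    x = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(x)  # should complete almost instantly with 2 GPUs
    torch.cuda.synchronize()
    print(f"rank {rank}: all_reduce OK, value = {x.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```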

zhyncs commented 2 months ago

ok

zhyncs commented 2 months ago

export VLLM_LOGGING_LEVEL=DEBUG
export CUDA_LAUNCH_BLOCKING=1
export NCCL_DEBUG=TRACE
export VLLM_TRACE_FUNCTION=1
INFO 07-08 11:49:48 api_server.py:206] vLLM API server version 0.5.1
INFO 07-08 11:49:48 api_server.py:207] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/workdir/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-08 11:49:48 config.py:698] Defaulting to use mp for distributed inference
INFO 07-08 11:49:48 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/workdir/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='/workdir/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/workdir/Mixtral-8x7B-Instruct-v0.1, use_v2_block_manager=False, enable_prefix_caching=False)
(VllmWorkerProcess pid=105174) WARNING 07-08 11:49:48 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
WARNING 07-08 11:49:48 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 07-08 11:49:48 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-5545b375c37d4ea29b6b862c47cfd1ad/VLLM_TRACE_FUNCTION_for_process_105155_thread_140020318103488_at_2024-07-08_11:49:48.812089.log
(VllmWorkerProcess pid=105174) INFO 07-08 11:49:48 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-5545b375c37d4ea29b6b862c47cfd1ad/VLLM_TRACE_FUNCTION_for_process_105174_thread_140020318103488_at_2024-07-08_11:49:48.812099.log
INFO 07-08 11:49:49 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
INFO 07-08 11:49:49 selector.py:53] Using XFormers backend.
(VllmWorkerProcess pid=105174) INFO 07-08 11:49:49 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
(VllmWorkerProcess pid=105174) INFO 07-08 11:49:49 selector.py:53] Using XFormers backend.
(VllmWorkerProcess pid=105174) INFO 07-08 11:50:02 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
DEBUG 07-08 11:50:02 parallel_state.py:799] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:46887 backend=nccl
(VllmWorkerProcess pid=105174) DEBUG 07-08 11:50:02 parallel_state.py:799] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:46887 backend=nccl
(VllmWorkerProcess pid=105174) INFO 07-08 11:50:02 utils.py:741] Found nccl from library libnccl.so.2
INFO 07-08 11:50:02 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=105174) INFO 07-08 11:50:02 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-08 11:50:02 pynccl.py:63] vLLM is using nccl==2.20.5
hostname:105155:105155 [0] NCCL INFO Bootstrap : Using eth0:10.164.53.65<0>
hostname:105155:105155 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hostname:105155:105155 [0] NCCL INFO cudaDriverVersion 11080
NCCL version 2.20.5+cuda11.0
hostname:105174:105174 [1] NCCL INFO cudaDriverVersion 11080
hostname:105174:105174 [1] NCCL INFO Bootstrap : Using eth0:10.164.53.65<0>
hostname:105174:105174 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hostname:105155:105155 [0] NCCL INFO NET/IB : Using [0]={[0] mlx5_7:1/RoCE, [1] mlx5_6:1/RoCE} ; OOB eth0:10.164.53.65<0>
hostname:105155:105155 [0] NCCL INFO Using non-device net plugin version 0
hostname:105155:105155 [0] NCCL INFO Using network IB
hostname:105174:105174 [1] NCCL INFO NET/IB : Using [0]={[0] mlx5_7:1/RoCE, [1] mlx5_6:1/RoCE} ; OOB eth0:10.164.53.65<0>
hostname:105174:105174 [1] NCCL INFO Using non-device net plugin version 0
hostname:105174:105174 [1] NCCL INFO Using network IB
hostname:105174:105174 [1] NCCL INFO comm 0xd3f6db0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d0000 commId 0xee65a98d1a9d5af3 - Init START
hostname:105155:105155 [0] NCCL INFO comm 0xd3f8480 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId cb000 commId 0xee65a98d1a9d5af3 - Init START
hostname:105174:105174 [1] NCCL INFO NCCL_IB_GID_INDEX set by environment to 7.
hostname:105155:105155 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 7.
hostname:105174:105174 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
hostname:105155:105155 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
hostname:105174:105174 [1] NCCL INFO Setting affinity for GPU 1 to 01e000,00000000,00000000,0001e000,00000000,00000000
hostname:105155:105155 [0] NCCL INFO Setting affinity for GPU 0 to 01e000,00000000,00000000,0001e000,00000000,00000000
hostname:105174:105174 [1] NCCL INFO comm 0xd3f6db0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
hostname:105155:105155 [0] NCCL INFO comm 0xd3f8480 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
hostname:105155:105155 [0] NCCL INFO Channel 00/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 01/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 02/24 :    0   1
hostname:105174:105174 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
hostname:105155:105155 [0] NCCL INFO Channel 03/24 :    0   1
hostname:105174:105174 [1] NCCL INFO P2P Chunksize set to 524288
hostname:105155:105155 [0] NCCL INFO Channel 04/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 05/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 06/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 07/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 08/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 09/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 10/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 11/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 12/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 13/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 14/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 15/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 16/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 17/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 18/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 19/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 20/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 21/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 22/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Channel 23/24 :    0   1
hostname:105155:105155 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1
hostname:105155:105155 [0] NCCL INFO P2P Chunksize set to 524288
hostname:105174:105174 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Connected all rings
hostname:105155:105155 [0] NCCL INFO Connected all trees
hostname:105174:105174 [1] NCCL INFO Connected all rings
hostname:105174:105174 [1] NCCL INFO Connected all trees
hostname:105174:105174 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
hostname:105174:105174 [1] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
hostname:105155:105155 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
hostname:105155:105155 [0] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
hostname:105155:105155 [0] NCCL INFO comm 0xd3f8480 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId cb000 commId 0xee65a98d1a9d5af3 - Init COMPLETE
hostname:105174:105174 [1] NCCL INFO comm 0xd3f6db0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d0000 commId 0xee65a98d1a9d5af3 - Init COMPLETE
INFO 07-08 11:50:03 custom_all_reduce_utils.py:202] generating GPU P2P access cache in /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json

Still stuck at the same point (generating the GPU P2P access cache). cc @youkaichao

youkaichao commented 2 months ago

One potential issue I can see is that your driver version is very old:

Nvidia driver version: 470.103.01

zhyncs commented 2 months ago

It's inconvenient to upgrade the host driver on the company's machines. For what it's worth, on the same machine with the same environment configuration, tensor parallelism works fine with LMDeploy.

youkaichao commented 2 months ago

Try adding `--disable-custom-all-reduce`? Your driver might be too old, and the custom all-reduce initialization check might fail.
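The same switch is also exposed in the offline Python API, e.g. (sketch):

```python
# Sketch: disable the custom all-reduce kernel so tensor-parallel
# communication goes through NCCL only.
from vllm import LLM

llm = LLM(
    model="/workdir/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,
)
```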

zhyncs commented 2 months ago

export VLLM_LOGGING_LEVEL=DEBUG
export CUDA_LAUNCH_BLOCKING=1
export NCCL_DEBUG=TRACE
export VLLM_TRACE_FUNCTION=1
python3 -m vllm.entrypoints.openai.api_server --model /workdir/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --disable-custom-all-reduce
INFO 07-08 13:04:44 api_server.py:206] vLLM API server version 0.5.1
INFO 07-08 13:04:44 api_server.py:207] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/workdir/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-08 13:04:44 config.py:698] Defaulting to use mp for distributed inference
INFO 07-08 13:04:44 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/workdir/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='/workdir/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/workdir/Mixtral-8x7B-Instruct-v0.1, use_v2_block_manager=False, enable_prefix_caching=False)
(VllmWorkerProcess pid=116058) WARNING 07-08 13:04:44 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:44 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-635b15bc668843ea86e11ebe782fe81a/VLLM_TRACE_FUNCTION_for_process_116058_thread_140518624049088_at_2024-07-08_13:04:44.204203.log
WARNING 07-08 13:04:44 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 07-08 13:04:44 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-635b15bc668843ea86e11ebe782fe81a/VLLM_TRACE_FUNCTION_for_process_116024_thread_140518624049088_at_2024-07-08_13:04:44.204129.log
INFO 07-08 13:04:44 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
INFO 07-08 13:04:44 selector.py:53] Using XFormers backend.
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:44 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:44 selector.py:53] Using XFormers backend.
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:57 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
DEBUG 07-08 13:04:57 parallel_state.py:799] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:40507 backend=nccl
(VllmWorkerProcess pid=116058) DEBUG 07-08 13:04:57 parallel_state.py:799] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:40507 backend=nccl
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:57 utils.py:741] Found nccl from library libnccl.so.2
INFO 07-08 13:04:57 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:57 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-08 13:04:57 pynccl.py:63] vLLM is using nccl==2.20.5
hostname:116024:116024 [0] NCCL INFO Bootstrap : Using eth0:10.164.53.65<0>
hostname:116024:116024 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hostname:116024:116024 [0] NCCL INFO cudaDriverVersion 11080
NCCL version 2.20.5+cuda11.0
hostname:116058:116058 [1] NCCL INFO cudaDriverVersion 11080
hostname:116058:116058 [1] NCCL INFO Bootstrap : Using eth0:10.164.53.65<0>
hostname:116058:116058 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hostname:116024:116024 [0] NCCL INFO NET/IB : Using [0]={[0] mlx5_7:1/RoCE, [1] mlx5_6:1/RoCE} ; OOB eth0:10.164.53.65<0>
hostname:116024:116024 [0] NCCL INFO Using non-device net plugin version 0
hostname:116024:116024 [0] NCCL INFO Using network IB
hostname:116058:116058 [1] NCCL INFO NET/IB : Using [0]={[0] mlx5_7:1/RoCE, [1] mlx5_6:1/RoCE} ; OOB eth0:10.164.53.65<0>
hostname:116058:116058 [1] NCCL INFO Using non-device net plugin version 0
hostname:116058:116058 [1] NCCL INFO Using network IB
hostname:116058:116058 [1] NCCL INFO comm 0xde702f0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d0000 commId 0x9df62c06a0653412 - Init START
hostname:116024:116024 [0] NCCL INFO comm 0xde719e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId cb000 commId 0x9df62c06a0653412 - Init START
hostname:116058:116058 [1] NCCL INFO NCCL_IB_GID_INDEX set by environment to 7.
hostname:116024:116024 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 7.
hostname:116058:116058 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
hostname:116024:116024 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
hostname:116058:116058 [1] NCCL INFO Setting affinity for GPU 1 to 01e000,00000000,00000000,0001e000,00000000,00000000
hostname:116024:116024 [0] NCCL INFO Setting affinity for GPU 0 to 01e000,00000000,00000000,0001e000,00000000,00000000
hostname:116058:116058 [1] NCCL INFO comm 0xde702f0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
hostname:116024:116024 [0] NCCL INFO comm 0xde719e0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
hostname:116024:116024 [0] NCCL INFO Channel 00/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 01/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 02/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 03/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 04/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 05/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 06/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 07/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 08/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 09/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 10/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 11/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 12/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 13/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 14/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 15/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 16/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 17/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 18/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 19/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 20/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 21/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 22/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Channel 23/24 :    0   1
hostname:116024:116024 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1
hostname:116024:116024 [0] NCCL INFO P2P Chunksize set to 524288
hostname:116058:116058 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
hostname:116058:116058 [1] NCCL INFO P2P Chunksize set to 524288
hostname:116024:116024 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Connected all rings
hostname:116024:116024 [0] NCCL INFO Connected all trees
hostname:116058:116058 [1] NCCL INFO Connected all rings
hostname:116058:116058 [1] NCCL INFO Connected all trees
hostname:116058:116058 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
hostname:116058:116058 [1] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
hostname:116024:116024 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
hostname:116024:116024 [0] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
hostname:116024:116024 [0] NCCL INFO comm 0xde719e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId cb000 commId 0x9df62c06a0653412 - Init COMPLETE
hostname:116058:116058 [1] NCCL INFO comm 0xde702f0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d0000 commId 0x9df62c06a0653412 - Init COMPLETE
INFO 07-08 13:04:58 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:58 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
INFO 07-08 13:04:58 selector.py:53] Using XFormers backend.
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:58 selector.py:53] Using XFormers backend.
INFO 07-08 13:05:19 model_runner.py:255] Loading model weights took 43.5064 GB
(VllmWorkerProcess pid=116058) INFO 07-08 13:05:19 model_runner.py:255] Loading model weights took 43.5064 GB
hostname:116024:116181 [0] NCCL INFO Using non-device net plugin version 0
hostname:116024:116181 [0] NCCL INFO Using network IB
hostname:116058:116182 [1] NCCL INFO Using non-device net plugin version 0
hostname:116058:116182 [1] NCCL INFO Using network IB
hostname:116058:116182 [1] NCCL INFO comm 0x10cb39f0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d0000 commId 0xae62d389292366e4 - Init START
hostname:116024:116181 [0] NCCL INFO comm 0x10cb0bc0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId cb000 commId 0xae62d389292366e4 - Init START
hostname:116024:116181 [0] NCCL INFO Setting affinity for GPU 0 to 01e000,00000000,00000000,0001e000,00000000,00000000
hostname:116058:116182 [1] NCCL INFO Setting affinity for GPU 1 to 01e000,00000000,00000000,0001e000,00000000,00000000
hostname:116024:116181 [0] NCCL INFO comm 0x10cb0bc0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
hostname:116058:116182 [1] NCCL INFO comm 0x10cb39f0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
hostname:116024:116181 [0] NCCL INFO Channel 00/24 :    0   1
hostname:116058:116182 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
hostname:116024:116181 [0] NCCL INFO Channel 01/24 :    0   1
hostname:116058:116182 [1] NCCL INFO P2P Chunksize set to 524288
hostname:116024:116181 [0] NCCL INFO Channel 02/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 03/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 04/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 05/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 06/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 07/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 08/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 09/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 10/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 11/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 12/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 13/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 14/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 15/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 16/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 17/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 18/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 19/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 20/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 21/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 22/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Channel 23/24 :    0   1
hostname:116024:116181 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1
hostname:116024:116181 [0] NCCL INFO P2P Chunksize set to 524288
hostname:116024:116181 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Connected all rings
hostname:116058:116182 [1] NCCL INFO Connected all trees
hostname:116024:116181 [0] NCCL INFO Connected all rings
hostname:116058:116182 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
hostname:116024:116181 [0] NCCL INFO Connected all trees
hostname:116058:116182 [1] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
hostname:116024:116181 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
hostname:116024:116181 [0] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
hostname:116024:116181 [0] NCCL INFO comm 0x10cb0bc0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId cb000 commId 0xae62d389292366e4 - Init COMPLETE
hostname:116058:116182 [1] NCCL INFO comm 0x10cb39f0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d0000 commId 0xae62d389292366e4 - Init COMPLETE
INFO 07-08 13:05:20 fused_moe.py:301] Using configuration from /usr/local/lib/python3.9/site-packages/vllm/model_executor/layers/fused_moe/configs/E=8,N=7168,device_name=NVIDIA_A100-SXM4-80GB.json for MoE layer.
(VllmWorkerProcess pid=116058) INFO 07-08 13:05:20 fused_moe.py:301] Using configuration from /usr/local/lib/python3.9/site-packages/vllm/model_executor/layers/fused_moe/configs/E=8,N=7168,device_name=NVIDIA_A100-SXM4-80GB.json for MoE layer.
hostname:116058:116215 [1] NCCL INFO Channel 00/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 01/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 02/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 03/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 04/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 05/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 06/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 07/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 08/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 09/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 10/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 11/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 12/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 13/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 14/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 15/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 16/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 17/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 18/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 19/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 20/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 21/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 22/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 23/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 24/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 25/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 26/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 27/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 28/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 29/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 30/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 31/1 : 1[1] -> 0[0] via P2P/IPC/read
INFO 07-08 13:05:27 distributed_gpu_executor.py:56] # GPU blocks: 22480, # CPU blocks: 4096
INFO 07-08 13:05:31 model_runner.py:924] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-08 13:05:31 model_runner.py:928] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=116058) INFO 07-08 13:05:31 model_runner.py:924] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=116058) INFO 07-08 13:05:31 model_runner.py:928] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.

hostname:116024:116024 [0] misc/strongstream.cc:53 NCCL WARN NCCL cannot be captured in a graph if either it wasn't built with CUDA runtime >= 11.3 or if the installed CUDA driver < R465.
hostname:116024:116024 [0] NCCL INFO enqueue.cc:1935 -> 5
hostname:116024:116024 [0] NCCL INFO enqueue.cc:1976 -> 5
hostname:116024:116024 [0] NCCL INFO enqueue.cc:1981 -> 5

hostname:116058:116058 [1] misc/strongstream.cc:53 NCCL WARN NCCL cannot be captured in a graph if either it wasn't built with CUDA runtime >= 11.3 or if the installed CUDA driver < R465.
hostname:116058:116058 [1] NCCL INFO enqueue.cc:1935 -> 5
hostname:116058:116058 [1] NCCL INFO enqueue.cc:1976 -> 5
hostname:116058:116058 [1] NCCL INFO enqueue.cc:1981 -> 5
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method initialize_cache: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details), Traceback (most recent call last):
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/vllm/worker/worker.py", line 214, in initialize_cache
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     self._warm_up_model()
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/vllm/worker/worker.py", line 230, in _warm_up_model
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     self.model_runner.capture_model(self.gpu_cache)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1109, in capture_model
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     graph_runner.capture(**capture_inputs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1340, in capture
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     output_hidden_or_intermediate_states = self.model(
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/models/mixtral.py", line 348, in forward
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/models/mixtral.py", line 272, in forward
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     hidden_states = self.embed_tokens(input_ids)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 350, in forward
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     output = tensor_model_parallel_all_reduce(output_parallel)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/vllm/distributed/communication_op.py", line 11, in tensor_model_parallel_all_reduce
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     return get_tp_group().all_reduce(input_)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/vllm/distributed/parallel_state.py", line 290, in all_reduce
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     pynccl_comm.all_reduce(input_)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/vllm/distributed/device_communicators/pynccl.py", line 118, in all_reduce
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     self.nccl.ncclAllReduce(buffer_type(tensor.data_ptr()),
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 257, in ncclAllReduce
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     self.NCCL_CHECK(self._funcs["ncclAllReduce"](sendbuff, recvbuff, count,
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.9/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]     raise RuntimeError(f"NCCL error: {error_str}")
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 216, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 360, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 507, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 256, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 366, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/executor/distributed_gpu_executor.py", line 62, in initialize_cache
[rank0]:     self._run_workers("initialize_cache",
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/executor/multiproc_gpu_executor.py", line 130, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/worker/worker.py", line 214, in initialize_cache
[rank0]:     self._warm_up_model()
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/worker/worker.py", line 230, in _warm_up_model
[rank0]:     self.model_runner.capture_model(self.gpu_cache)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1109, in capture_model
[rank0]:     graph_runner.capture(**capture_inputs)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1340, in capture
[rank0]:     output_hidden_or_intermediate_states = self.model(
[rank0]:   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/models/mixtral.py", line 348, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/models/mixtral.py", line 272, in forward
[rank0]:     hidden_states = self.embed_tokens(input_ids)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 350, in forward
[rank0]:     output = tensor_model_parallel_all_reduce(output_parallel)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/distributed/communication_op.py", line 11, in tensor_model_parallel_all_reduce
[rank0]:     return get_tp_group().all_reduce(input_)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/distributed/parallel_state.py", line 290, in all_reduce
[rank0]:     pynccl_comm.all_reduce(input_)
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/distributed/device_communicators/pynccl.py", line 118, in all_reduce
[rank0]:     self.nccl.ncclAllReduce(buffer_type(tensor.data_ptr()),
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 257, in ncclAllReduce
[rank0]:     self.NCCL_CHECK(self._funcs["ncclAllReduce"](sendbuff, recvbuff, count,
[rank0]:   File "/usr/local/lib/python3.9/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
[rank0]:     raise RuntimeError(f"NCCL error: {error_str}")
[rank0]: RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
INFO 07-08 13:05:38 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x11df7d0)

Current thread 0x00007fcd0aabb7c0 (most recent call first):
<no Python frame>

Current thread 0x00007fcd0aabb7c0 (most recent call first):
<no Python frame>
/usr/local/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Aborted (core dumped)
zhyncs commented 2 months ago

The existing machines and historical software versions have been around for a long time, so perhaps we should also consider compatibility. For large companies with self-built data centers, the upgrade process can take quite a long time.

youkaichao commented 2 months ago

You can see this warning in your log:

hostname:116058:116058 [1] misc/strongstream.cc:53 NCCL WARN NCCL cannot be captured in a graph if either it wasn't built with CUDA runtime >= 11.3 or if the installed CUDA driver < R465.

I would suggest contacting your admin to update the driver. Alternatively, you can try adding --enforce-eager.
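For example, a minimal sketch of both options (reusing the model path from this report; other flags used later in this thread may still be needed in your environment):

# check the installed NVIDIA driver version (the NCCL warning above requires >= R465 for graph capture)
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# or skip CUDA graph capture entirely by adding --enforce-eager to the server command
python3 -m vllm.entrypoints.openai.api_server --model /workdir/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --enforce-eager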

zhyncs commented 2 months ago
# server
# vLLM 0.5.1
python3 -m vllm.entrypoints.openai.api_server --model /workdir/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --disable-custom-all-reduce --enforce-eager
# client
python3 benchmark_serving.py --backend vllm --host 127.0.0.1 --port 8000 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --model /workdir/Mixtral-8x7B-Instruct-v0.1 --num-prompts 1000 --request-rate 128

The server can now start normally, but performance is quite poor, to the point of being nearly unusable. Do --disable-custom-all-reduce and --enforce-eager really have such a big impact on performance?

youkaichao commented 2 months ago

Yes, both of them reduce performance.

zhyncs commented 2 months ago

Upgrading host machine drivers usually has a long cycle, and for an online production environment the upgrade needs to be fully verified first. As far as I know, many companies run a large number of historical setups similar to the one described in this issue, so the current compatibility of vLLM 0.5.1 with these environments is not very user-friendly.

zhyncs commented 2 months ago

"Yes, both of them reduce performance."

Makes sense.

youkaichao commented 2 months ago

We are a small team and can only test and optimize performance for mainstream settings. A very old driver is not something we will support anyway; it can break at any time without any guarantee. The mainstream driver we see on GCP / AWS should be 535 or so.

zhyncs commented 2 months ago

I understand that you currently have no plans to work on compatibility, which is completely reasonable given how small the team is. I have closed the issue for now; if more people encounter this problem in the future, we can consider reopening it or finding a better solution. Thanks anyway.

Lzhang-hub commented 1 month ago

@zhyncs Regarding "The server can start normally, but the performance is quite poor, nearly in an unavailable state": this may be caused by running the serving test with export VLLM_LOGGING_LEVEL=DEBUG, export CUDA_LAUNCH_BLOCKING=1, export NCCL_DEBUG=TRACE, and export VLLM_TRACE_FUNCTION=1 set. You can try starting the server without these environment variables.
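For example, a minimal sketch reusing the same flags as the earlier server command:

# make sure none of the debug/tracing variables are set in the shell
unset VLLM_LOGGING_LEVEL CUDA_LAUNCH_BLOCKING NCCL_DEBUG VLLM_TRACE_FUNCTION
# restart the server and rerun the benchmark without them
python3 -m vllm.entrypoints.openai.api_server --model /workdir/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --disable-custom-all-reduce --enforce-eager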