zhyncs closed this issue 2 months ago
Can you follow https://docs.vllm.ai/en/latest/getting_started/debugging.html and provide more information?
OK, here is the output with debug logging enabled:
export VLLM_LOGGING_LEVEL=DEBUG
export CUDA_LAUNCH_BLOCKING=1
export NCCL_DEBUG=TRACE
export VLLM_TRACE_FUNCTION=1
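For reference, the debug variables above are exported before launching the server; the full invocation (the same command shown later in the thread, minus --disable-custom-all-reduce) would look roughly like this:

```shell
# Enable verbose vLLM / CUDA / NCCL debugging, then launch the OpenAI-compatible server.
export VLLM_LOGGING_LEVEL=DEBUG     # vLLM debug-level logging
export CUDA_LAUNCH_BLOCKING=1       # synchronous CUDA kernel launches (easier to localize crashes)
export NCCL_DEBUG=TRACE             # verbose NCCL tracing
export VLLM_TRACE_FUNCTION=1        # record every Python function call (very slow; debugging only)

python3 -m vllm.entrypoints.openai.api_server \
  --model /workdir/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 2
```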
INFO 07-08 11:49:48 api_server.py:206] vLLM API server version 0.5.1
INFO 07-08 11:49:48 api_server.py:207] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/workdir/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', 
typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-08 11:49:48 config.py:698] Defaulting to use mp for distributed inference
INFO 07-08 11:49:48 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/workdir/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='/workdir/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/workdir/Mixtral-8x7B-Instruct-v0.1, use_v2_block_manager=False, enable_prefix_caching=False)
(VllmWorkerProcess pid=105174) WARNING 07-08 11:49:48 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
WARNING 07-08 11:49:48 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 07-08 11:49:48 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-5545b375c37d4ea29b6b862c47cfd1ad/VLLM_TRACE_FUNCTION_for_process_105155_thread_140020318103488_at_2024-07-08_11:49:48.812089.log
(VllmWorkerProcess pid=105174) INFO 07-08 11:49:48 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-5545b375c37d4ea29b6b862c47cfd1ad/VLLM_TRACE_FUNCTION_for_process_105174_thread_140020318103488_at_2024-07-08_11:49:48.812099.log
INFO 07-08 11:49:49 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
INFO 07-08 11:49:49 selector.py:53] Using XFormers backend.
(VllmWorkerProcess pid=105174) INFO 07-08 11:49:49 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
(VllmWorkerProcess pid=105174) INFO 07-08 11:49:49 selector.py:53] Using XFormers backend.
(VllmWorkerProcess pid=105174) INFO 07-08 11:50:02 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
DEBUG 07-08 11:50:02 parallel_state.py:799] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:46887 backend=nccl
(VllmWorkerProcess pid=105174) DEBUG 07-08 11:50:02 parallel_state.py:799] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:46887 backend=nccl
(VllmWorkerProcess pid=105174) INFO 07-08 11:50:02 utils.py:741] Found nccl from library libnccl.so.2
INFO 07-08 11:50:02 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=105174) INFO 07-08 11:50:02 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-08 11:50:02 pynccl.py:63] vLLM is using nccl==2.20.5
hostname:105155:105155 [0] NCCL INFO Bootstrap : Using eth0:10.164.53.65<0>
hostname:105155:105155 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hostname:105155:105155 [0] NCCL INFO cudaDriverVersion 11080
NCCL version 2.20.5+cuda11.0
hostname:105174:105174 [1] NCCL INFO cudaDriverVersion 11080
hostname:105174:105174 [1] NCCL INFO Bootstrap : Using eth0:10.164.53.65<0>
hostname:105174:105174 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hostname:105155:105155 [0] NCCL INFO NET/IB : Using [0]={[0] mlx5_7:1/RoCE, [1] mlx5_6:1/RoCE} ; OOB eth0:10.164.53.65<0>
hostname:105155:105155 [0] NCCL INFO Using non-device net plugin version 0
hostname:105155:105155 [0] NCCL INFO Using network IB
hostname:105174:105174 [1] NCCL INFO NET/IB : Using [0]={[0] mlx5_7:1/RoCE, [1] mlx5_6:1/RoCE} ; OOB eth0:10.164.53.65<0>
hostname:105174:105174 [1] NCCL INFO Using non-device net plugin version 0
hostname:105174:105174 [1] NCCL INFO Using network IB
hostname:105174:105174 [1] NCCL INFO comm 0xd3f6db0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d0000 commId 0xee65a98d1a9d5af3 - Init START
hostname:105155:105155 [0] NCCL INFO comm 0xd3f8480 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId cb000 commId 0xee65a98d1a9d5af3 - Init START
hostname:105174:105174 [1] NCCL INFO NCCL_IB_GID_INDEX set by environment to 7.
hostname:105155:105155 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 7.
hostname:105174:105174 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
hostname:105155:105155 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
hostname:105174:105174 [1] NCCL INFO Setting affinity for GPU 1 to 01e000,00000000,00000000,0001e000,00000000,00000000
hostname:105155:105155 [0] NCCL INFO Setting affinity for GPU 0 to 01e000,00000000,00000000,0001e000,00000000,00000000
hostname:105174:105174 [1] NCCL INFO comm 0xd3f6db0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
hostname:105155:105155 [0] NCCL INFO comm 0xd3f8480 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
hostname:105155:105155 [0] NCCL INFO Channel 00/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 01/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 02/24 : 0 1
hostname:105174:105174 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
hostname:105155:105155 [0] NCCL INFO Channel 03/24 : 0 1
hostname:105174:105174 [1] NCCL INFO P2P Chunksize set to 524288
hostname:105155:105155 [0] NCCL INFO Channel 04/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 05/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 06/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 07/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 08/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 09/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 10/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 11/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 12/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 13/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 14/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 15/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 16/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 17/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 18/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 19/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 20/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 21/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 22/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Channel 23/24 : 0 1
hostname:105155:105155 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1
hostname:105155:105155 [0] NCCL INFO P2P Chunksize set to 524288
hostname:105174:105174 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:105174:105174 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:105155:105155 [0] NCCL INFO Connected all rings
hostname:105155:105155 [0] NCCL INFO Connected all trees
hostname:105174:105174 [1] NCCL INFO Connected all rings
hostname:105174:105174 [1] NCCL INFO Connected all trees
hostname:105174:105174 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
hostname:105174:105174 [1] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
hostname:105155:105155 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
hostname:105155:105155 [0] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
hostname:105155:105155 [0] NCCL INFO comm 0xd3f8480 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId cb000 commId 0xee65a98d1a9d5af3 - Init COMPLETE
hostname:105174:105174 [1] NCCL INFO comm 0xd3f6db0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d0000 commId 0xee65a98d1a9d5af3 - Init COMPLETE
INFO 07-08 11:50:03 custom_all_reduce_utils.py:202] generating GPU P2P access cache in /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
cc @youkaichao
One potential issue I can see is that your driver version is very old:
Nvidia driver version: 470.103.01
It's inconvenient to upgrade the host drivers on the company's machines. For what it's worth, with the same machine and environment configuration, tensor parallelism with LMDeploy works fine.
Try adding --disable-custom-all-reduce. Your driver might be too old, and the custom allreduce initialization check might fail.
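If the custom allreduce path is suspect, a rough way to sanity-check GPU peer-to-peer support outside vLLM is the sketch below. Note this is only a generic diagnostic, not vLLM's exact initialization check:

```shell
# Show the link topology between GPUs (look at the GPU0<->GPU1 cell: NV#, PIX, PHB, ...)
nvidia-smi topo -m

# Report the host driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Ask PyTorch whether the driver reports P2P access between GPU 0 and GPU 1
python3 -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"
```

If `can_device_access_peer` returns False here, the NCCL P2P/IPC channels in the logs above are not necessarily affected, but the custom allreduce kernel would be expected to fall back or fail.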
export VLLM_LOGGING_LEVEL=DEBUG
export CUDA_LAUNCH_BLOCKING=1
export NCCL_DEBUG=TRACE
export VLLM_TRACE_FUNCTION=1
python3 -m vllm.entrypoints.openai.api_server --model /workdir/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --disable-custom-all-reduce
INFO 07-08 13:04:44 api_server.py:206] vLLM API server version 0.5.1
INFO 07-08 13:04:44 api_server.py:207] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/workdir/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', 
typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-08 13:04:44 config.py:698] Defaulting to use mp for distributed inference
INFO 07-08 13:04:44 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/workdir/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='/workdir/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/workdir/Mixtral-8x7B-Instruct-v0.1, use_v2_block_manager=False, enable_prefix_caching=False)
(VllmWorkerProcess pid=116058) WARNING 07-08 13:04:44 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:44 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-635b15bc668843ea86e11ebe782fe81a/VLLM_TRACE_FUNCTION_for_process_116058_thread_140518624049088_at_2024-07-08_13:04:44.204203.log
WARNING 07-08 13:04:44 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 07-08 13:04:44 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-635b15bc668843ea86e11ebe782fe81a/VLLM_TRACE_FUNCTION_for_process_116024_thread_140518624049088_at_2024-07-08_13:04:44.204129.log
INFO 07-08 13:04:44 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
INFO 07-08 13:04:44 selector.py:53] Using XFormers backend.
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:44 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:44 selector.py:53] Using XFormers backend.
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:57 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
DEBUG 07-08 13:04:57 parallel_state.py:799] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:40507 backend=nccl
(VllmWorkerProcess pid=116058) DEBUG 07-08 13:04:57 parallel_state.py:799] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:40507 backend=nccl
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:57 utils.py:741] Found nccl from library libnccl.so.2
INFO 07-08 13:04:57 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:57 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-08 13:04:57 pynccl.py:63] vLLM is using nccl==2.20.5
hostname:116024:116024 [0] NCCL INFO Bootstrap : Using eth0:10.164.53.65<0>
hostname:116024:116024 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hostname:116024:116024 [0] NCCL INFO cudaDriverVersion 11080
NCCL version 2.20.5+cuda11.0
hostname:116058:116058 [1] NCCL INFO cudaDriverVersion 11080
hostname:116058:116058 [1] NCCL INFO Bootstrap : Using eth0:10.164.53.65<0>
hostname:116058:116058 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hostname:116024:116024 [0] NCCL INFO NET/IB : Using [0]={[0] mlx5_7:1/RoCE, [1] mlx5_6:1/RoCE} ; OOB eth0:10.164.53.65<0>
hostname:116024:116024 [0] NCCL INFO Using non-device net plugin version 0
hostname:116024:116024 [0] NCCL INFO Using network IB
hostname:116058:116058 [1] NCCL INFO NET/IB : Using [0]={[0] mlx5_7:1/RoCE, [1] mlx5_6:1/RoCE} ; OOB eth0:10.164.53.65<0>
hostname:116058:116058 [1] NCCL INFO Using non-device net plugin version 0
hostname:116058:116058 [1] NCCL INFO Using network IB
hostname:116058:116058 [1] NCCL INFO comm 0xde702f0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d0000 commId 0x9df62c06a0653412 - Init START
hostname:116024:116024 [0] NCCL INFO comm 0xde719e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId cb000 commId 0x9df62c06a0653412 - Init START
hostname:116058:116058 [1] NCCL INFO NCCL_IB_GID_INDEX set by environment to 7.
hostname:116024:116024 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 7.
hostname:116058:116058 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
hostname:116024:116024 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
hostname:116058:116058 [1] NCCL INFO Setting affinity for GPU 1 to 01e000,00000000,00000000,0001e000,00000000,00000000
hostname:116024:116024 [0] NCCL INFO Setting affinity for GPU 0 to 01e000,00000000,00000000,0001e000,00000000,00000000
hostname:116058:116058 [1] NCCL INFO comm 0xde702f0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
hostname:116024:116024 [0] NCCL INFO comm 0xde719e0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
hostname:116024:116024 [0] NCCL INFO Channel 00/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 01/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 02/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 03/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 04/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 05/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 06/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 07/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 08/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 09/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 10/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 11/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 12/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 13/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 14/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 15/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 16/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 17/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 18/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 19/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 20/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 21/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 22/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Channel 23/24 : 0 1
hostname:116024:116024 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1
hostname:116024:116024 [0] NCCL INFO P2P Chunksize set to 524288
hostname:116058:116058 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
hostname:116058:116058 [1] NCCL INFO P2P Chunksize set to 524288
hostname:116024:116024 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116058 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116024 [0] NCCL INFO Connected all rings
hostname:116024:116024 [0] NCCL INFO Connected all trees
hostname:116058:116058 [1] NCCL INFO Connected all rings
hostname:116058:116058 [1] NCCL INFO Connected all trees
hostname:116058:116058 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
hostname:116058:116058 [1] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
hostname:116024:116024 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
hostname:116024:116024 [0] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
hostname:116024:116024 [0] NCCL INFO comm 0xde719e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId cb000 commId 0x9df62c06a0653412 - Init COMPLETE
hostname:116058:116058 [1] NCCL INFO comm 0xde702f0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d0000 commId 0x9df62c06a0653412 - Init COMPLETE
INFO 07-08 13:04:58 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:58 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
INFO 07-08 13:04:58 selector.py:53] Using XFormers backend.
(VllmWorkerProcess pid=116058) INFO 07-08 13:04:58 selector.py:53] Using XFormers backend.
INFO 07-08 13:05:19 model_runner.py:255] Loading model weights took 43.5064 GB
(VllmWorkerProcess pid=116058) INFO 07-08 13:05:19 model_runner.py:255] Loading model weights took 43.5064 GB
hostname:116024:116181 [0] NCCL INFO Using non-device net plugin version 0
hostname:116024:116181 [0] NCCL INFO Using network IB
hostname:116058:116182 [1] NCCL INFO Using non-device net plugin version 0
hostname:116058:116182 [1] NCCL INFO Using network IB
hostname:116058:116182 [1] NCCL INFO comm 0x10cb39f0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d0000 commId 0xae62d389292366e4 - Init START
hostname:116024:116181 [0] NCCL INFO comm 0x10cb0bc0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId cb000 commId 0xae62d389292366e4 - Init START
hostname:116024:116181 [0] NCCL INFO Setting affinity for GPU 0 to 01e000,00000000,00000000,0001e000,00000000,00000000
hostname:116058:116182 [1] NCCL INFO Setting affinity for GPU 1 to 01e000,00000000,00000000,0001e000,00000000,00000000
hostname:116024:116181 [0] NCCL INFO comm 0x10cb0bc0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
hostname:116058:116182 [1] NCCL INFO comm 0x10cb39f0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
hostname:116024:116181 [0] NCCL INFO Channel 00/24 : 0 1
hostname:116058:116182 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
hostname:116024:116181 [0] NCCL INFO Channel 01/24 : 0 1
hostname:116058:116182 [1] NCCL INFO P2P Chunksize set to 524288
hostname:116024:116181 [0] NCCL INFO Channel 02/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 03/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 04/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 05/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 06/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 07/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 08/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 09/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 10/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 11/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 12/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 13/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 14/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 15/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 16/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 17/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 18/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 19/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 20/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 21/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 22/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Channel 23/24 : 0 1
hostname:116024:116181 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1
hostname:116024:116181 [0] NCCL INFO P2P Chunksize set to 524288
hostname:116024:116181 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116024:116181 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116182 [1] NCCL INFO Connected all rings
hostname:116058:116182 [1] NCCL INFO Connected all trees
hostname:116024:116181 [0] NCCL INFO Connected all rings
hostname:116058:116182 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
hostname:116024:116181 [0] NCCL INFO Connected all trees
hostname:116058:116182 [1] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
hostname:116024:116181 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
hostname:116024:116181 [0] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
hostname:116024:116181 [0] NCCL INFO comm 0x10cb0bc0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId cb000 commId 0xae62d389292366e4 - Init COMPLETE
hostname:116058:116182 [1] NCCL INFO comm 0x10cb39f0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId d0000 commId 0xae62d389292366e4 - Init COMPLETE
INFO 07-08 13:05:20 fused_moe.py:301] Using configuration from /usr/local/lib/python3.9/site-packages/vllm/model_executor/layers/fused_moe/configs/E=8,N=7168,device_name=NVIDIA_A100-SXM4-80GB.json for MoE layer.
(VllmWorkerProcess pid=116058) INFO 07-08 13:05:20 fused_moe.py:301] Using configuration from /usr/local/lib/python3.9/site-packages/vllm/model_executor/layers/fused_moe/configs/E=8,N=7168,device_name=NVIDIA_A100-SXM4-80GB.json for MoE layer.
hostname:116058:116215 [1] NCCL INFO Channel 00/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 01/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 02/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 03/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 04/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 05/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 06/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 07/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 08/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 09/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 10/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 11/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 12/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 13/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 14/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 15/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 16/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 17/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 18/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 19/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 20/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 21/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 22/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 23/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 24/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 25/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 26/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 27/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 28/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 29/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 30/1 : 1[1] -> 0[0] via P2P/IPC/read
hostname:116058:116215 [1] NCCL INFO Channel 31/1 : 1[1] -> 0[0] via P2P/IPC/read
INFO 07-08 13:05:27 distributed_gpu_executor.py:56] # GPU blocks: 22480, # CPU blocks: 4096
INFO 07-08 13:05:31 model_runner.py:924] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-08 13:05:31 model_runner.py:928] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=116058) INFO 07-08 13:05:31 model_runner.py:924] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=116058) INFO 07-08 13:05:31 model_runner.py:928] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
hostname:116024:116024 [0] misc/strongstream.cc:53 NCCL WARN NCCL cannot be captured in a graph if either it wasn't built with CUDA runtime >= 11.3 or if the installed CUDA driver < R465.
hostname:116024:116024 [0] NCCL INFO enqueue.cc:1935 -> 5
hostname:116024:116024 [0] NCCL INFO enqueue.cc:1976 -> 5
hostname:116024:116024 [0] NCCL INFO enqueue.cc:1981 -> 5
hostname:116058:116058 [1] misc/strongstream.cc:53 NCCL WARN NCCL cannot be captured in a graph if either it wasn't built with CUDA runtime >= 11.3 or if the installed CUDA driver < R465.
hostname:116058:116058 [1] NCCL INFO enqueue.cc:1935 -> 5
hostname:116058:116058 [1] NCCL INFO enqueue.cc:1976 -> 5
hostname:116058:116058 [1] NCCL INFO enqueue.cc:1981 -> 5
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method initialize_cache: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details), Traceback (most recent call last):
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/vllm/worker/worker.py", line 214, in initialize_cache
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] self._warm_up_model()
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/vllm/worker/worker.py", line 230, in _warm_up_model
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] self.model_runner.capture_model(self.gpu_cache)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1109, in capture_model
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] graph_runner.capture(**capture_inputs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1340, in capture
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] output_hidden_or_intermediate_states = self.model(
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/models/mixtral.py", line 348, in forward
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/models/mixtral.py", line 272, in forward
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] hidden_states = self.embed_tokens(input_ids)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 350, in forward
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] output = tensor_model_parallel_all_reduce(output_parallel)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/vllm/distributed/communication_op.py", line 11, in tensor_model_parallel_all_reduce
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] return get_tp_group().all_reduce(input_)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/vllm/distributed/parallel_state.py", line 290, in all_reduce
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] pynccl_comm.all_reduce(input_)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/vllm/distributed/device_communicators/pynccl.py", line 118, in all_reduce
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] self.nccl.ncclAllReduce(buffer_type(tensor.data_ptr()),
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 257, in ncclAllReduce
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] self.NCCL_CHECK(self._funcs["ncclAllReduce"](sendbuff, recvbuff, count,
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.9/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] raise RuntimeError(f"NCCL error: {error_str}")
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(VllmWorkerProcess pid=116058) ERROR 07-08 13:05:36 multiproc_worker_utils.py:226]
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 216, in <module>
[rank0]: engine = AsyncLLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 360, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 507, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 256, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 366, in _initialize_kv_caches
[rank0]: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/executor/distributed_gpu_executor.py", line 62, in initialize_cache
[rank0]: self._run_workers("initialize_cache",
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/executor/multiproc_gpu_executor.py", line 130, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/worker/worker.py", line 214, in initialize_cache
[rank0]: self._warm_up_model()
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/worker/worker.py", line 230, in _warm_up_model
[rank0]: self.model_runner.capture_model(self.gpu_cache)
[rank0]: File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1109, in capture_model
[rank0]: graph_runner.capture(**capture_inputs)
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1340, in capture
[rank0]: output_hidden_or_intermediate_states = self.model(
[rank0]: File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/models/mixtral.py", line 348, in forward
[rank0]: hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]: File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/models/mixtral.py", line 272, in forward
[rank0]: hidden_states = self.embed_tokens(input_ids)
[rank0]: File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 350, in forward
[rank0]: output = tensor_model_parallel_all_reduce(output_parallel)
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/distributed/communication_op.py", line 11, in tensor_model_parallel_all_reduce
[rank0]: return get_tp_group().all_reduce(input_)
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/distributed/parallel_state.py", line 290, in all_reduce
[rank0]: pynccl_comm.all_reduce(input_)
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/distributed/device_communicators/pynccl.py", line 118, in all_reduce
[rank0]: self.nccl.ncclAllReduce(buffer_type(tensor.data_ptr()),
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 257, in ncclAllReduce
[rank0]: self.NCCL_CHECK(self._funcs["ncclAllReduce"](sendbuff, recvbuff, count,
[rank0]: File "/usr/local/lib/python3.9/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
[rank0]: raise RuntimeError(f"NCCL error: {error_str}")
[rank0]: RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
INFO 07-08 13:05:38 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x11df7d0)
Current thread 0x00007fcd0aabb7c0 (most recent call first):
<no Python frame>
/usr/local/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Aborted (core dumped)
These machines and their historical software versions have been in place for a long time, so perhaps compatibility should also be considered. For large companies with self-built data centers, a driver upgrade can take a long time to roll out.
You can see the relevant warning here:
hostname:116058:116058 [1] misc/strongstream.cc:53 NCCL WARN NCCL cannot be captured in a graph if either it wasn't built with CUDA runtime >= 11.3 or if the installed CUDA driver < R465.
I would suggest you contact your admin to update the driver, or you can try adding --enforce-eager.
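The warning is about the driver-version requirement for graph capture (CUDA runtime >= 11.3 and driver >= R465). Before deciding between a driver upgrade and --enforce-eager, a quick version check can confirm which side of the threshold you are on. A minimal sketch (the helper names and the nvidia-smi parsing are illustrative, not part of vLLM):

```python
import re
import subprocess

# NCCL refuses CUDA graph capture on drivers older than R465.
MIN_GRAPH_CAPTURE_DRIVER = (465, 0)

def parse_driver_version(text: str) -> tuple:
    """Extract the major/minor driver version, e.g. '460.73.01' -> (460, 73)."""
    m = re.search(r"(\d+)\.(\d+)", text)
    if not m:
        raise ValueError(f"no driver version found in: {text!r}")
    return (int(m.group(1)), int(m.group(2)))

def supports_graph_capture(version: tuple) -> bool:
    """True if the driver is new enough for NCCL ops inside CUDA graphs."""
    return version >= MIN_GRAPH_CAPTURE_DRIVER

if __name__ == "__main__":
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    version = parse_driver_version(out.splitlines()[0])
    if supports_graph_capture(version):
        print(f"driver {version} should allow CUDA graph capture")
    else:
        print(f"driver {version} is too old; use --enforce-eager or upgrade")
```

If the check reports the driver is too old, --enforce-eager avoids graph capture entirely at some throughput cost.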
# server
# vLLM 0.5.1
python3 -m vllm.entrypoints.openai.api_server --model /workdir/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --disable-custom-all-reduce --enforce-eager
# client
python3 benchmark_serving.py --backend vllm --host 127.0.0.1 --port 8000 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --model /workdir/Mixtral-8x7B-Instruct-v0.1 --num-prompts 1000 --request-rate 128
The server can start normally, but the performance is quite poor, nearly unusable. Do --disable-custom-all-reduce and --enforce-eager have such a big impact on performance?
yes, both of them reduce performance.
Upgrading host machine hardware drivers usually has a long cycle, and in an online production environment it needs to be fully verified first. As far as I know, many companies run a large number of historical versions similar to the one described in this issue, so the current compatibility of vLLM 0.5.1 with these versions is not very user-friendly.
yes, both of them reduce performance.
Makes sense.
We are a small team, and can only test and optimize performance for mainstream settings. A very old driver is not something we will support anyway; it can break at any time without any guarantee. The mainstream driver we see on GCP / AWS is 535 or so.
I understand that you currently have no plans to work on compatibility, which is completely understandable given the small team. I have closed the issue for now; if more people encounter this problem in the future, we can consider reopening it or finding a better solution. Thanks anyway.
@zhyncs
The server can start normally, but the performance is quite poor, nearly in an unavailable state
It may be caused by benchmarking the server with export VLLM_LOGGING_LEVEL=DEBUG, export CUDA_LAUNCH_BLOCKING=1, export NCCL_DEBUG=TRACE, and export VLLM_TRACE_FUNCTION=1 still set. You can try starting the server without these environment variables.
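As a concrete sketch of that suggestion: clear the debugging knobs before restarting the server, since VLLM_TRACE_FUNCTION traces every Python call and CUDA_LAUNCH_BLOCKING serializes kernel launches, both of which are meant only for debugging and will cripple serving throughput.

```shell
# Clear the debug environment set earlier in this thread before benchmarking.
unset VLLM_LOGGING_LEVEL
unset CUDA_LAUNCH_BLOCKING
unset NCCL_DEBUG
unset VLLM_TRACE_FUNCTION

# Then restart the server with the same command as above and rerun the benchmark.
```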
Your current environment
🐛 Describe the bug
ref https://github.com/vllm-project/vllm/issues/6187#issuecomment-2212874496
The startup is stuck here. cc @youkaichao