Closed zxcdsa45687 closed 2 months ago
I run into the same issue with vllm=0.4.3 with only one nccl version nvidia-nccl-cu12==2.20.5.
i meet same problem, Timed out waiting 1800000ms
ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution. ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] Traceback (most recent call last): ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/vllm/worker/worker_base.py", line 140, in execut e_method ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] return executor(*args, kwargs) ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decora te_context ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] return func(*args, *kwargs) ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/vllm/worker/worker.py", line 286, in start_worke r_execution_loop ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] while self._execute_model_non_driver(): ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/vllm/worker/worker.py", line 295, in _execute_mo del_non_driver ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] data = broadcast_tensor_dict(src=0) ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/vllm/distributed/communication_op.py", line 284, in broadcast_tensor_dict ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] torch.distributed.broadcast_object_list(recv_metadata_list, ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in w rapper ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] return func(args, kwargs) ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 264 9, in broadcast_object_list ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] broadcast(object_sizes_tensor, src=src, group=group) ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in w rapper ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] return func(*args, **kwargs) ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 214 4, in broadcast ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] work.wait() ESC[36m(RayWorkerWrapper pid=37007)ESC[0m ERROR 06-12 15:44:15 worker_base.py:148] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete ESC[36m(RayWorkerWrapper pid=37200)ESC[0m INFO 06-12 14:42:33 model_runner.py:924] Graph capturing finished in 11 secs.ESC[32m [repeated 2x across cluster]ESC[0m
same problem with vllm==0.5.0.post1
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-17 14:51:19 gptq_marlin.py:134] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 06-17 14:51:19 config.py:628] Defaulting to use mp for distributed inference
INFO 06-17 14:51:19 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/data/pretrained_models/Qwen2-72B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/data/pretrained_models/Qwen2-72B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=5000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/data/pretrained_models/Qwen2-72B-Instruct-GPTQ-Int4)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[1;36m(VllmWorkerProcess pid=1078909)[0;0m INFO 06-17 14:51:22 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
[1;36m(VllmWorkerProcess pid=1078911)[0;0m INFO 06-17 14:51:22 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
[1;36m(VllmWorkerProcess pid=1078910)[0;0m INFO 06-17 14:51:22 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
[1;36m(VllmWorkerProcess pid=1078911)[0;0m INFO 06-17 14:51:22 utils.py:645] Found nccl from library libnccl.so.2
[1;36m(VllmWorkerProcess pid=1078909)[0;0m INFO 06-17 14:51:22 utils.py:645] Found nccl from library libnccl.so.2
INFO 06-17 14:51:22 utils.py:645] Found nccl from library libnccl.so.2
[1;36m(VllmWorkerProcess pid=1078910)[0;0m INFO 06-17 14:51:22 utils.py:645] Found nccl from library libnccl.so.2
[1;36m(VllmWorkerProcess pid=1078911)[0;0m INFO 06-17 14:51:22 pynccl.py:63] vLLM is using nccl==2.20.5
[1;36m(VllmWorkerProcess pid=1078909)[0;0m INFO 06-17 14:51:22 pynccl.py:63] vLLM is using nccl==2.20.5
[1;36m(VllmWorkerProcess pid=1078910)[0;0m INFO 06-17 14:51:22 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 06-17 14:51:22 pynccl.py:63] vLLM is using nccl==2.20.5
WARNING 06-17 14:51:22 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[1;36m(VllmWorkerProcess pid=1078910)[0;0m WARNING 06-17 14:51:22 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[1;36m(VllmWorkerProcess pid=1078911)[0;0m WARNING 06-17 14:51:22 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[1;36m(VllmWorkerProcess pid=1078909)[0;0m WARNING 06-17 14:51:22 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[1;36m(VllmWorkerProcess pid=1078910)[0;0m INFO 06-17 14:51:26 model_runner.py:160] Loading model weights took 9.7410 GB
[1;36m(VllmWorkerProcess pid=1078909)[0;0m INFO 06-17 14:51:26 model_runner.py:160] Loading model weights took 9.7410 GB
[1;36m(VllmWorkerProcess pid=1078911)[0;0m INFO 06-17 14:51:26 model_runner.py:160] Loading model weights took 9.7410 GB
INFO 06-17 14:51:26 model_runner.py:160] Loading model weights took 9.7410 GB
INFO 06-17 14:51:33 distributed_gpu_executor.py:56] # GPU blocks: 9267, # CPU blocks: 3276
Server start!
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete, Traceback (most recent call last):
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] File "/data/lijinghui/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] return func(*args, **kwargs)
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] File "/data/lijinghui/vllm/vllm/worker/worker.py", line 294, in start_worker_execution_loop
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] while self._execute_model_non_driver():
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] File "/data/lijinghui/vllm/vllm/worker/worker.py", line 303, in _execute_model_non_driver
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] data = broadcast_tensor_dict(src=0)
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] File "/data/lijinghui/vllm/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] File "/data/lijinghui/vllm/vllm/distributed/parallel_state.py", line 400, in broadcast_tensor_dict
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] torch.distributed.broadcast_object_list(recv_metadata_list,
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] return func(*args, **kwargs)
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] broadcast(object_sizes_tensor, src=src, group=group)
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] return func(*args, **kwargs)
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] work.wait()
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
[1;36m(VllmWorkerProcess pid=1078909)[0;0m ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete, Traceback (most recent call last):
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/lijinghui/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] return func(*args, **kwargs)
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/lijinghui/vllm/vllm/worker/worker.py", line 294, in start_worker_execution_loop
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] while self._execute_model_non_driver():
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/lijinghui/vllm/vllm/worker/worker.py", line 303, in _execute_model_non_driver
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] data = broadcast_tensor_dict(src=0)
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/lijinghui/vllm/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/lijinghui/vllm/vllm/distributed/parallel_state.py", line 400, in broadcast_tensor_dict
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] torch.distributed.broadcast_object_list(recv_metadata_list,
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] return func(*args, **kwargs)
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] broadcast(object_sizes_tensor, src=src, group=group)
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] return func(*args, **kwargs)
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] work.wait()
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
[1;36m(VllmWorkerProcess pid=1078911)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete, Traceback (most recent call last):
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/lijinghui/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] return func(*args, **kwargs)
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/lijinghui/vllm/vllm/worker/worker.py", line 294, in start_worker_execution_loop
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] while self._execute_model_non_driver():
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/lijinghui/vllm/vllm/worker/worker.py", line 303, in _execute_model_non_driver
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] data = broadcast_tensor_dict(src=0)
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/lijinghui/vllm/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/lijinghui/vllm/vllm/distributed/parallel_state.py", line 400, in broadcast_tensor_dict
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] torch.distributed.broadcast_object_list(recv_metadata_list,
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] return func(*args, **kwargs)
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] broadcast(object_sizes_tensor, src=src, group=group)
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] return func(*args, **kwargs)
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] work.wait()
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
[1;36m(VllmWorkerProcess pid=1078910)[0;0m ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]
[2024-06-17 15:38:08 +0800] [1078766] [ERROR] Exception in ASGI application
Traceback (most recent call last):
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
^^^^^^^^^^^^^^^^^^^
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/lijinghui/online_chat/chat_fastapi_vllm.py", line 171, in predict
else:
^^
File "/data/lijinghui/online_chat/chat_fastapi_vllm.py", line 97, in inference
def inference(question, res):
^^^^^^^^^^^^^
File "/data/lijinghui/online_chat/chat_fastapi_vllm.py", line 88, in run
query = build_query(kwargs["model_path"], tokenizer, question, history)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/lijinghui/vllm/vllm/utils.py", line 699, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/data/lijinghui/vllm/vllm/entrypoints/llm.py", line 304, in generate
outputs = self._run_engine(use_tqdm=use_tqdm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/lijinghui/vllm/vllm/entrypoints/llm.py", line 556, in _run_engine
step_outputs = self.llm_engine.step()
^^^^^^^^^^^^^^^^^^^^^^
File "/data/lijinghui/vllm/vllm/engine/llm_engine.py", line 776, in step
output = self.model_executor.execute_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/lijinghui/vllm/vllm/executor/distributed_gpu_executor.py", line 76, in execute_model
return self._driver_execute_model(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/lijinghui/vllm/vllm/executor/multiproc_gpu_executor.py", line 84, in _driver_execute_model
return self.driver_worker.execute_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/data/lijinghui/vllm/vllm/worker/worker.py", line 272, in execute_model
broadcast_tensor_dict(data, src=0)
File "/data/lijinghui/vllm/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/lijinghui/vllm/vllm/distributed/parallel_state.py", line 374, in broadcast_tensor_dict
torch.distributed.broadcast_object_list([metadata_list],
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
broadcast(object_sizes_tensor, src=src, group=group)
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [192.168.12.71]:51841
when does the timeout happen? is it because the vLLM instance is idle for too long?
when does the timeout happen? Is it because the vLLM instance is idle for too long?
Yes, I need to make a request every 1800s in the background to keep it alive. But I have never encountered this problem before.
when does the timeout happen? is it because the vLLM instance is idle for too long?
i run a llm server in test environment,there is no request in night. i find vllm and local model time out in tomorrow . Cause service exception
cc @njhill
I haven't seen this before but will try to reproduce. Does it happen every time if starting a server and not then sending any requests for > 30 minutes?
@JaheimLee I've been unable to reproduce this so far. Both waiting for > 30 min after the server starts, and then waiting after making a request. Have you made any custom changes? Are you using the out-of-the-box OpenAI server?
I have the same issue with vllm 0.5.0.post1
! When using the LLM-class:
from vllm import LLM
model = LLM(model=model_name, # e.g. mistralai/Mixtral-8x7B-Instruct-v0.1
tokenizer=self.model_name, # HF tokenizer mistralai/Mixtral-8x7B-Instruct-v0.1
trust_remote_code=False,
dtype=dtype, # torch.bfloat16
tensor_parallel_size=tensor_parallel_size, # 4
gpu_memory_utilization=gpu_memory_utilization, # 0.9
distributed_executor_backend="ray",
max_model_len=context_window)
After 30 min of no requests to my API I get the timed out error:
...
(RayWorkerWrapper pid=3171083) ERROR 06-20 14:03:38 worker_base.py:148] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
...
ERROR 06-24 07:07:08 worker_base.py:148] RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.2]:27025
@JaheimLee I've been unable to reproduce this so far. Both waiting for > 30 min after the server starts, and then waiting after making a request. Have you made any custom changes? Are you using the out-of-the-box OpenAI server?
Same as @enkiid . And I also got a warning using the latest source code [WARNING 06-24 20:44:26 shm_broadcast.py:214] No available block found in 60 second.
@JaheimLee when you use the latest source code, does it still timeout after 30 mins? The warning you see is expected, it literally just tells you there are no requests in 60s. I assume https://github.com/vllm-project/vllm/pull/5399 should fix the 30min timeout issue.
@JaheimLee when you use the latest source code, does it still timeout after 30 mins? The warning you see is expected, it literally just tells you there are no requests in 60s. I assume #5399 should fix the 30min timeout issue.
seems no timeout error now
closing as #5399 should fix this.
The problem is that the workers should never be waiting on the broadcast operation indefinitely so I don't think #5399 will fix it. They'll just be waiting in an shm broadcast operation rather than torch.distributed.
I'm hoping that https://github.com/vllm-project/vllm/pull/5987 will address it though.
this blocks our Prod work (reproducible w/ 0.5.1), pls consider prioritizing cc @WoosukKwon
@nightflight-dk could you provide more details about how you are running vLLM, how the problem is manifesting in your case, and confirm that you are on 0.5.1 (or later)? This problem should be fixed by https://github.com/vllm-project/vllm/pull/5987 which is in 0.5.1
hi @njhill, sure. tensor parallel deployment on DGX clusters w/ 8xA100 GPUs - Azure ML managed endpoints. vllm 0.5.0.post1 - the deployment failed to process an incoming request with above symptoms after a period of inactivity. about to retry with 0.5.2, thank you for your efforts
Thanks @nightflight-dk, yes as mentioned this issue would be expected on 0.5.0.post1. And I assume you're using LLM
or LLMEngine
rather than AsyncLLMEngine
or the OpenAI server?
Correct @njhill, LLMEngine (would like to switch to AsyncLLMEngine from Triton within a month).
Unfortunately switch to 0.5.2 (again with punica and tensor parallelism enabled) hangs loading a 7B model, import warning points to https://stackoverflow.com/questions/65120136/lib64-libc-so-6-version-glibc-2-32-not-found
libc6=2.31 cuda12.2 nvidia-driver=535.183 nccl=2.20.5 (ubuntu 20)
Meanwhile switch to 0.5.1 appears blocked by below:
[rank7]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 7] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f699a57a897 in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f699a52ab25 in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f699a9a7718 in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f694e4738e6 in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f694e4779e8 in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f694e47d05c in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f694e47ddcc in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f699b0b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f699e9b2ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f699ea44850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
nranks 8 cudaDev 3 nvmlDev 3 busId 400000 commId 0xadd71877c1f309c6 - Init COMPLETE
My production env is the same as @nightflight-dk and my code env is v0.5.2 During the prefilling stage, it turns out the error which is not quite similar to Damian's above, but the result is my vllm env hangs over. the prompt length is about 128K, and my model is qwen2-72b, maybe my prompt is too long which causes the timout of nccl collective operation?
(VllmWorkerProcess pid=143) WARNING 07-19 17:03:17 shm_broadcast.py:394] No available block found in 60 second.
(VllmWorkerProcess pid=147) WARNING 07-19 17:03:17 shm_broadcast.py:394] No available block found in 60 second.
[rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7402, OpType=GATHER, NumelIn=57024, NumelOut=0, Timeout(ms)=600000) ran for 600008 milliseconds before timing out.
[rank7]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 7] Timeout at NCCL work: 7402, last enqueued NCCL work: 7402, last completed NCCL work: 7401.
[rank7]:[E ProcessGroupNCCL.cpp:577] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E ProcessGroupNCCL.cpp:583] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7402, OpType=GATHER, NumelIn=57024, NumelOut=0, Timeout(ms)=600000) ran for 600008 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4c7bacf897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4c2f84bc62 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4c2f850a80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4c2f851dcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x7f4c7b2b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f4c7c7b6ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f4c7c848850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Can you help me with this problem? Thanks. @njhill @youkaichao
Ubuntu 24.04 OK
libc6==2.39-0ubuntu8.2,
CUDA 12.4
NCCL==2.20.5,
GPUs=2x4090,
lora_enabled = True,
tensor_parallel_size=2
vllm==0.5.2
Ubuntu 22.04 OK
libc6==2.35-0ubuntu3.8,
CUDA 12.4
NCCL==2.20.5,
GPUs=2x4090,
lora_enabled = True,
tensor_parallel_size=2
vllm==0.5.2
issues point to ubuntu 20.04 w/ old libc6. Working Ubuntu 22.04 image deployed on a docker host w/ 20.04 (unfortunately current default w/ some cloud providers) seems affected as well.
@ehuaa leveraging tensor parallelism inside docker we're exposing ourselves as it turns out, to two issues: glibc conflicts and (in your case I believe) assigned shared memory limits. If you control the docker host, you're luckier than me. Bumping the memory at the time of running the container is a matter of parameter to docker run. Might help or you might deal with glibc next like me. Can you confirm the OS and version of libc6 for the host and container in your setup? @njhill Nick kindly please confirm the minimum version of glibc required to run vllm 0.5.2? I am hoping it's <=0.31. We can continue part of separate issue. thanks
@nightflight-dk I'm actually not sure, it would be best to open a dedicated issue related to that (or use the other one you linked from if that's similar).
@ehuaa your issue also looks like something different from the original one. I'd suggest to open an issue specifically for that, with more details including more of the log output. There is detailed method tracing that can be enabled to help debug these kind of cuda/nccl issues.
I'm going to close this one since I'm fairly certain now the original reported problem with the distributed timeout is fixed (in 0.5.1 onward).
Seeing this issue again when sending multiple asynchronous requests.
Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
I am running openai server using vllm/vllm-openai:v0.5.3.post1 with the following arguments:
--model llava-hf/llava-v1.6-34b-hf --tensor-parallel-size 2 --distributed-executor-backend=ray --gpu-memory-utilization=.9
Your current environment
PyTorch version: 2.3.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64) GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 Clang version: Could not collect CMake version: version 3.29.3 Libc version: glibc-2.31
Python version: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-5.8.0-43-generic-x86_64-with-glibc2.17 Is CUDA available: True CUDA runtime version: Could not collect CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA A800 80GB PCIe GPU 1: NVIDIA A800 80GB PCIe Versions of relevant libraries: [pip3] numpy==1.24.4 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] torch==2.3.0 [pip3] triton==2.3.0 [pip3] vllm_nccl_cu12==2.18.1.0.4.0 [conda] numpy 1.24.4 pypi_0 pypi [conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi [conda] torch 2.3.0 pypi_0 pypi [conda] triton 2.3.0 pypi_0 pypi [conda] vllm-nccl-cu12 2.18.1.0.4.0 pypi_0 pypiROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.4.2 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
🐛 Describe the bug
Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution. [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] Traceback (most recent call last): [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] File "vllms/vllm/worker/worker_base.py", line 138, in execute_method [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] return executor(*args, kwargs) [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] return func(*args, *kwargs) [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] File "vllm/worker/worker.py", line 286, in start_worker_execution_loop [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] while self._execute_model_non_driver(): [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] File "vllm/worker/worker.py", line 295, in _execute_model_non_driver [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] data = broadcast_tensor_dict(src=0) [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] File "vllm/distributed/communication_op.py", line 284, in broadcast_tensor_dict [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] torch.distributed.broadcast_object_list(recv_metadata_list, [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] return func(args, kwargs) [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] broadcast(object_sizes_tensor, src=src, group=group) [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] return func(*args, **kwargs) [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] work.wait() [36m(RayWorkerWrapper pid=3957362)[0m ERROR 05-28 16:05:41 worker_base.py:146] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete