vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: The vllm is disconnected after running for some time #5084

Closed zxcdsa45687 closed 2 months ago

zxcdsa45687 commented 4 months ago

Your current environment

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.31

Python version: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.8.0-43-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800 80GB PCIe
GPU 1: NVIDIA A800 80GB PCIe
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] vllm_nccl_cu12==2.18.1.0.4.0
[conda] numpy 1.24.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
[conda] vllm-nccl-cu12 2.18.1.0.4.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

🐛 Describe the bug

Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] Traceback (most recent call last):
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "vllms/vllm/worker/worker_base.py", line 138, in execute_method
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     return func(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "vllm/worker/worker.py", line 286, in start_worker_execution_loop
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     while self._execute_model_non_driver():
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "vllm/worker/worker.py", line 295, in _execute_model_non_driver
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     data = broadcast_tensor_dict(src=0)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "vllm/distributed/communication_op.py", line 284, in broadcast_tensor_dict
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     torch.distributed.broadcast_object_list(recv_metadata_list,
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     return func(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     broadcast(object_sizes_tensor, src=src, group=group)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     return func(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     work.wait()
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete

thirteenflt commented 4 months ago

I ran into the same issue with vllm==0.4.3, with only one NCCL version installed (nvidia-nccl-cu12==2.20.5).

754807266 commented 3 months ago

I hit the same problem: Timed out waiting 1800000ms.

(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] Traceback (most recent call last):
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]   File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]   File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]     return func(*args, **kwargs)
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]   File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/vllm/worker/worker.py", line 286, in start_worker_execution_loop
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]     while self._execute_model_non_driver():
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]   File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/vllm/worker/worker.py", line 295, in _execute_model_non_driver
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]     data = broadcast_tensor_dict(src=0)
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]   File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/vllm/distributed/communication_op.py", line 284, in broadcast_tensor_dict
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]     torch.distributed.broadcast_object_list(recv_metadata_list,
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]   File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]     return func(*args, **kwargs)
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]   File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]     broadcast(object_sizes_tensor, src=src, group=group)
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]   File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]     return func(*args, **kwargs)
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]   File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148]     work.wait()
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
(RayWorkerWrapper pid=37200) INFO 06-12 14:42:33 model_runner.py:924] Graph capturing finished in 11 secs. [repeated 2x across cluster]

JaheimLee commented 3 months ago

Same problem with vllm==0.5.0.post1.

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-17 14:51:19 gptq_marlin.py:134] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 06-17 14:51:19 config.py:628] Defaulting to use mp for distributed inference
INFO 06-17 14:51:19 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/data/pretrained_models/Qwen2-72B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/data/pretrained_models/Qwen2-72B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=5000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/data/pretrained_models/Qwen2-72B-Instruct-GPTQ-Int4)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=1078909) INFO 06-17 14:51:22 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1078911) INFO 06-17 14:51:22 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1078910) INFO 06-17 14:51:22 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1078911) INFO 06-17 14:51:22 utils.py:645] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1078909) INFO 06-17 14:51:22 utils.py:645] Found nccl from library libnccl.so.2
INFO 06-17 14:51:22 utils.py:645] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1078910) INFO 06-17 14:51:22 utils.py:645] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1078911) INFO 06-17 14:51:22 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=1078909) INFO 06-17 14:51:22 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=1078910) INFO 06-17 14:51:22 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 06-17 14:51:22 pynccl.py:63] vLLM is using nccl==2.20.5
WARNING 06-17 14:51:22 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1078910) WARNING 06-17 14:51:22 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1078911) WARNING 06-17 14:51:22 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1078909) WARNING 06-17 14:51:22 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1078910) INFO 06-17 14:51:26 model_runner.py:160] Loading model weights took 9.7410 GB
(VllmWorkerProcess pid=1078909) INFO 06-17 14:51:26 model_runner.py:160] Loading model weights took 9.7410 GB
(VllmWorkerProcess pid=1078911) INFO 06-17 14:51:26 model_runner.py:160] Loading model weights took 9.7410 GB
INFO 06-17 14:51:26 model_runner.py:160] Loading model weights took 9.7410 GB
INFO 06-17 14:51:33 distributed_gpu_executor.py:56] # GPU blocks: 9267, # CPU blocks: 3276
Server start!
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete, Traceback (most recent call last):
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]              ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/worker/worker.py", line 294, in start_worker_execution_loop
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     while self._execute_model_non_driver():
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/worker/worker.py", line 303, in _execute_model_non_driver
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     data = broadcast_tensor_dict(src=0)
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/distributed/parallel_state.py", line 400, in broadcast_tensor_dict
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     torch.distributed.broadcast_object_list(recv_metadata_list,
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     broadcast(object_sizes_tensor, src=src, group=group)
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     work.wait()
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] 
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete, Traceback (most recent call last):
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]              ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/worker/worker.py", line 294, in start_worker_execution_loop
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     while self._execute_model_non_driver():
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/worker/worker.py", line 303, in _execute_model_non_driver
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     data = broadcast_tensor_dict(src=0)
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/distributed/parallel_state.py", line 400, in broadcast_tensor_dict
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     torch.distributed.broadcast_object_list(recv_metadata_list,
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     broadcast(object_sizes_tensor, src=src, group=group)
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     work.wait()
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] 
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete, Traceback (most recent call last):
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]              ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/worker/worker.py", line 294, in start_worker_execution_loop
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     while self._execute_model_non_driver():
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/worker/worker.py", line 303, in _execute_model_non_driver
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     data = broadcast_tensor_dict(src=0)
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/distributed/parallel_state.py", line 400, in broadcast_tensor_dict
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     torch.distributed.broadcast_object_list(recv_metadata_list,
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     broadcast(object_sizes_tensor, src=src, group=group)
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226]     work.wait()
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] 
[2024-06-17 15:38:08 +0800] [1078766] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/online_chat/chat_fastapi_vllm.py", line 171, in predict
    else:
        ^^
  File "/data/lijinghui/online_chat/chat_fastapi_vllm.py", line 97, in inference
    def inference(question, res):
                    ^^^^^^^^^^^^^
  File "/data/lijinghui/online_chat/chat_fastapi_vllm.py", line 88, in run
    query = build_query(kwargs["model_path"], tokenizer, question, history)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/utils.py", line 699, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/entrypoints/llm.py", line 304, in generate
    outputs = self._run_engine(use_tqdm=use_tqdm)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/entrypoints/llm.py", line 556, in _run_engine
    step_outputs = self.llm_engine.step()
                   ^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/engine/llm_engine.py", line 776, in step
    output = self.model_executor.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/executor/distributed_gpu_executor.py", line 76, in execute_model
    return self._driver_execute_model(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/executor/multiproc_gpu_executor.py", line 84, in _driver_execute_model
    return self.driver_worker.execute_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/worker/worker.py", line 272, in execute_model
    broadcast_tensor_dict(data, src=0)
  File "/data/lijinghui/vllm/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
    return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/distributed/parallel_state.py", line 374, in broadcast_tensor_dict
    torch.distributed.broadcast_object_list([metadata_list],
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [192.168.12.71]:51841
youkaichao commented 3 months ago

When does the timeout happen? Is it because the vLLM instance is idle for too long?

JaheimLee commented 3 months ago

when does the timeout happen? Is it because the vLLM instance is idle for too long?

Yes, I need to make a request every 1800s in the background to keep it alive. But I have never encountered this problem before.
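
For anyone needing a stopgap in the meantime, a background keep-alive request works. A minimal sketch, assuming an OpenAI-compatible vLLM server on localhost:8000; the URL, model name, and interval are placeholders to adapt to your own deployment:

import threading
import time

import requests


def keep_alive(url="http://localhost:8000/v1/completions",
               model="my-served-model",   # placeholder model name
               interval_s=900):           # well under the 1800 s gloo timeout
    # Periodically send a tiny generation request so the distributed
    # workers never sit idle long enough to hit the broadcast timeout.
    while True:
        try:
            requests.post(url,
                          json={"model": model, "prompt": "ping", "max_tokens": 1},
                          timeout=30)
        except requests.RequestException as exc:
            print(f"keep-alive request failed: {exc}")
        time.sleep(interval_s)


threading.Thread(target=keep_alive, daemon=True).start()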

754807266 commented 3 months ago

when does the timeout happen? is it because the vLLM instance is idle for too long?

I run an LLM server in a test environment that receives no requests overnight. The next morning I find that vLLM and the local model have timed out, which causes a service exception.

youkaichao commented 3 months ago

cc @njhill

njhill commented 3 months ago

I haven't seen this before but will try to reproduce. Does it happen every time you start a server and then don't send any requests for > 30 minutes?

njhill commented 3 months ago

@JaheimLee I've been unable to reproduce this so far, both by waiting > 30 min after the server starts and by waiting after making a request. Have you made any custom changes? Are you using the out-of-the-box OpenAI server?

enkiid commented 3 months ago

I have the same issue with vllm==0.5.0.post1 when using the LLM class:

from vllm import LLM
model = LLM(model=model_name,  # e.g. mistralai/Mixtral-8x7B-Instruct-v0.1
            tokenizer=model_name,  # HF tokenizer, e.g. mistralai/Mixtral-8x7B-Instruct-v0.1
            trust_remote_code=False,
            dtype=dtype,  # torch.bfloat16
            tensor_parallel_size=tensor_parallel_size,  # 4
            gpu_memory_utilization=gpu_memory_utilization,  # 0.9
            distributed_executor_backend="ray",
            max_model_len=context_window)

After 30 minutes with no requests to my API, I get the timeout error:

...
(RayWorkerWrapper pid=3171083) ERROR 06-20 14:03:38 worker_base.py:148] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete 
...
ERROR 06-24 07:07:08 worker_base.py:148] RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.2]:27025
JaheimLee commented 3 months ago

@JaheimLee I've been unable to reproduce this so far. Both waiting for > 30 min after the server starts, and then waiting after making a request. Have you made any custom changes? Are you using the out-of-the-box OpenAI server?

Same as @enkiid. And I also got a warning using the latest source code: "WARNING 06-24 20:44:26 shm_broadcast.py:214] No available block found in 60 second."

youkaichao commented 3 months ago

@JaheimLee when you use the latest source code, does it still time out after 30 minutes? The warning you see is expected; it literally just tells you there have been no requests for 60 s. I assume https://github.com/vllm-project/vllm/pull/5399 should fix the 30-minute timeout issue.

JaheimLee commented 3 months ago

@JaheimLee when you use the latest source code, does it still timeout after 30 mins? The warning you see is expected, it literally just tells you there are no requests in 60s. I assume #5399 should fix the 30min timeout issue.

It seems there is no timeout error now.

youkaichao commented 3 months ago

Closing, as #5399 should fix this.

njhill commented 3 months ago

The problem is that the workers should never be waiting on the broadcast operation indefinitely, so I don't think #5399 will fix it. They'll just be waiting in a shm broadcast operation rather than in torch.distributed.

I'm hoping that https://github.com/vllm-project/vllm/pull/5987 will address it though.
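
For context, the 1800000 ms in the error is torch.distributed's default 30-minute process-group timeout, which an idle rank blocked in broadcast_object_list eventually hits. A standalone sketch (not vLLM's actual wiring) of where that timeout is set and how it could be raised in a plain PyTorch script:

from datetime import timedelta

import torch.distributed as dist

# The default timeout is timedelta(minutes=30), i.e. the 1800000 ms seen above.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",  # placeholder rendezvous address
    rank=0,
    world_size=1,
    timeout=timedelta(hours=2),
)
dist.destroy_process_group()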

nightflight-dk commented 2 months ago

This blocks our production work (reproducible with 0.5.1). Please consider prioritizing. cc @WoosukKwon

njhill commented 2 months ago

@nightflight-dk could you provide more details about how you are running vLLM, how the problem is manifesting in your case, and confirm that you are on 0.5.1 (or later)? This problem should be fixed by https://github.com/vllm-project/vllm/pull/5987, which is in 0.5.1.
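
A quick way to confirm which build is actually running in the environment (a trivial sketch; the fix referenced above is expected in 0.5.1 and later):

import vllm

print(vllm.__version__)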

nightflight-dk commented 2 months ago

Hi @njhill, sure: a tensor-parallel deployment on DGX clusters with 8x A100 GPUs (Azure ML managed endpoints), vllm 0.5.0.post1. The deployment failed to process an incoming request with the above symptoms after a period of inactivity. I'm about to retry with 0.5.2. Thank you for your efforts.

njhill commented 2 months ago

Thanks @nightflight-dk; yes, as mentioned, this issue would be expected on 0.5.0.post1. And I assume you're using LLM or LLMEngine rather than AsyncLLMEngine or the OpenAI server?

nightflight-dk commented 2 months ago

Correct @njhill, LLMEngine (would like to switch to AsyncLLMEngine from Triton within a month).

Unfortunately, switching to 0.5.2 (again with Punica and tensor parallelism enabled) hangs while loading a 7B model; an import warning points to https://stackoverflow.com/questions/65120136/lib64-libc-so-6-version-glibc-2-32-not-found

libc6==2.31, CUDA 12.2, nvidia-driver==535.183, NCCL==2.20.5 (Ubuntu 20.04)

Meanwhile, switching to 0.5.1 appears blocked by the following:

[rank7]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 7] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f699a57a897 in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f699a52ab25 in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f699a9a7718 in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f694e4738e6 in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f694e4779e8 in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f694e47d05c in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f694e47ddcc in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f699b0b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f699e9b2ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f699ea44850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

 nranks 8 cudaDev 3 nvmlDev 3 busId 400000 commId 0xadd71877c1f309c6 - Init COMPLETE
ehuaa commented 2 months ago

My production environment is the same as @nightflight-dk's, and my code is on v0.5.2. During the prefill stage an error occurs that is not quite the same as Damian's above, but the result is that my vLLM instance hangs. The prompt length is about 128K and my model is Qwen2-72B; maybe the prompt is too long, which causes the timeout of the NCCL collective operation?

(VllmWorkerProcess pid=143) WARNING 07-19 17:03:17 shm_broadcast.py:394] No available block found in 60 second. 
(VllmWorkerProcess pid=147) WARNING 07-19 17:03:17 shm_broadcast.py:394] No available block found in 60 second. 
[rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7402, OpType=GATHER, NumelIn=57024, NumelOut=0, Timeout(ms)=600000) ran for 600008 milliseconds before timing out.
[rank7]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 7] Timeout at NCCL work: 7402, last enqueued NCCL work: 7402, last completed NCCL work: 7401.
[rank7]:[E ProcessGroupNCCL.cpp:577] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E ProcessGroupNCCL.cpp:583] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7402, OpType=GATHER, NumelIn=57024, NumelOut=0, Timeout(ms)=600000) ran for 600008 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4c7bacf897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4c2f84bc62 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4c2f850a80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4c2f851dcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x7f4c7b2b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f4c7c7b6ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f4c7c848850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Can you help me with this problem? Thanks. @njhill @youkaichao

nightflight-dk commented 2 months ago

Ubuntu 24.04: OK

libc6==2.39-0ubuntu8.2
CUDA 12.4
NCCL==2.20.5
GPUs: 2x 4090
lora_enabled=True
tensor_parallel_size=2
vllm==0.5.2

Ubuntu 22.04: OK

libc6==2.35-0ubuntu3.8
CUDA 12.4
NCCL==2.20.5
GPUs: 2x 4090
lora_enabled=True
tensor_parallel_size=2
vllm==0.5.2

The issues point to Ubuntu 20.04 with its old libc6. A working Ubuntu 22.04 image deployed on a Docker host running 20.04 (unfortunately still the default with some cloud providers) seems affected as well.
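
A small sketch for checking which glibc the container's Python actually links against, since the host and the image can differ:

import platform

lib, version = platform.libc_ver()
print(lib, version)  # e.g. ('glibc', '2.31') on Ubuntu 20.04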

nightflight-dk commented 2 months ago

@ehuaa It turns out that by leveraging tensor parallelism inside Docker we expose ourselves to two issues: glibc conflicts and (in your case, I believe) the container's shared-memory limit. If you control the Docker host, you're luckier than me: bumping the shared memory is just a matter of passing a parameter (e.g. --shm-size) to docker run. That might help, or you might run into glibc next, like me. Can you confirm the OS and libc6 version for both the host and the container in your setup? @njhill Nick, could you please confirm the minimum glibc version required to run vllm 0.5.2? I am hoping it's <= 2.31. We can continue this in a separate issue. Thanks.
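
As a quick sanity check of the shared-memory limit inside the container (Docker's default /dev/shm is only 64 MB unless raised with --shm-size), something along these lines works:

import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {total / 1024**2:.0f} MiB total, {free / 1024**2:.0f} MiB free")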

njhill commented 2 months ago

@nightflight-dk I'm actually not sure, it would be best to open a dedicated issue related to that (or use the other one you linked from if that's similar).

@ehuaa your issue also looks like something different from the original one. I'd suggest opening an issue specifically for that, with more details including more of the log output. There is detailed method tracing that can be enabled to help debug these kinds of CUDA/NCCL issues.

I'm going to close this one since I'm fairly certain now the original reported problem with the distributed timeout is fixed (in 0.5.1 onward).

ktrapeznikov commented 2 months ago

Seeing this issue again when sending multiple asynchronous requests: Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.

I am running the OpenAI server using vllm/vllm-openai:v0.5.3.post1 with the following arguments: --model llava-hf/llava-v1.6-34b-hf --tensor-parallel-size 2 --distributed-executor-backend=ray --gpu-memory-utilization=.9