vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Error happens in async_llm_engine when using multiple GPUs #3839

Open for-just-we opened 4 months ago

for-just-we commented 4 months ago

Your current environment

PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.11.8 (main, Feb 26 2024, 21:39:34) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-97-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
GPU 3: NVIDIA RTX A6000
GPU 4: NVIDIA RTX A6000
GPU 5: NVIDIA RTX A6000
GPU 6: NVIDIA RTX A6000
GPU 7: NVIDIA RTX A6000

Nvidia driver version: 535.161.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian 
Address sizes:                      43 bits physical, 48 bits virtual
CPU(s):                             128
On-line CPU(s) list:                0-127
Thread(s) per core:                 2
Core(s) per socket:                 32
Socket(s):                          2
NUMA node(s):                       2
Vendor ID:                          AuthenticAMD  
CPU family:                         25
Model:                              1
Model name:                         AMD EPYC 7543 32-Core Processor
Stepping:                           1
Frequency boost:                    enabled
CPU MHz:                            1500.000
CPU max MHz:                        3737.8899
CPU min MHz:                        1500.0000
BogoMIPS:                           5599.97
Virtualization:                     AMD-V
L1d cache:                          2 MiB
L1i cache:                          2 MiB
L2 cache:                           32 MiB
L3 cache:                           512 MiB
NUMA node0 CPU(s):                  0-31,64-95
NUMA node1 CPU(s):                  32-63,96-127  
Vulnerability Gather data sampling: Not affected  
Vulnerability Itlb multihit:        Not affected  
Vulnerability L1tf:                 Not affected  
Vulnerability Mds:                  Not affected  
Vulnerability Meltdown:             Not affected  
Vulnerability Mmio stale data:      Not affected  
Vulnerability Retbleed:             Not affected  
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected  
Vulnerability Tsx async abort:      Not affected  
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm sme sev sev_es

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.1.2                    pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.3.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU1    NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU2    NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU3    NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU4    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    32-63,96-127    1               N/A
GPU5    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    32-63,96-127    1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    32-63,96-127    1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      32-63,96-127    1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

The command for running the OpenAI API server is:

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server --tensor-parallel-size 4 --served-model-name Qwen1.5-72B-Chat --model ../Qwen1.5-72B-Chat --port 8989 --max-model-len 14500 --gpu-memory-utilization 0.96

🐛 Describe the bug

I query the OpenAI server with 10 concurrent client threads. At first it works fine, but after a little while the server just shuts down.
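The queries look roughly like the following (a minimal sketch for illustration only; it assumes the openai>=1.0 Python client, and the prompt is a placeholder rather than the actual request content):

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# The server above is exposed on port 8989 with served model name Qwen1.5-72B-Chat.
client = OpenAI(base_url="http://localhost:8989/v1", api_key="EMPTY")

def query(i: int) -> str:
    # Placeholder prompt; the real requests are application-specific.
    resp = client.chat.completions.create(
        model="Qwen1.5-72B-Chat",
        messages=[{"role": "user", "content": f"Request {i}: please say hello."}],
    )
    return resp.choices[0].message.content

# Roughly 10 client threads issuing requests concurrently, as described above.
with ThreadPoolExecutor(max_workers=10) as pool:
    for answer in pool.map(query, range(100)):
        print(answer[:80])
```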

The error trace is:

(RayWorkerVllm pid=1219581) [E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=232109, OpType=ALLREDUCE, NumelIn=77111296, NumelOut=77111296, Timeout(ms)=1800000) ran for 1800849 milliseconds before timing out.
(RayWorkerVllm pid=1219581) [E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=76284, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800669 milliseconds before timing out.
(RayWorkerVllm pid=1219208) [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorkerVllm pid=1219208) [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
(RayWorkerVllm pid=1219208) [E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=232109, OpType=ALLREDUCE, NumelIn=77111296, NumelOut=77111296, Timeout(ms)=1800000) ran for 1800455 milliseconds before timing out.
(RayWorkerVllm pid=1219208) [2024-04-04 00:59:27,638 E 1219208 1220051] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=232109, OpType=ALLREDUCE, NumelIn=77111296, NumelOut=77111296, Timeout(ms)=1800000) ran for 1800455 milliseconds before timing out.
(RayWorkerVllm pid=1219208) [2024-04-04 00:59:27,667 E 1219208 1220051] logging.cc:104: Stack trace:
(RayWorkerVllm pid=1219208)  /server9/cbj/programming/anaconda3/envs/vllm_server/lib/python3.11/site-packages/ray/_raylet.so(+0xfe93da) [0x7fe25ddfa3da] ray::operator<<()
(RayWorkerVllm pid=1219208) /server9/cbj/programming/anaconda3/envs/vllm_server/lib/python3.11/site-packages/ray/_raylet.so(+0xfebb18) [0x7fe25ddfcb18] ray::TerminateHandler()
(RayWorkerVllm pid=1219208) /server9/cbj/programming/anaconda3/envs/vllm_server/bin/../lib/libstdc++.so.6(+0xb135a) [0x7fe25cc8f35a] __cxxabiv1::__terminate()
(RayWorkerVllm pid=1219208) /server9/cbj/programming/anaconda3/envs/vllm_server/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7fe25cc8f3c5]
(RayWorkerVllm pid=1219208) /server9/cbj/programming/anaconda3/envs/vllm_server/bin/../lib/libstdc++.so.6(+0xb134f) [0x7fe25cc8f34f]
(RayWorkerVllm pid=1219208) /server9/cbj/programming/anaconda3/envs/vllm_server/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(+0xc86f5b) [0x7fdbb912df5b] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorkerVllm pid=1219208) /server9/cbj/programming/anaconda3/envs/vllm_server/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7fe25ccb9bf4] execute_native_thread_routine
(RayWorkerVllm pid=1219208) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fe25f325609] start_thread
(RayWorkerVllm pid=1219208) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fe25f0f0353] __clone
(RayWorkerVllm pid=1219208)
(RayWorkerVllm pid=1219208) *** SIGABRT received at time=1712163567 on cpu 21 ***
(RayWorkerVllm pid=1219208) PC: @     0x7fe25f01400b  (unknown)  raise
(RayWorkerVllm pid=1219208)     @     0x7fe25f331420       3792  (unknown)
(RayWorkerVllm pid=1219208)     @     0x7fe25cc8f35a  (unknown)  __cxxabiv1::__terminate()
(RayWorkerVllm pid=1219208)     @     0x7fe25cc8f070  (unknown)  (unknown)
(RayWorkerVllm pid=1219208) [2024-04-04 00:59:27,667 E 1219208 1220051] logging.cc:361: *** SIGABRT received at time=1712163567 on cpu 21 ***
(RayWorkerVllm pid=1219208) [2024-04-04 00:59:27,667 E 1219208 1220051] logging.cc:361: PC: @     0x7fe25f01400b  (unknown)  raise
(RayWorkerVllm pid=1219208) [2024-04-04 00:59:27,667 E 1219208 1220051] logging.cc:361:     @     0x7fe25f331420       3792  (unknown)
(RayWorkerVllm pid=1219208) [2024-04-04 00:59:27,667 E 1219208 1220051] logging.cc:361:     @     0x7fe25cc8f35a  (unknown)  __cxxabiv1::__terminate()
(RayWorkerVllm pid=1219208) [2024-04-04 00:59:27,668 E 1219208 1220051] logging.cc:361:     @     0x7fe25cc8f070  (unknown)  (unknown)
(RayWorkerVllm pid=1219208) Fatal Python error: Aborted
(RayWorkerVllm pid=1219208)
(RayWorkerVllm pid=1219208)
(RayWorkerVllm pid=1219208) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, sentencepiece._sentencepiece, cupy_backends.cuda.api._runtime_enum, cupy_backends.cuda.api.runtime, cupy_backends.cuda.stream, cupy_backends.cuda.libs.cublas, cupy_backends.cuda.libs.cusolver, cupy_backends.cuda._softlink, cupy_backends.cuda.libs.cusparse, cupy._util, cupy.cuda.device, fastrlock.rlock, cupy.cuda.memory_hook, cupy.cuda.graph, cupy.cuda.stream, cupy_backends.cuda.api._driver_enum, cupy_backends.cuda.api.driver, cupy.cuda.memory, cupy._core.internal, cupy._core._carray, cupy.cuda.texture, cupy.cuda.function, cupy_backends.cuda.libs.nvrtc, cupy.cuda.jitify, cupy.cuda.pinned_memory, cupy_backends.cuda.libs.curand, cupy_backends.cuda.libs.profiler, cupy.cuda.common, cupy.cuda.cub, cupy_backends.cuda.libs.nvtx, cupy.cuda.thrust, cupy._core._dtype, cupy._core._scalar, cupy._core._accelerator, cupy._core._memory_range, cupy._core._fusion_thread_local, cupy._core._kernel, cupy._core._routines_manipulation, cupy._core._optimize_config, cupy._core._cub_reduction, cupy._core._reduction, cupy._core._routines_binary, cupy._core._routines_math, cupy._core._routines_indexing, cupy._core._routines_linalg, cupy._core._routines_logic, cupy._core._routines_sorting, cupy._core._routines_statistics, cupy._core.dlpack, cupy._core.flags, cupy._core.core, cupy._core._fusion_variable, cupy._core._fusion_trace, cupy._core._fusion_kernel, cupy._core.new_fusion, cupy._core.fusion, cupy._core.raw, cupyx.cusolver, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, cupy.cuda.cufft, cupy.fft._cache, cupy.fft._callback, cupy.random._generator_api, cupy.random._bit_generator, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, cupy.lib._polynomial, cupy_backends.cuda.libs.nccl, regex._regex, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, 
scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._direct (total: 150)
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffee97aab52aa28d67515e978d01000000 Worker ID: f270860e50a7041eb18fb5766d00787785a6ad65947c6f1d3ac89be4 Node ID: 35942d5221ce2280d6b6106db01a2c1e627118aab3db5638df891f8f Worker IP address: 10.96.184.35 Worker port: 41841 Worker PID: 1219208 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(RayWorkerVllm pid=1219581) INFO 04-04 00:13:58 model_runner.py:756] Graph capturing finished in 15 secs. [repeated 2x across cluster]
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f06ad2a2fc0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f0045e7f050>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f06ad2a2fc0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f0045e7f050>)>
Traceback (most recent call last):
  File "/server9/cbj/programming/anaconda3/envs/vllm_server/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
    task.result()
  File "/server9/cbj/programming/anaconda3/envs/vllm_server/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 414, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
                               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/server9/cbj/programming/anaconda3/envs/vllm_server/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 393, in engine_step
    request_outputs = await self.engine.step_async()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/server9/cbj/programming/anaconda3/envs/vllm_server/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 189, in step_async
    all_outputs = await self._run_workers_async(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/server9/cbj/programming/anaconda3/envs/vllm_server/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 276, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/server9/cbj/programming/anaconda3/envs/vllm_server/lib/python3.11/asyncio/tasks.py", line 694, in _wrap_awaitable
    return (yield from awaitable.__await__())
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: RayWorkerVllm
        actor_id: ee97aab52aa28d67515e978d01000000
        pid: 1219208
        namespace: aa510143-0ad2-4900-8bc2-8c914fcf8c84
        ip: 10.96.184.35
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/server9/cbj/programming/anaconda3/envs/vllm_server/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    raise exc
  File "/server9/cbj/programming/anaconda3/envs/vllm_server/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 33, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 04-04 00:59:28 async_llm_engine.py:133] Aborted request cmpl-d2cf52407f80409f815982cd5f109f97.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=76304, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800103 milliseconds before timing out.
INFO 04-04 00:59:29 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 61.7%, CPU KV cache usage: 0.0%
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=76304, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800103 milliseconds before timing out.
[2024-04-04 00:59:29,131 E 1212023 1219862] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=76304, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800103 milliseconds before timing out.
[2024-04-04 00:59:29,148 E 1212023 1219862] logging.cc:104: Stack trace:
 /server9/cbj/programming/anaconda3/envs/vllm_server/lib/python3.11/site-packages/ray/_raylet.so(+0xfe93da) [0x7f06ac9ae3da] ray::operator<<()
/server9/cbj/programming/anaconda3/envs/vllm_server/lib/python3.11/site-packages/ray/_raylet.so(+0xfebb18) [0x7f06ac9b0b18] ray::TerminateHandler()
/server9/cbj/programming/anaconda3/envs/vllm_server/bin/../lib/libstdc++.so.6(+0xb135a) [0x7f07c0dda35a] __cxxabiv1::__terminate()
/server9/cbj/programming/anaconda3/envs/vllm_server/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7f07c0dda3c5]
/server9/cbj/programming/anaconda3/envs/vllm_server/bin/../lib/libstdc++.so.6(+0xb134f) [0x7f07c0dda34f]
/server9/cbj/programming/anaconda3/envs/vllm_server/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(+0xc86f5b) [0x7f077c367f5b] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/server9/cbj/programming/anaconda3/envs/vllm_server/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7f07c0e04bf4] execute_native_thread_routine
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f07fcce6609] start_thread
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f07fcab1353] __clone

*** SIGABRT received at time=1712163569 on cpu 81 ***
PC: @     0x7f07fc9d500b  (unknown)  raise
    @     0x7f07fccf2420       3792  (unknown)
    @     0x7f07c0dda35a  (unknown)  __cxxabiv1::__terminate()
    @     0x7f07c0dda070  (unknown)  (unknown)
[2024-04-04 00:59:29,148 E 1212023 1219862] logging.cc:361: *** SIGABRT received at time=1712163569 on cpu 81 ***
[2024-04-04 00:59:29,148 E 1212023 1219862] logging.cc:361: PC: @     0x7f07fc9d500b  (unknown)  raise
[2024-04-04 00:59:29,148 E 1212023 1219862] logging.cc:361:     @     0x7f07fccf2420       3792  (unknown)
[2024-04-04 00:59:29,148 E 1212023 1219862] logging.cc:361:     @     0x7f07c0dda35a  (unknown)  __cxxabiv1::__terminate()
[2024-04-04 00:59:29,149 E 1212023 1219862] logging.cc:361:     @     0x7f07c0dda070  (unknown)  (unknown)
Fatal Python error: Aborted

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, yaml._yaml, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, markupsafe._speedups, cupy_backends.cuda.api._runtime_enum, cupy_backends.cuda.api.runtime, cupy_backends.cuda.stream, cupy_backends.cuda.libs.cublas, cupy_backends.cuda.libs.cusolver, cupy_backends.cuda._softlink, cupy_backends.cuda.libs.cusparse, cupy._util, cupy.cuda.device, fastrlock.rlock, cupy.cuda.memory_hook, cupy.cuda.graph, cupy.cuda.stream, cupy_backends.cuda.api._driver_enum, cupy_backends.cuda.api.driver, cupy.cuda.memory, cupy._core.internal, cupy._core._carray, cupy.cuda.texture, cupy.cuda.function, cupy_backends.cuda.libs.nvrtc, cupy.cuda.jitify, cupy.cuda.pinned_memory, cupy_backends.cuda.libs.curand, cupy_backends.cuda.libs.profiler, cupy.cuda.common, cupy.cuda.cub, cupy_backends.cuda.libs.nvtx, cupy.cuda.thrust, cupy._core._dtype, cupy._core._scalar, cupy._core._accelerator, cupy._core._memory_range, cupy._core._fusion_thread_local, cupy._core._kernel, cupy._core._routines_manipulation, cupy._core._optimize_config, cupy._core._cub_reduction, cupy._core._reduction, cupy._core._routines_binary, cupy._core._routines_math, cupy._core._routines_indexing, cupy._core._routines_linalg, cupy._core._routines_logic, cupy._core._routines_sorting, cupy._core._routines_statistics, cupy._core.dlpack, cupy._core.flags, cupy._core.core, cupy._core._fusion_variable, cupy._core._fusion_trace, cupy._core._fusion_kernel, cupy._core.new_fusion, cupy._core.fusion, cupy._core.raw, cupyx.cusolver, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, cupy.cuda.cufft, cupy.fft._cache, cupy.fft._callback, cupy.random._generator_api, cupy.random._bit_generator, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, cupy.lib._polynomial, cupy_backends.cuda.libs.nccl, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, 
scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._direct, httptools.parser.parser, httptools.parser.url_parser, websockets.speedups (total: 161)
Aborted (core dumped)
yudian0504 commented 4 months ago

+1

ericzhou571 commented 4 months ago

I've encountered the same issue with versions 0.4.0 and 0.4.0.post1, where the problem persists across multi-GPU setups. In contrast, version 0.3.3 continues to perform as expected, even after handling tens of thousands of requests over several days. I use the qwen1.5-72b model with a tensor parallelism (tp) of 4. It appears that the bug was introduced in the transition from 0.3.3 to 0.4.0 and remains unresolved in the 0.4.0.post1 update.

for-just-we commented 4 months ago

I retried running qwen1.5-32b-chat with 2 A6000s on vLLM 0.3.3. The error still happens, and after downgrading vLLM to 0.3.2 I ran into a different error. There is always an OutOfMemoryError accompanied by an AsyncError. Maybe I should try another deployment tool in multi-GPU environments.

linchen111 commented 4 months ago

+1

for-just-we commented 4 months ago

For now, I have tried a different deployment tool, sglang, which seems to work fine in multi-GPU environments and supports the OpenAI API on the client side. One thing that could be improved compared with models deployed by vLLM is that you need to specify max_tokens in the client request.
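For example, something like this on the client side (a rough sketch assuming the openai Python client; the endpoint and model name are placeholders, the explicit max_tokens is the point):

```python
from openai import OpenAI

# Placeholder endpoint and model name; adjust to your deployment.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen1.5-72B-Chat",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=512,  # needs to be set explicitly, per the note above
)
print(resp.choices[0].message.content)
```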

ericg108 commented 4 months ago

same here

hmellor commented 4 months ago

The process is killed by SIGKILL by OOM killer due to high memory usage.

Is your host running out of RAM and killing the Ray workers?
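One way to check is to watch host RAM while the server is under load and to look for OOM-killer records in the kernel log after a crash. A rough sketch (psutil appears in the environment dump above; reading dmesg may require root):

```python
import subprocess
import time

import psutil

def watch_host_memory(interval_s: float = 5.0, iterations: int = 12) -> None:
    """Print host RAM usage periodically while the server handles requests."""
    for _ in range(iterations):
        mem = psutil.virtual_memory()
        print(f"host RAM used: {mem.percent:.1f}% "
              f"({mem.used / 2**30:.1f} GiB of {mem.total / 2**30:.1f} GiB)")
        time.sleep(interval_s)

def grep_oom_killer() -> str:
    """Return kernel log lines mentioning OOM kills, if any (run after a crash)."""
    out = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
    return "\n".join(line for line in out.splitlines()
                     if "Out of memory" in line or "oom-kill" in line)

if __name__ == "__main__":
    watch_host_memory()
    print(grep_oom_killer() or "no OOM-killer entries found in dmesg")
```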

linchen111 commented 4 months ago

Mine works well on vLLM 0.3.3, using 2080Ti (22G) GPUs. Same machine, same requests.

changyuanzhangchina commented 3 months ago

Please refer to https://github.com/vllm-project/vllm/issues/4653

scutcyr commented 3 months ago

Has anyone solved this bug?

for-just-we commented 3 months ago

Following the previous commenter's suggestion, referring to #4653 seems to help.