vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

vLLM running on a Ray Cluster Hanging on Initializing #2826

Closed · Kaotic3 closed this 4 months ago

Kaotic3 commented 4 months ago

It isn't clear what is at fault here, whether it is vLLM or Ray.

There is a thread on the Ray forums that outlines the issue; it is 16 days old and has no replies.

https://discuss.ray.io/t/running-vllm-script-on-multi-node-cluster/13533

The following is taken from that thread, but the output is identical for me:

2024-01-24 13:57:17,308 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: HOST_IP_ADDRESS...
2024-01-24 13:57:17,317 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
INFO 01-24 13:57:39 llm_engine.py:70] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer='mistralai/Mistral-7B-Instruct-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)

But after that it hangs, and eventually quits.

I have exactly the same problem. The thread details the other points: "ray status" seems to show the nodes working and communicating, and it stays like this for an age before eventually crashing with some error messages. Everything in that thread is identical to what is happening for me.

Unfortunately, the Ray forums probably don't want to engage because it is vLLM, and I am concerned that vLLM won't want to engage because it is Ray...
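
For reference, this is roughly what I am running, reduced to a minimal sketch (it assumes `ray start --head` has already been run on the first machine and `ray start --address=<head-ip>:6379` on the second; the prompt is just a placeholder):

```python
from vllm import LLM, SamplingParams

# With a Ray cluster already running on this node, vLLM connects to it and
# places the tensor-parallel workers on the cluster's GPUs.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    tensor_parallel_size=2,
)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

It hangs before the engine ever finishes initializing, so the generate call is never reached.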

valentinp72 commented 4 months ago

Hi, I just started using vLLM two hours ago, and I had exactly the same issue. I managed to make it work by disabling NCCL P2P. For that, I exported NCCL_P2P_DISABLE=1.

Let me know if this solves your issue as well :)
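
A minimal sketch of how I am applying it, in case that helps (the model here is just the one from the original report; the variable has to be visible to every process that initializes NCCL, so on a multi-node cluster it likely also needs to be exported before `ray start` on each node):

```python
import os

# Disable NCCL peer-to-peer (P2P) transfers; set this before vLLM/NCCL starts.
os.environ["NCCL_P2P_DISABLE"] = "1"

from vllm import LLM

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", tensor_parallel_size=2)
```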

Kaotic3 commented 4 months ago

Thanks for the idea. I did try it, but it didn't work for me.

Same hanging issue, but I went off for dinner and came back to this message:

(RayWorkerVllm pid=7722, ip=.123) [E socket.cpp:922] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.1.1, 55251)

I think this is a new error message compared to the thread I linked, but googling didn't provide me with any great insight into fixing it.

Your search - "RayWorkerVllm" The client socket has timed out after 1800s while trying to connect - did not match any documents.

Which is always a little impressive, to be honest...

davidsyoung commented 4 months ago

I am experiencing the same at the moment. For me, it happens with GPTQ quantisation with tp=4.

I have tried the following settings / combinations of settings without any luck (the sketch below shows roughly how they are combined):

- NCCL_P2P_DISABLE=1
- disable_custom_all_reduce=True
- enforce_eager=True

Latest vLLM, compiled from source. It hangs at approx. 12995 MB of VRAM on each card across 4x RTX 3090, with a Llama 2 70B model.
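
For clarity, this is roughly how I am combining those three settings (a sketch, not my exact launch script; the GPTQ checkpoint name is a placeholder):

```python
import os

# Disable NCCL peer-to-peer transfers before anything initializes NCCL.
os.environ["NCCL_P2P_DISABLE"] = "1"

from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-70B-chat-GPTQ",  # placeholder 70B GPTQ checkpoint
    quantization="gptq",
    tensor_parallel_size=4,
    disable_custom_all_reduce=True,
    enforce_eager=True,
)
```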

Finally hung at this after approx 1h:

[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49, OpType=ALLREDUCE, NumelIn=268435456, NumelOut=268435456, Timeout(ms)=1800000) ran for 3747540 milliseconds before timing out.
(RayWorkerVllm pid=1486) [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48, OpType=ALLREDUCE, NumelIn=268435456, NumelOut=268435456, Timeout(ms)=1800000) ran for 3747549 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49, OpType=ALLREDUCE, NumelIn=268435456, NumelOut=268435456, Timeout(ms)=1800000) ran for 3747540 milliseconds before timing out.
[2024-02-10 00:06:27,464 E 13 1719] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49, OpType=ALLREDUCE, NumelIn=268435456, NumelOut=268435456, Timeout(ms)=1800000) ran for 3747540 milliseconds before timing out.
[2024-02-10 00:06:27,501 E 13 1719] logging.cc:104: Stack trace: 
 /opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0xfebb5a) [0x14af5a1ebb5a] ray::operator<<()
/opt/conda/lib/python3.10/site-packages/ray/_raylet.so(+0xfee298) [0x14af5a1ee298] ray::TerminateHandler()
/opt/conda/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6(+0xb135a) [0x14afcccb135a] __cxxabiv1::__terminate()
/opt/conda/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6(+0xb13c5) [0x14afcccb13c5]
/opt/conda/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6(+0xb134f) [0x14afcccb134f]
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so(+0xcc860b) [0x14af804c860b] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/opt/conda/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6(+0xdbbf4) [0x14afcccdbbf4] execute_native_thread_routine
/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x14b00f5e4609] start_thread
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x14b00f3af353] __clone

*** SIGABRT received at time=1707523587 on cpu 16 ***
PC: @     0x14b00f2d300b  (unknown)  raise
    @     0x14b00f5f0420       3792  (unknown)
    @     0x14afcccb135a  (unknown)  __cxxabiv1::__terminate()
    @     0x14afcccb1070  (unknown)  (unknown)
[2024-02-10 00:06:27,502 E 13 1719] logging.cc:361: *** SIGABRT received at time=1707523587 on cpu 16 ***
[2024-02-10 00:06:27,502 E 13 1719] logging.cc:361: PC: @     0x14b00f2d300b  (unknown)  raise
[2024-02-10 00:06:27,502 E 13 1719] logging.cc:361:     @     0x14b00f5f0420       3792  (unknown)
[2024-02-10 00:06:27,502 E 13 1719] logging.cc:361:     @     0x14afcccb135a  (unknown)  __cxxabiv1::__terminate()
[2024-02-10 00:06:27,502 E 13 1719] logging.cc:361:     @     0x14afcccb1070  (unknown)  (unknown)
Fatal Python error: Aborted

Extension modules: mkl._mklinit, mkl._py_mkl_service, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, gmpy2.gmpy2, regex._regex, scipy._lib._ccallback_c, yaml._yaml, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, _brotli, markupsafe._speedups, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, zstandard.backend_c, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.optimize._direct (total: 98)
ffolkes1911 commented 4 months ago

I tried the same options as above, and using Ray, but it did not help.

What did work was using a GPTQ model; it seems that only AWQ models hang (I only tried those two on multi-GPU).
EDIT: tested on TheBloke/Llama-2-13B-chat-AWQ and GPTQ.
EDIT 2: it seems this issue is about a Ray cluster, whereas I was just adding --tensor-parallel-size to vLLM, so it might be a different issue.

BilalKHA95 commented 4 months ago

I have the same issue. Did you find a solution? @ffolkes1911

jony0113 commented 4 months ago

I have a similar issue, but it eventually works after about 40 minutes. I have described the details in #2959.

Kaotic3 commented 4 months ago

Hey @WoosukKwon.

I just cloned the repo, built it, started Ray on two machines, and then initialized vLLM with a tensor parallel size of 4.

The result is that vLLM hangs and does not move past "Initializing an LLM engine with config: ...".

While that PR no doubt fixed some problem, it doesn't appear to have fixed this one: using a Ray cluster across two different machines still results in vLLM hanging and not starting.

viewv commented 3 months ago

I have this issue too, and I don't know how to fix it. Fixed: https://github.com/vllm-project/vllm/issues/2826#issuecomment-2014666364

thelongestusernameofall commented 3 months ago

> Hi, I just started using vLLM two hours ago, and I had exactly the same issue. I managed to make it work by disabling NCCL P2P. For that, I exported NCCL_P2P_DISABLE=1.
>
> Let me know if this solves your issue as well :)

export NCCL_P2P_DISABLE=1 worked for me. I'm using 8x A6000; loading the model with vLLM was hanging, followed by a core dump after a very long wait.

Thanks very much.

viewv commented 3 months ago

> Hi, I just started using vLLM two hours ago, and I had exactly the same issue. I managed to make it work by disabling NCCL P2P. For that, I exported NCCL_P2P_DISABLE=1. Let me know if this solves your issue as well :)
>
> export NCCL_P2P_DISABLE=1 worked for me. I'm using 8x A6000; loading the model with vLLM was hanging, followed by a core dump after a very long wait.
>
> Thanks very much.

Thank you very much. I have fixed the problem: I have multiple network cards, so I used NCCL_SOCKET_IFNAME=eth0 to select the correct one.
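
For anyone else with multiple network cards, a minimal sketch of the workaround (the interface name is machine-specific, check `ip addr`, and the model is just the one from the original report; on a multi-node cluster the variable likely also needs to be exported before `ray start` on every node so the Ray workers inherit it):

```python
import os

# Pin NCCL's socket traffic to the NIC that actually connects the nodes,
# instead of letting it pick the wrong interface.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # replace with your interface name

from vllm import LLM

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", tensor_parallel_size=2)
```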

huiyeruzhou commented 2 months ago

Try the "ray stop" command; it does work for me.

ChristineSeven commented 1 month ago


How did you fix this (the NCCL watchdog collective operation timeout @davidsyoung posted above)? I got the same issue with vLLM version 0.3.3 on 2x A100 cards. Thanks in advance.