kubeagi / arcadia

A diverse, simple, and secure one-stop LLMOps platform
http://www.kubeagi.com/
Apache License 2.0

vLLM: RayWorkerVllm.execute_method failed with error 'TCPStore is not available' #878

Closed. nkwangleiGIT closed this issue 6 months ago.

nkwangleiGIT commented 6 months ago

Environment: Ray v2.9.3, vLLM v0.3.3

RayWorkerVllm.execute_method failed with the call stack below:

Error Type: TASK_EXECUTION_EXCEPTION

  File "/home/ray/anaconda3/lib/python3.9/site-packages/cupyx/distributed/_nccl_comm.py", line 97, in _init_with_tcp_store
    self._store_proxy.barrier()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/cupyx/distributed/_store.py", line 152, in barrier
    self._send_recv(_store_actions.Barrier())
  File "/home/ray/anaconda3/lib/python3.9/site-packages/cupyx/distributed/_store.py", line 142, in _send_recv
    raise RuntimeError('TCPStore is not available')
RuntimeError: TCPStore is not available

Related issues on Ray and vLLM: https://github.com/ray-project/ray/issues/43756 and https://github.com/vllm-project/vllm/issues/3334
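For context, the failure appears when vLLM is launched with Ray-based tensor parallelism: in vLLM 0.3.x, CUDA graph capture uses CuPy's NCCL communicator, which is initialized through a TCPStore. A minimal reproduction sketch of such a launch (the model name and parallelism degree are placeholders, not the actual arcadia deployment values):

```python
# Hypothetical reproduction sketch, not the arcadia deployment itself.
# tensor_parallel_size > 1 makes vLLM spawn RayWorkerVllm workers; in vLLM 0.3.x
# CUDA graph capture then initializes CuPy's NCCL communicator via a TCPStore,
# which is where "TCPStore is not available" is raised.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",   # placeholder model
    tensor_parallel_size=2,      # >1 routes execution through Ray workers
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```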


nkwangleiGIT commented 6 months ago

Add --enforce-eager to EXTRA_ARGS to disable the CuPy path (eager mode skips CUDA graph capture), and vLLM mode then works normally.
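For reference, the flag maps to the enforce_eager option in vLLM's Python API; a minimal sketch of the same workaround outside arcadia (the model name is a placeholder):

```python
# Equivalent of passing --enforce-eager on the command line.
# Eager mode skips CUDA graph capture, so the CuPy/TCPStore NCCL path is never taken.
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",   # placeholder model
    tensor_parallel_size=2,      # Ray workers are still used for tensor parallelism
    enforce_eager=True,          # disable CUDA graphs, avoiding the CuPy initialization
)
```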

bjwswang commented 6 months ago

Should be fixed by https://github.com/kubeagi/arcadia/pull/884.