ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[Core]: TCPStore is not available (with vLLM) #43756

Open rchernan-dell opened 5 months ago

rchernan-dell commented 5 months ago

What happened + What you expected to happen

When vLLM hands off the work of running an api_server across 2 machines to Ray, the expectation is that Ray executes the workers successfully and the service comes up. Instead, the following error is thrown:
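For context, a minimal sketch of the kind of launch that hits this path is below. The model name is a placeholder and the Ray cluster is assumed to have been started separately on both machines; this is not the exact script from the report.

import ray
from vllm import LLM, SamplingParams

# Connect to the existing two-node Ray cluster (started with `ray start --head`
# on the head node and `ray start --address=<head-ip>:6379` on the second node).
ray.init(address="auto")

# tensor_parallel_size=2 makes vLLM place its workers on the cluster via Ray;
# the cross-node collective setup then goes through the cupy NCCL backend that
# raises "TCPStore is not available" here.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)  # placeholder model
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))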

Error Type: TASK_EXECUTION_EXCEPTION

/python3.9/site-packages/cupyx/distributed/_nccl_comm.py", line 97, in _init_with_tcp_store
    self._store_proxy.barrier()
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/cupyx/distributed/_store.py", line 152, in barrier
    self._send_recv(_store_actions.Barrier())
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/cupyx/distributed/_store.py", line 142, in _send_recv
    raise RuntimeError('TCPStore is not available')
RuntimeError: TCPStore is not available

Versions / Dependencies

2 machines

Both machines:

Reproduction script

Issue Severity

High: It blocks me from completing my task.

nkwangleiGIT commented 5 months ago

same issue here

rkooo567 commented 5 months ago

This must be a cupy issue, not a Ray one. Basically, something is not going well with the combination of your environment and the cupy NCCL backend.
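To sanity-check that outside of Ray and vLLM, a rough diagnostic sketch (mine, not from this thread, and assuming I am reading the cupyx.distributed API correctly) is to bring up cupy's NCCL process group directly on both nodes. The host IP, port, and ranks are placeholders to fill in; the TCPStore handshake that fails in the traceback above happens inside init_process_group.

import cupy
import cupyx.distributed

HEAD_IP = "192.168.1.37"  # assumption: IP of the node hosting the TCP store
RANK = 0                  # 0 on the head node, 1 on the second node

# The TCPStore handshake (the step failing in _init_with_tcp_store) runs here.
comm = cupyx.distributed.init_process_group(
    2, RANK, backend="nccl", host=HEAD_IP, port=13333)

# If init succeeds, a tiny all-reduce confirms the NCCL transport itself works.
x = cupy.ones(4, dtype=cupy.float32)
out = cupy.empty_like(x)
comm.all_reduce(x, out)
print(out)  # expect [2. 2. 2. 2.] once both ranks participate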

As a workaround, I believe you can pass enforce_eager=True to vLLM (it disables the cupy backend).
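Concretely, that would look something like the call below (mirroring the invocation shown in the next comment):

from vllm import LLM

# enforce_eager=True skips CUDA graph capture; per the comment above, the cupy
# NCCL backend is only pulled in for that path, so eager mode avoids it.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ",
          tensor_parallel_size=2, enforce_eager=True)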

gentleman-turk commented 5 months ago

Setting enforce_eager=True yields a new bug:

INFO 03-26 13:20:22 llm_engine.py:87] Initializing an LLM engine with config: model='TheBloke/Llama-2-7b-Chat-AWQ', tokenizer='TheBloke/Llama-2-7b-Chat-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
Dell-Dev-U:2803813:2803813 [0] NCCL INFO Bootstrap : Using eno1:192.168.1.37<0>
Dell-Dev-U:2803813:2803813 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

Dell-Dev-U:2803813:2803813 [0] init.cc:1270 NCCL WARN Invalid config blocking attribute value -2147483648
Traceback (most recent call last):
  File "/home/rch/dev/ubiquitous-distributed-ai/ray-vllm-wsl/exampleawq.py", line 16, in <module>
    llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ", tensor_parallel_size=2, enforce_eager=True)
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 391, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 126, in __init__
    self._init_workers_ray(placement_group)
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 304, in _init_workers_ray
    self._run_workers("init_model",
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 1041, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/vllm/worker/worker.py", line 94, in init_model
    init_distributed_environment(self.parallel_config, self.rank,
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/vllm/worker/worker.py", line 283, in init_distributed_environment
    torch.distributed.all_reduce(torch.zeros(1).cuda())
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:219, invalid argument, NCCL version 2.14.3
ncclInvalidArgument: Invalid value for an argument.
Last error: Invalid config blocking attribute value -2147483648