ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[Core]: TCPStore is not available (with vLLM) #43756

Open rchernan-dell opened 5 months ago

rchernan-dell commented 5 months ago

What happened + What you expected to happen

When vLLM hands off the work of running an api_server across 2 machines to Ray, the expectation is that Ray executes the workers successfully and the service comes up. Instead, the following error is thrown:
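For context, a minimal sketch of the kind of launch that hits this path is below. The model name is a placeholder and the Ray cluster is assumed to have been started separately on both machines; this is not the exact script from the report.

import ray
from vllm import LLM, SamplingParams

# Connect to the existing two-node Ray cluster (started with `ray start --head`
# on the head node and `ray start --address=<head-ip>:6379` on the second node).
ray.init(address="auto")

# tensor_parallel_size=2 makes vLLM place its workers on the cluster via Ray;
# the cross-node collective setup then goes through the cupy NCCL backend that
# raises "TCPStore is not available" here.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)  # placeholder model
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))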

Error Type: TASK_EXECUTION_EXCEPTION

/python3.9/site-packages/cupyx/distributed/_nccl_comm.py", line 97, in _init_with_tcp_store
    self._store_proxy.barrier()
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/cupyx/distributed/_store.py", line 152, in barrier
    self._send_recv(_store_actions.Barrier())
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/cupyx/distributed/_store.py", line 142, in _send_recv
    raise RuntimeError('TCPStore is not available')
RuntimeError: TCPStore is not available

Versions / Dependencies

2 machines

Both machines:

Reproduction script

Issue Severity

High: It blocks me from completing my task.

nkwangleiGIT commented 5 months ago

same issue here

rkooo567 commented 5 months ago

This must be a cupy issue, not a Ray one. Basically, something is not going well with the combination of your environment and the cupy NCCL backend.
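To sanity-check that outside of Ray and vLLM, a rough diagnostic sketch (mine, not from this thread, and assuming I am reading the cupyx.distributed API correctly) is to bring up cupy's NCCL process group directly on both nodes. The host IP, port, and ranks are placeholders to fill in; the TCPStore handshake that fails in the traceback above happens inside init_process_group.

import cupy
import cupyx.distributed

HEAD_IP = "192.168.1.37"  # assumption: IP of the node hosting the TCP store
RANK = 0                  # 0 on the head node, 1 on the second node

# The TCPStore handshake (the step failing in _init_with_tcp_store) runs here.
comm = cupyx.distributed.init_process_group(
    2, RANK, backend="nccl", host=HEAD_IP, port=13333)

# If init succeeds, a tiny all-reduce confirms the NCCL transport itself works.
x = cupy.ones(4, dtype=cupy.float32)
out = cupy.empty_like(x)
comm.all_reduce(x, out)
print(out)  # expect [2. 2. 2. 2.] once both ranks participate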

As a workaround, I believe you can pass enforce_eager=True to vLLM (it disables the cupy backend).
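Concretely, that would look something like the call below (mirroring the invocation shown in the next comment):

from vllm import LLM

# enforce_eager=True skips CUDA graph capture; per the comment above, the cupy
# NCCL backend is only pulled in for that path, so eager mode avoids it.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ",
          tensor_parallel_size=2, enforce_eager=True)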

gentleman-turk commented 5 months ago

Setting enforce_eager=True yields a new bug:

INFO 03-26 13:20:22 llm_engine.py:87] Initializing an LLM engine with config: model='TheBloke/Llama-2-7b-Chat-AWQ', tokenizer='TheBloke/Llama-2-7b-Chat-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
Dell-Dev-U:2803813:2803813 [0] NCCL INFO Bootstrap : Using eno1:192.168.1.37<0>
Dell-Dev-U:2803813:2803813 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

Dell-Dev-U:2803813:2803813 [0] init.cc:1270 NCCL WARN Invalid config blocking attribute value -2147483648
Traceback (most recent call last):
  File "/home/rch/dev/ubiquitous-distributed-ai/ray-vllm-wsl/exampleawq.py", line 16, in <module>
    llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ", tensor_parallel_size=2, enforce_eager=True)
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 391, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 126, in __init__
    self._init_workers_ray(placement_group)
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 304, in _init_workers_ray
    self._run_workers("init_model",
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 1041, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/vllm/worker/worker.py", line 94, in init_model
    init_distributed_environment(self.parallel_config, self.rank,
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/vllm/worker/worker.py", line 283, in init_distributed_environment
    torch.distributed.all_reduce(torch.zeros(1).cuda())
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/rch/miniconda3/envs/ray9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:219, invalid argument, NCCL version 2.14.3
ncclInvalidArgument: Invalid value for an argument.
Last error: Invalid config blocking attribute value -2147483648