vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Error when running distributed inference with vLLM + Ray #5779

Open JKYtydt opened 3 months ago

JKYtydt commented 3 months ago

Your current environment

Python==3.10.14 vllm==0.5.0.post1 ray==2.24.0

Node status

Active:
 1 node_37c2b26800cc853721ef351ca107c298ae77efcb5504d8e0c900ed1d
 1 node_62d48658974f4114465450f53fd97c10fcfe6d40b4e896a90a383682
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources

Usage:
 0.0/52.0 CPU
 0.0/2.0 GPU
 0B/9.01GiB memory
 0B/4.14GiB object_store_memory

Demands: (no resource demands)

🐛 Describe the bug

I ran into a problem when Gloo tries to establish the full-mesh connection and have not found a solution. The script is as follows:

from vllm import LLM

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

llm = LLM(model="/mnt/d/llm/qwen/qwen1.5_0.5b", trust_remote_code=True, gpu_memory_utilization=0.4, enforce_eager=True, tensor_parallel_size=2, swap_space=1)

outputs = llm.generate(prompts)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The error is as follows:

rank0: Traceback (most recent call last):
rank0:   File "/data/vllm_test.py", line 13, in <module>
rank0:     llm = LLM(model="/mnt/d/llm/qwen/qwen1.5_0.5b", trust_remote_code=True, gpu_memory_utilization=0.4, enforce_eager=True, tensor_parallel_size=2, swap_space=1)
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 144, in __init__
rank0:     self.llm_engine = LLMEngine.from_engine_args(
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 363, in from_engine_args
rank0:     engine = cls(
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 223, in __init__
rank0:     self.model_executor = executor_class(
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
rank0:     super().__init__(*args, **kwargs)
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 171, in _init_workers_ray
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
rank0:     driver_worker_output = self.driver_worker.execute_method(
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
rank0:     raise e
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
rank0:     return executor(*args, **kwargs)
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker.py", line 115, in init_device
rank0:     init_worker_distributed_environment(self.parallel_config, self.rank,
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker.py", line 354, in init_worker_distributed_environment
rank0:     init_distributed_environment(parallel_config.world_size, rank,
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 553, in init_distributed_environment
rank0:     _WORLD = GroupCoordinator(
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 120, in __init__
rank0:     cpu_group = torch.distributed.new_group(ranks, backend="gloo")
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
rank0:     func_return = func(*args, **kwargs)
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
rank0:     return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag
rank0:     pg, pg_store = _new_process_group_helper(
rank0:   File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
rank0:     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
rank0: RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error

thies1006 commented 3 months ago

I had the same problem. I solved it by setting the variables:

os.environ['GLOO_SOCKET_IFNAME'] = 'ib0'
os.environ['TP_SOCKET_IFNAME'] = 'ib0'

In addition, I had to remove these variables from the environment:

http_proxy
https_proxy
ftp_proxy

ray==2.24.0 vllm==0.4.3
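
For illustration, the adjustments above could be applied at the top of the launch script roughly like this (a sketch, not thies1006's exact setup; 'ib0' is a placeholder for whatever interface connects your nodes, and the model path is the one from the original report):

import os

# Pin Gloo and TensorPipe traffic to the NIC that connects the two nodes.
# 'ib0' is a placeholder; use the interface shown by `ip addr` on your machines.
os.environ['GLOO_SOCKET_IFNAME'] = 'ib0'
os.environ['TP_SOCKET_IFNAME'] = 'ib0'

# Proxy settings can hijack the node-to-node TCP connections that Gloo opens.
for var in ('http_proxy', 'https_proxy', 'ftp_proxy',
            'HTTP_PROXY', 'HTTPS_PROXY', 'FTP_PROXY'):
    os.environ.pop(var, None)

from vllm import LLM  # import only after the environment is prepared

llm = LLM(model="/mnt/d/llm/qwen/qwen1.5_0.5b",
          trust_remote_code=True,
          tensor_parallel_size=2,
          enforce_eager=True,
          gpu_memory_utilization=0.4,
          swap_space=1)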

youkaichao commented 3 months ago

@thies1006 did you try https://docs.vllm.ai/en/latest/getting_started/debugging.html, especially the sanity check script? I assume it should catch your problem.

I believe it is caused by GLOO_SOCKET_IFNAME.

TP_SOCKET_IFNAME is for https://github.com/pytorch/tensorpipe, and Gloo should not use http/https.
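
To pick a sensible value for GLOO_SOCKET_IFNAME it helps to know which interfaces each node actually has; a tiny standard-library check, run on every node, might look like this (interface names such as eth0 or ib0 differ per machine):

import socket

# List this node's network interfaces, e.g. [(1, 'lo'), (2, 'eth0'), ...];
# GLOO_SOCKET_IFNAME should name the interface that carries the inter-node IP.
print(socket.if_nameindex())

# The IP this host resolves its own hostname to (not necessarily the NIC you want).
print(socket.gethostbyname(socket.gethostname()))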

JKYtydt commented 3 months ago

(Quoting @thies1006's suggestion above: set GLOO_SOCKET_IFNAME and TP_SOCKET_IFNAME and remove the http_proxy / https_proxy / ftp_proxy variables from the environment; ray==2.24.0, vllm==0.4.3.)

Thank you very much for the suggested fix. After trying it I still get the same error. The two lines I added are the following; I am not sure whether there is a problem with them:

os.environ['GLOO_SOCKET_IFNAME'] = 'eth0'
os.environ['TP_SOCKET_IFNAME'] = 'eth0'

JKYtydt commented 3 months ago

(Quoting @youkaichao's reply above about the debugging guide at https://docs.vllm.ai/en/latest/getting_started/debugging.html, the sanity check script, and GLOO_SOCKET_IFNAME / TP_SOCKET_IFNAME.)

Thank you very much for the advice. I tested GPU communication with the sanity check script and it raised an error, which is presumably what keeps the experiment from continuing. Do you have any idea how to solve it? The script I ran:

import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
data = torch.FloatTensor([1,] * 128).to(f"cuda:{local_rank}")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
assert value == dist.get_world_size()

The error is as follows:

[E socket.cpp:957] [c10d] The client socket has timed out after 60s while trying to connect to (192.168.41.79, 29502).
Traceback (most recent call last):
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 157, in _create_tcp_store
    store = TCPStore(
torch.distributed.DistNetworkError: The client socket has timed out after 60s while trying to connect to (192.168.41.79, 29502).

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jky/miniconda3/envs/ray/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 235, in launch_agent
    rdzv_handler=rdzv_registry.get_rendezvous_handler(rdzv_parameters),
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66, in get_rendezvous_handler
    return handler_registry.create_handler(params)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/api.py", line 263, in create_handler
    handler = creator(params)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36, in _create_c10d_handler
    backend, store = create_backend(params)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 255, in create_backend
    store = _create_tcp_store(params)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 181, in _create_tcp_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
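
The timeout above says the client could not open a TCP connection to the rendezvous store at 192.168.41.79:29502. A quick reachability check, run from the non-head node while torchrun is still waiting on the head node (a sketch; address and port are taken from the error message):

import socket

# Only succeeds while something (the torchrun master's TCPStore) is listening on the port.
try:
    with socket.create_connection(("192.168.41.79", 29502), timeout=5):
        print("TCP connection to 192.168.41.79:29502 succeeded")
except OSError as exc:
    print(f"cannot reach 192.168.41.79:29502: {exc}")  # firewall, routing, or proxy issue
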
thies1006 commented 3 months ago

Hi @JKYtydt, here are my steps to run on two nodes.

JKYtydt commented 3 months ago

@youkaichao Hello, I have now tried every solution I could find and still have not managed to resolve this problem. Do you have any other suggestions, or is there any other information I can provide to help get to the bottom of it? Thank you very much.

I am running this on two machines under Windows, both with Ubuntu systems; the two nodes can ping each other. Based on your first reply, it seems the GPUs on the two nodes cannot communicate. I hope this is of some help to you.
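
A hedged aside: NCCL also has to choose a network interface for cross-node traffic, and its standard environment variables can both pin that choice and show what it is doing (eth0 is an assumption; match it to the interface used for GLOO_SOCKET_IFNAME):

import os

# Standard NCCL environment variables, not specific to vLLM or Ray.
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'  # assumption: eth0 is the inter-node NIC
os.environ['NCCL_DEBUG'] = 'INFO'          # make NCCL log which interface/transport it picks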

JKYtydt commented 3 months ago

@thies1006 Thank you very much for your reply, but I still have not been able to solve the problem.

JKYtydt commented 3 months ago

While testing with this script I found another problem: without the --rdzv_backend=c10d parameter, running the command below gives the error shown after it.

import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
data = torch.FloatTensor([1,] * 128).to(f"cuda:{local_rank}")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
assert value == dist.get_world_size()

The command:

torchrun --nnodes 2 --nproc-per-node 1 --rdzv_endpoint=192.168.41.79:29502 test.py

The error:

Traceback (most recent call last):
  File "/home/jky/miniconda3/envs/ray/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent
    result = agent.run()
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run
    result = self._invoke_run(role)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 870, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 548, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
torch.distributed.DistStoreError: Timed out after 901 seconds waiting for clients. 1/2 clients joined.
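
This last error is the server-side view of the same connectivity problem: the static rendezvous store on the head node waited 901 seconds but only one of the two expected clients joined, meaning the worker never reached 192.168.41.79:29502. Complementing the raw socket check above, the store itself can be exercised directly (a sketch following the TCPStore example from the PyTorch documentation; run each part on the node named in the comment):

from datetime import timedelta
from torch.distributed import TCPStore

# --- run on the head node (192.168.41.79) ---
server_store = TCPStore("192.168.41.79", 29502, world_size=2, is_master=True,
                        timeout=timedelta(seconds=60))
print(server_store.get("ping"))   # blocks until the worker sets the key

# --- run on the worker node ---
client_store = TCPStore("192.168.41.79", 29502, world_size=2, is_master=False,
                        timeout=timedelta(seconds=60))
client_store.set("ping", "ok")    # only succeeds if the head node's port is reachable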