Open JKYtydt opened 4 months ago
I had the same problem. I solved it by setting the variables:
os.environ['GLOO_SOCKET_IFNAME'] = 'ib0'
os.environ['TP_SOCKET_IFNAME'] = 'ib0'
In addition, I had to remove these variables from the environment:
http_proxy
https_proxy
ftp_proxy
ray==2.24.0 vllm==0.4.3
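For reference, a minimal sketch of this workaround in Python (assuming the InfiniBand interface is named ib0 and that the settings are applied before vLLM/Ray are initialized; adjust the interface name to your machine):

import os

# Force Gloo and TensorPipe to use the chosen network interface (assumed to be ib0 here).
os.environ['GLOO_SOCKET_IFNAME'] = 'ib0'
os.environ['TP_SOCKET_IFNAME'] = 'ib0'

# Drop proxy settings that can interfere with direct node-to-node TCP connections.
for proxy_var in ('http_proxy', 'https_proxy', 'ftp_proxy'):
    os.environ.pop(proxy_var, None)

# Only import/initialize vLLM after the environment has been adjusted.
from vllm import LLM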
@thies1006 did you try https://docs.vllm.ai/en/latest/getting_started/debugging.html , especially the sanity check script? I assume it should catch your problem.
I believe it is caused by GLOO_SOCKET_IFNAME. TP_SOCKET_IFNAME is for https://github.com/pytorch/tensorpipe, and gloo should not use http/https.
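A quick way to check which of these variables are actually set in the environment that launches vLLM (a plain standard-library snippet, not part of vLLM):

import os

# Print the current value (or None) of each variable discussed above.
for name in ('GLOO_SOCKET_IFNAME', 'TP_SOCKET_IFNAME', 'http_proxy', 'https_proxy', 'ftp_proxy'):
    print(f"{name}={os.environ.get(name)!r}")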
Thank you very much for the solution you provided. After trying it, I still get the same error. These are the two lines of code I added; I'm not sure whether there is a problem with them:
os.environ['GLOO_SOCKET_IFNAME'] = 'eth0'
os.environ['TP_SOCKET_IFNAME'] = 'eth0'
Thank you very much for the suggestion. I used the script to test GPU communication and it produced an error; this is probably what is preventing the experiment from continuing. Do you have any direction for solving it?
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
data = torch.FloatTensor([1,] * 128).to(f"cuda:{local_rank}")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
assert value == dist.get_world_size()
The error is as follows:
[E socket.cpp:957] [c10d] The client socket has timed out after 60s while trying to connect to (192.168.41.79, 29502).
Traceback (most recent call last):
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 157, in _create_tcp_store
store = TCPStore(
torch.distributed.DistNetworkError: The client socket has timed out after 60s while trying to connect to (192.168.41.79, 29502).
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jky/miniconda3/envs/ray/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 235, in launch_agent
rdzv_handler=rdzv_registry.get_rendezvous_handler(rdzv_parameters),
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66, in get_rendezvous_handler
return handler_registry.create_handler(params)
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/api.py", line 263, in create_handler
handler = creator(params)
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36, in _create_c10d_handler
backend, store = create_backend(params)
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 255, in create_backend
store = _create_tcp_store(params)
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 181, in _create_tcp_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
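When the client socket times out like this, one simple thing to check is whether the rendezvous endpoint is reachable over plain TCP from the failing node at all. A minimal check using only the standard library and the address/port from the error above (run it while the master side is still waiting, so that its TCPStore is listening):

import socket

try:
    # Attempt a raw TCP connection to the rendezvous endpoint from the error message.
    with socket.create_connection(("192.168.41.79", 29502), timeout=5):
        print("TCP connection to 192.168.41.79:29502 succeeded")
except OSError as exc:
    print(f"TCP connection failed: {exc}")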
Hi @JKYtydt Here are my steps to run on two nodes:
1. ray stop (on all nodes)
2. export GLOO_SOCKET_IFNAME='<device_name>' (not sure, but I think it is enough to set this on the head node only)
3. ray start --head --node-ip-address <ip_address head> (on the head node) and ray start --address='<ip_address head>' (on the other nodes)
4. python -m vllm.entrypoints.openai.api_server --model <model_name> --tensor-parallel-size <world_size>
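Before the last step it can also help to confirm from Python that the Ray cluster actually spans both nodes; a small check using the standard Ray API, run on the head node after the ray start commands above:

import ray

# Attach to the cluster that was started with `ray start`.
ray.init(address="auto")

# With two single-GPU nodes this should report 2.0 GPUs in total.
print(ray.cluster_resources())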
@youkaichao Hello, I have now tried every solution I could find and still have not been able to resolve this issue. Do you have any other suggestions, or is there any other information I should provide to help resolve it? Thank you very much.
I am running this on two computers under Windows; both are Ubuntu systems, and the two nodes can ping each other. Based on the first reply you gave me, it should be that the GPUs on the two nodes cannot communicate. I don't know whether this is helpful to you.
@thies1006 Thank you very much for your reply; I still have not been able to solve this problem.
While testing with this script, I found another problem: if the --rdzv_backend=c10d parameter is not set, running the following command produces the error below.
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
data = torch.FloatTensor([1,] * 128).to(f"cuda:{local_rank}")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
assert value == dist.get_world_size()
torchrun --nnodes 2 --nproc-per-node 1 --rdzv_endpoint=192.168.41.79:29502 test.py
Traceback (most recent call last):
File "/home/jky/miniconda3/envs/ray/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent
result = agent.run()
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run
result = self._invoke_run(role)
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 870, in _invoke_run
self._initialize_workers(self._worker_group)
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
self._rendezvous(worker_group)
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 548, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
torch.distributed.DistStoreError: Timed out after 901 seconds waiting for clients. 1/2 clients joined.
Your current environment
Python==3.10.14 vllm==0.5.0.post1 ray==2.24.0
Node status
Active:
 1 node_37c2b26800cc853721ef351ca107c298ae77efcb5504d8e0c900ed1d
 1 node_62d48658974f4114465450f53fd97c10fcfe6d40b4e896a90a383682
Pending:
 (no pending nodes)
Recent failures:
 (no failures)
Resources
Usage:
 0.0/52.0 CPU
 0.0/2.0 GPU
 0B/9.01GiB memory
 0B/4.14GiB object_store_memory

Demands:
 (no resource demands)
🐛 Describe the bug
I ran into a problem when Gloo establishes the full-mesh connection, and I have not found a solution. The script is as follows (the LLM construction line matches the one shown in the traceback below):

from vllm import LLM

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

llm = LLM(model="/mnt/d/llm/qwen/qwen1.5_0.5b", trust_remote_code=True, gpu_memory_utilization=0.4, enforce_eager=True, tensor_parallel_size=2, swap_space=1)

outputs = llm.generate(prompts)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
The error is as follows:
rank0: Traceback (most recent call last):
rank0: File "/data/vllm_test.py", line 13, in <module>
rank0: llm = LLM(model="/mnt/d/llm/qwen/qwen1.5_0.5b", trust_remote_code=True, gpu_memory_utilization=0.4,enforce_eager=True,tensor_parallel_size=2,swap_space=1)
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 144, in __init__
rank0: self.llm_engine = LLMEngine.from_engine_args(
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 363, in from_engine_args
rank0: engine = cls(
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 223, in __init__
rank0: self.model_executor = executor_class(
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
rank0: super().__init__(*args, **kwargs)
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 171, in _init_workers_ray
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
rank0: driver_worker_output = self.driver_worker.execute_method(
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
rank0: raise e
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
rank0: return executor(*args, **kwargs)
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker.py", line 115, in init_device
rank0: init_worker_distributed_environment(self.parallel_config, self.rank,
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker.py", line 354, in init_worker_distributed_environment
rank0: init_distributed_environment(parallel_config.world_size, rank,
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 553, in init_distributed_environment
rank0: _WORLD = GroupCoordinator(
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 120, in __init__
rank0: cpu_group = torch.distributed.new_group(ranks, backend="gloo")
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
rank0: func_return = func(*args, **kwargs)
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
rank0: return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag
rank0: pg, pg_store = _new_process_group_helper(
rank0: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
rank0: backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
rank0: RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error