Closed kinglion811 closed 2 weeks ago
I have seen this error (gloo mesh connection failed) with vllm too. I think it is related to your network setup. I wasn't able to find a solution other than using a different set of machines. For me, using H100s from AWS didn't work but using A100s from AWS did work (with the exact same OS software and vllm code).
You might want to set GLOO_SOCKET_IFNAME to your NIC's interface name, but that didn't help for me. Other than that, you might want to disable all network interfaces except the one you are using.
In your case, perhaps gloo only needs eth0 since my understanding is that gloo is only used for some low bandwidth coordination between the nodes using a CPU process group (at least for vllm).
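For anyone trying the GLOO_SOCKET_IFNAME suggestion above, here is a minimal Python sketch, not taken from sglang or vllm. It checks that the interface name actually exists on the node before pinning gloo to it, since a typo in the name is easy to miss. The helper names `available_interfaces` and `pin_gloo_interface` are made up for illustration, and `eth0` in the usage note is just the control-network interface described in this issue.

```python
import os
import socket


def available_interfaces():
    """Return the names of the network interfaces visible to this process."""
    return [name for _index, name in socket.if_nameindex()]


def pin_gloo_interface(ifname, env=None):
    """Set GLOO_SOCKET_IFNAME in env (os.environ by default) if ifname exists.

    Hypothetical helper: it must run before the process group is created,
    because the variable is read when gloo sets up its sockets.
    """
    target = os.environ if env is None else env
    names = available_interfaces()
    if ifname not in names:
        raise ValueError(f"interface {ifname!r} not found, available: {names}")
    target["GLOO_SOCKET_IFNAME"] = ifname
    return target
```

Usage would be something like `pin_gloo_interface("eth0")` at the top of the launch script on each node. As noted above, this may not be enough on its own.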
With GLOO_SOCKET_IFNAME set to eth0 (not the RoCE/IB network), there is no gloo error, but if --nccl-init-addr is set to eth1's IP, the error is
If --nccl-init-addr is set to eth0's IP, the error is
What is the --nccl-init-addr?
ref https://github.com/sgl-project/sglang?tab=readme-ov-file#run-llama-31-405b
That doesn't solve my problem.
The README says GLOO, but you can also try setting NCCL_SOCKET_IFNAME=<your interface name>.
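A rough sketch of what setting both variables could look like when the server is started from a Python wrapper. The function name `socket_ifname_env` is hypothetical, and using the same interface for gloo and NCCL is an assumption for illustration; on a node with separate control and IB networks you may want different values.

```python
import os


def socket_ifname_env(ifname, base_env=None):
    """Return a copy of the environment with both socket interface variables set.

    GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME are the variables named in this
    thread. The copy leaves the caller's environment untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GLOO_SOCKET_IFNAME"] = ifname
    env["NCCL_SOCKET_IFNAME"] = ifname
    return env
```

The returned dict can be passed as `env=` to `subprocess.run(...)` together with whatever launch command you already use, so the variables only apply to that child process.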
I solved it by running docker with the args "--shm-size=1g --ulimit memlock=-1". ref: https://help.aliyun.com/zh/egs/support/faq-1 I guess it is not related to sglang; it is purely an NCCL issue.
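To make the docker fix concrete, a small sketch that assembles the `docker run` command with those two arguments. The helper `docker_run_command` is made up for illustration, `sglang:latest` is the image mentioned under Reproduction, and the command run inside the container is left to the caller because the original launch command is not shown in this thread.

```python
import shlex


def docker_run_command(image, inner_cmd, shm_size="1g", extra_args=()):
    """Build a docker run argv with a larger /dev/shm and unlimited memlock.

    Only --shm-size and --ulimit memlock=-1 come from the comment above.
    Anything else (GPU or network flags) must be supplied via extra_args.
    """
    return [
        "docker", "run",
        f"--shm-size={shm_size}",
        "--ulimit", "memlock=-1",
        *extra_args,
        image,
        *inner_cmd,
    ]


if __name__ == "__main__":
    # "nvidia-smi" is only a placeholder command for the example.
    print(shlex.join(docker_run_command("sglang:latest", ["nvidia-smi"])))
```

Building the command as a list avoids shell quoting problems if it is later passed to `subprocess.run`.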
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
Checklist
Describe the bug
I have two nodes, node1 and node2. On each node, eth0 is the control network and eth1 to eth8 are the GPU IB network. I run Llama 3.1 405B with sglang like this:
the error is
Reproduction
sglang:latest
Environment