Open dmakhervaks opened 1 month ago
cc @Ying1123
Set the environment variable:
export GLOO_SOCKET_IFNAME=eth0
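Before relying on this setting, it may help to confirm that the interface actually exists inside the container, since the name there is not guaranteed to be eth0. Below is a minimal sketch, not part of sglang; the helper name `check_gloo_ifname` is hypothetical. It assumes a comma-separated value is allowed, as the PyTorch documentation describes for Gloo.

```python
import os
import socket


def check_gloo_ifname(env=None):
    """Compare GLOO_SOCKET_IFNAME against the interfaces visible on this host.

    Hypothetical helper for illustration only; it does not touch Gloo itself.
    """
    env = os.environ if env is None else env
    raw = env.get("GLOO_SOCKET_IFNAME", "")
    # Assumption: the value may list several interfaces separated by commas.
    requested = [name.strip() for name in raw.split(",") if name.strip()]
    # socket.if_nameindex() returns (index, name) pairs for local interfaces.
    available = [name for _, name in socket.if_nameindex()]
    missing = [name for name in requested if name not in available]
    return {"requested": requested, "available": available, "missing": missing}


# Example: print(check_gloo_ifname()) inside each node's container.
```

If `missing` is non-empty on either node, the interface name probably needs to be changed to one listed under `available`.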
@merrymercy After setting the environment variable on each node:
export GLOO_SOCKET_IFNAME=eth0
and then executing the following commands on node 1 and node 2 respectively:
GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B --tp 4 --nccl-init 10.53.1.111:9009 --nnodes 2 --node-rank 0
GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B --tp 4 --nccl-init 10.53.1.111:9009 --nnodes 2 --node-rank 1
this is what I see on node 2:
Checklist
Describe the bug
I am trying to run a model on 2 nodes, but I am seeing issues related to ProcessGroupGloo. It looks like some sort of networking issue.
I am running your latest Docker image (v0.2.7-cu121) on two separate nodes.
I launch the following python3 commands, one inside the Docker container on each node:
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B --tp 4 --nccl-init 10.53.1.111:9009 --nnodes 2 --node-rank 0
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B --tp 4 --nccl-init 10.53.1.111:9009 --nnodes 2 --node-rank 1
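Since both commands point at 10.53.1.111:9009, one way to narrow down a suspected networking problem is to test plain TCP reachability of that address from inside the second node's container. The sketch below is only an illustration; `can_connect` is a hypothetical helper, not an sglang or PyTorch API. The port is only expected to accept connections while the rank 0 process is listening, so a `False` result can mean either that rank 0 has not started yet or that the route or port is blocked.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example, run on the second node while rank 0 is starting up:
# print(can_connect("10.53.1.111", 9009))
```

If this returns `False` from inside the container but the same check succeeds on the host, the container's network mode is a likely place to look.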
Reproduction
meta-llama/Meta-Llama-3.1-8B
Environment