Closed: eldarkurtic closed this issue 1 month ago
This is a very detailed issue report with the correct steps! 👍
My guess is that GLOO_SOCKET_IFNAME=enp51s0f1 vllm serve only takes effect on the node where you run it; GLOO_SOCKET_IFNAME=enp51s0f1 may not take effect on the other node.
You can try to add the environment variable when you start the Docker containers:
bash run_cluster.sh \
    vllm/vllm-openai \
    "192.168.201.210" \
    --head \
    "/home/eldar/.cache/huggingface" \
    -e GLOO_SOCKET_IFNAME=enp51s0f1
and
bash run_cluster.sh \
    vllm/vllm-openai \
    "192.168.201.210" \
    --worker \
    "/home/eldar/.cache/huggingface" \
    -e GLOO_SOCKET_IFNAME=enp51s0f1
see if it helps.
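A quick way to confirm the variable actually made it into each container is to check from inside it. Below is a minimal sketch; the helper script name is made up, and the container name node matches the docker exec -it node command used in the report.

# check_env.py (hypothetical helper) -- run inside each container, e.g.
#   docker exec -it node python check_env.py
import os
import socket

iface = os.environ.get("GLOO_SOCKET_IFNAME")
print(f"{socket.gethostname()}: GLOO_SOCKET_IFNAME={iface!r}")
# On Linux the kernel lists network interfaces under /sys/class/net, so we can
# also check that the named interface is visible inside the container.
interfaces = sorted(os.listdir("/sys/class/net"))
print("visible interfaces:", interfaces)
if iface is not None and iface not in interfaces:
    print(f"warning: {iface} is not among the visible interfaces")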
Yes, this resolves the GLOO issue and I am able to run vllm serve
across two nodes. Thanks a lot for the tip!
I'm not getting a gloo error, but just the
Exception ignored in: <function RayGPUExecutorAsync.__del__ at 0x7efd7cfd2ef0>
if self.forward_dag is not None:
AttributeError: 'RayGPUExecutorAsync' object has no attribute 'forward_dag'
in addition to a
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Might this be related? If so, what would be the resolution?
CUDA-capable device(s) is/are busy or unavailable
This might indicate that the GPU is broken. You need to talk to your admin to fix the GPU.
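If it is unclear which GPU is affected, a minimal PyTorch probe along these lines can help narrow it down (just a sketch, not a vLLM utility):

# gpu_probe.py (hypothetical) -- try to create a CUDA context on every visible GPU
import torch

for i in range(torch.cuda.device_count()):
    try:
        torch.zeros(1, device=f"cuda:{i}")  # forces CUDA context creation on GPU i
        print(f"cuda:{i} OK: {torch.cuda.get_device_name(i)}")
    except RuntimeError as err:
        print(f"cuda:{i} FAILED: {err}")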
Your current environment
🐛 Describe the bug
Hi everyone, I am trying to reproduce results from the recent blog on Llama-3.1: https://blog.vllm.ai/2024/07/23/llama31.html. Namely, I am following the docs from https://docs.vllm.ai/en/latest/serving/distributed_serving.html#multi-node-inference-and-serving to set up multi-node serving on two 8xH100 servers.
Step 1) On the first node I am running the run_cluster.sh head-node command, and on the second node the corresponding worker command. This step seems to work fine, as I am seeing the expected output in the console.
Step 2): To verify that ray sees all GPUs from both servers, I run docker exec -it node /bin/bash and then ray status. Its output shows all 16 GPUs, which I assume is an indicator that this stage works fine.
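As a cross-check of the ray status output, the same numbers can be read programmatically (a small sketch, assuming it is run inside the head-node container while the cluster is up):

# ray_check.py (hypothetical) -- attach to the running Ray cluster and count resources
import ray

ray.init(address="auto")  # connect to the existing cluster instead of starting a new one
resources = ray.cluster_resources()
print("GPUs seen by Ray:", resources.get("GPU", 0))  # should report 16.0 for two 8xH100 nodes
print("nodes in the cluster:", len(ray.nodes()))     # should report 2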
Step 3): I am trying to kick off the vllm serve command from the head node, which crashes due to some GLOO-related problem.
Step 4): In order to debug and isolate this problem, based on the docs at https://docs.vllm.ai/en/latest/getting_started/debugging.html, I am trying to run the test.py script on both nodes at the same time. On the head node I am running it like: NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1 test.py, and on the worker node I am running it like: NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=192.168.201.210 test.py. The NCCL part works fine, as I am seeing "NCCL is good!" printed in the console 8 times (once for each rank on the node). Unfortunately, the GLOO part fails with the same error message ([rank7]: RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error) as the vllm serve attempt above.
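For context, a sanity-check script of this kind is roughly along the following lines (a sketch in the spirit of the debugging docs, not necessarily the exact test.py): an all-reduce over NCCL on the GPUs first, then the same all-reduce over a GLOO group on the CPU.

import torch
import torch.distributed as dist

# torchrun provides RANK/WORLD_SIZE/MASTER_ADDR, so env:// initialization works out of the box.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
# NCCL part: all-reduce a tensor of ones across all ranks on the GPUs.
data = torch.ones(128, device="cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
assert data.mean().item() == dist.get_world_size()
print("NCCL is good!")
# GLOO part: the same all-reduce, but on CPU through a GLOO process group;
# this is the part that failed with connectFullMesh in the runs above.
gloo_group = dist.new_group(ranks=list(range(dist.get_world_size())), backend="gloo")
cpu_data = torch.ones(128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
assert cpu_data.mean().item() == dist.get_world_size()
print("sanity check is successful!")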
Step 5): Following the debugging tips from the docs page, based on the output of ip addr show, I am setting GLOO_SOCKET_IFNAME=enp51s0f1 on both servers and rerunning test.py. On the master node the command now looks like this: GLOO_SOCKET_IFNAME=enp51s0f1 NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1 test.py, and on the worker node: GLOO_SOCKET_IFNAME=enp51s0f1 NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=192.168.201.210 test.py. This seems to solve the problem and the test.py script runs successfully: the expected "sanity check is successful!" output is printed 8 times in the console.
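To make the interface-picking step repeatable, a tiny wrapper around ip -o -4 addr show (a hypothetical helper, not part of vLLM) can list each interface with its IPv4 address, so the one sharing a subnet with the other node stands out:

# list_ifaces.py (hypothetical)
import subprocess

out = subprocess.run(
    ["ip", "-o", "-4", "addr", "show"],
    capture_output=True, text=True, check=True,
).stdout
for line in out.splitlines():
    fields = line.split()
    # one-line format: <index>: <ifname> inet <addr>/<prefix> ...
    print(f"{fields[1]:>16}  {fields[3]}")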
Step 6): With the hope that setting GLOO_SOCKET_IFNAME=enp51s0f1 is the solution to the original GLOO issue I had with vllm serve, I went back to rerun the command on the head node with this env variable as well. The command now looks like this: GLOO_SOCKET_IFNAME=enp51s0f1 vllm serve /home/meta-llama/Meta-Llama-3.1-8B-Instruct -tp 8 -pp 2. Unfortunately, I am again seeing the same error as without this env variable.
Step 7): In a desperate attempt to figure out what is going on, I have tried to apply all suggestions from the debugging docs page, and the command I ran looks like this: GLOO_SOCKET_IFNAME=enp51s0f1 VLLM_LOGGING_LEVEL=DEBUG CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=TRACE VLLM_TRACE_FUNCTION=1 vllm serve /home/meta-llama/Meta-Llama-3.1-8B-Instruct -tp 8 -pp 2. Unfortunately, I am not seeing anything different in the output compared to the previous runs.
Sorry for the very long post; I have tried to provide as much information as possible. Any suggestions on what to try next are appreciated. I assume there is something problematic in the interaction between vLLM and GLOO, given that the test.py example worked just fine for both NCCL and GLOO.