sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.

https://sglang.readthedocs.io/en/latest/

Apache License 2.0

run llama 3.1 405B with multi node has tp server error [Bug] #868

Closed kinglion811 closed 2 weeks ago

kinglion811 commented 2 months ago

Checklist

[ ] 1. I have searched related issues but cannot get the expected help.
[X] 2. The bug has not been fixed in the latest version.
[ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I have two nodes, node1 and node2. On each node, eth0 is the controller network and eth1 to eth8 are the GPU IB network. I ran Llama 3.1 405B with sglang like this:

The error is:

Reproduction

sglang:latest

Environment

Two nodes, node1 and node2. On each node:
eth0 is the controller network
eth1 to eth8 are the GPU IB network

zhyncs commented 2 months ago

ref https://github.com/sgl-project/sglang?tab=readme-ov-file#run-llama-31-405b

min-xu-et commented 2 months ago

I have seen this error (gloo mesh connection failed) with vllm too. I think it is related to your network setup. I wasn't able to find a solution other than using a different set of machines. For me, using H100s from AWS didn't work but using A100s from AWS did work (with the exact same OS, software, and vllm code).

You might want to set GLOO_SOCKET_IFNAME to your NIC interface, but that didn't help for me. Other than that, you might want to disable all network interfaces except the one you are using.

min-xu-et commented 2 months ago

In your case, perhaps gloo only needs eth0 since my understanding is that gloo is only used for some low bandwidth coordination between the nodes using a CPU process group (at least for vllm).

kinglion811 commented 2 months ago

> In your case, perhaps gloo only needs eth0 since my understanding is that gloo is only used for some low bandwidth coordination between the nodes using a CPU process group (at least for vllm).

GLOO_SOCKET_IFNAME is set to eth0 (not the RoCE or IB network), and there is no gloo error. But if --nccl-init-addr is set to eth1's IP, the error is:

If --nccl-init-addr is set to eth0's IP, the error is:

What is --nccl-init-addr for?
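One way to narrow this down is to check, from the second node, whether the address passed to --nccl-init-addr accepts TCP connections at all. The Python sketch below assumes the value has the form host:port and that the first node is already listening there. The helper names `parse_host_port` and `can_connect` are hypothetical and not part of sglang.

```python
import socket


def parse_host_port(addr):
    """Split 'host:port' into (host, port). Assumes an IPv4 address or hostname."""
    host, _, port = addr.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {addr!r}")
    return host, int(port)


def can_connect(addr, timeout=3.0):
    """Return True if a TCP connection to addr succeeds within timeout seconds."""
    host, port = parse_host_port(addr)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Calling `can_connect("<ip>:<port>")` with the same value given to --nccl-init-addr, once for the eth0 address and once for the eth1 address, would show which network is actually reachable between the two nodes. A successful TCP connection only rules out basic routing and firewall problems; it says nothing about the IB or RoCE path.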

kinglion811 commented 2 months ago

> ref https://github.com/sgl-project/sglang?tab=readme-ov-file#run-llama-31-405b

That doesn't solve my problem.

hrukalive commented 2 months ago

The README mentions GLOO, but you can also try setting NCCL_SOCKET_IFNAME=<your interface name>.
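As a rough sketch of the interface-pinning suggestions in this thread, the Python snippet below lists the interfaces visible on a node and builds an environment with GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME set. The helper name `build_launch_env` and the choice of eth0 for both variables are illustrative assumptions, not part of sglang. Whether pinning helps, and which interface NCCL should use, depends on the network setup.

```python
import os
import socket


def list_interfaces():
    """Return the names of the network interfaces visible on this host."""
    return [name for _, name in socket.if_nameindex()]


def build_launch_env(gloo_ifname, nccl_ifname, base_env=None):
    """Copy base_env (default: os.environ) and pin the gloo and NCCL sockets.

    Hypothetical helper for illustration; it only sets the two variables
    mentioned in this thread and does not check that the interfaces exist.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GLOO_SOCKET_IFNAME"] = gloo_ifname
    env["NCCL_SOCKET_IFNAME"] = nccl_ifname
    return env


if __name__ == "__main__":
    print("interfaces:", list_interfaces())
    # eth0 is the controller network in this report; using it for both
    # variables is an assumption made for the example.
    env = build_launch_env("eth0", "eth0")
    print(env["GLOO_SOCKET_IFNAME"], env["NCCL_SOCKET_IFNAME"])
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when starting the server on each node, which keeps the settings scoped to that one process.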

db24 commented 2 months ago

I solved it by running Docker with the args "--shm-size=1g --ulimit memlock=-1" (ref: https://help.aliyun.com/zh/egs/support/faq-1). I guess it is not related to sglang; it is purely an NCCL issue.
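To see whether those two Docker settings took effect, a small Python check run inside the container can report the shared-memory size and the locked-memory limit. This is a minimal sketch using only the standard library on Linux; the function names are hypothetical, and it does not verify that NCCL will work.

```python
import os
import resource
import shutil


def shm_size_bytes(path="/dev/shm"):
    """Return the total size in bytes of the filesystem that holds path."""
    return shutil.disk_usage(path).total


def memlock_limit():
    """Return the (soft, hard) RLIMIT_MEMLOCK values for this process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def describe_limit(value):
    """Render a limit as 'unlimited' or a byte count."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


if __name__ == "__main__":
    if os.path.isdir("/dev/shm"):
        print("/dev/shm size:", shm_size_bytes(), "bytes")
    soft, hard = memlock_limit()
    print("memlock soft:", describe_limit(soft), "hard:", describe_limit(hard))
```

With "--shm-size=1g --ulimit memlock=-1" one would expect /dev/shm to report about 1 GiB and both memlock values to read as unlimited.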

github-actions[bot] commented 2 weeks ago

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.