[Llama-2-13b-chat-hf] IPv6 Network Address Retrieval Error on 4 V100s 16GB

sksq96 commented 1 year ago

Hello,

I'm encountering an issue while running the following code:

from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", tensor_parallel_size=2)

The hardware I'm using is 4 V100s with 16GB each. The error I'm receiving is as follows:

(Worker pid=22514) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 16516) cannot be retrieved (gai error: -2 - Name or service not known).
(Worker pid=22513) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 16516) cannot be retrieved (gai error: -2 - Name or service not known). [repeated 10x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(Worker pid=22513) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 16516) cannot be retrieved (gai error: -2 - Name or service not known). [repeated 10x across cluster]

Any help or guidance on how to resolve this issue would be greatly appreciated.

Thank you

efraisse commented 1 year ago

@sksq96 Here is what somehow worked for me (edit: issue was fixed as per @wolegechu's comment)

ahernandezSecurityScorecard commented 1 year ago

But do you know what this is referring to? Because it is not working for me.

efraisse commented 1 year ago

but do you know what this is referring to? Because it is not working for me.

Edit: the issue was fixed as per @wolegechu's comment. I am not entirely certain. I checked this thread: https://github.com/pytorch/pytorch/issues/74824, where someone said "For some reason, MASTER_ADDR is not being set correctly under elastic/agent/server/api.py, The IP address is being encoded but not decoded? Looping in Pytorch R2P team to investigate", so I assumed that was the underlying issue. I don't think my solution is a good one either. It was just something that worked for me, so I hoped the ugly workaround could work for others as well.

imoneoi commented 1 year ago

@sksq96 I had the same problem on 4 GPUs with exactly the same error message. Running on a single GPU and disabling Ray works around it temporarily.

wolegechu commented 1 year ago

There was a non-deterministic bug in the old implementation of the initialize_cluster method, but it has been fixed. Simply build and use the latest version from the main branch.

ahernandezSecurityScorecard commented 1 year ago

I ran into a situation where I got either the "Name or service not known" error or a second, different error. If I add the following beforehand (which solves the second error), then whenever I don't get the "Name or service not known" error, it works! export NCCL_IGNORE_DISABLED_P2P=1
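For scripts where a shell export is inconvenient, the same variable can be set from Python before vLLM is imported. This is only a minimal sketch: `apply_nccl_workaround` is a hypothetical helper name, not part of vLLM or NCCL, and whether the setting helps at all seems to depend on the setup.

```python
import os


def apply_nccl_workaround(env=None):
    """Set NCCL_IGNORE_DISABLED_P2P=1 unless it is already set.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    Returns the value that ends up in the environment.
    """
    env = os.environ if env is None else env
    # setdefault keeps a value the user exported in the shell.
    env.setdefault("NCCL_IGNORE_DISABLED_P2P", "1")
    return env["NCCL_IGNORE_DISABLED_P2P"]


if __name__ == "__main__":
    value = apply_nccl_workaround()
    print(f"NCCL_IGNORE_DISABLED_P2P={value}")
    # Import vllm and construct the LLM only after this point,
    # so the variable is already in the process environment.
```

Using `setdefault` rather than plain assignment is a design choice here: it leaves a value exported in the shell untouched.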

xxm1668 commented 1 year ago

(Worker pid=2693738) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 12693) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).

I added export NCCL_IGNORE_DISABLED_P2P=1, but it does not work! I need help.

Laych7 commented 1 year ago

On the first run, GPU memory was allocated, but the launch process was quickly terminated automatically. Subsequent runs produced the same IPv6 error message:

import os

from vllm import LLM, SamplingParams

# Apply the NCCL workaround from the comments above before constructing the engine.
os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"
llm = LLM(model="model/Llama-2-13b-chat-hf", tensor_parallel_size=2, tokenizer="hf-internal-testing/llama-tokenizer")

Error:

(Worker pid=2842646) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 15051) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
(Worker pid=2842646) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 15051) cannot be retrieved (gai error: -3 - Temporary failure in name resolution). [repeated 12x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(Worker pid=2842646) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 15051) cannot be retrieved (gai error: -3 - Temporary failure in name resolution). [repeated 10x across cluster]

Not only does the Llama model fail; everything else fails too.

xxm1668 commented 1 year ago

export NCCL_IGNORE_DISABLED_P2P=1; after waiting a while, it started working. That is how it went for me.

Laych7 commented 1 year ago

OK, it works for me now too.

BaileyWei commented 1 year ago

Thanks a lot! It finally works!

I ran into a situation where I got either the "Name or service not known" error or a second, different error. If I add the following beforehand (which solves the second error), then whenever I don't get the "Name or service not known" error, it works! export NCCL_IGNORE_DISABLED_P2P=1

imoneoi commented 1 year ago

Had the same issue. Also solved by export NCCL_IGNORE_DISABLED_P2P=1

zhuohan123 commented 1 year ago

Please refer to #645 for the fix.

SuperBruceJia commented 8 months ago

export NCCL_IGNORE_DISABLED_P2P=1 didn't work for me.

Also, I can't add the 127.0.0.1 __internal_head__ entry to /etc/hosts, because I unfortunately don't have sudo access to that file.

Is there any way to solve the problem?

Thank you very much in advance!
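One way to check, without sudo, whether the name from the warning resolves on a given machine is to repeat the lookup from Python. The following is a diagnostic sketch only, not a fix: `check_resolution` is a hypothetical helper, and it assumes the "gai error" in the log is the code returned by the system's getaddrinfo call, which Python exposes as `socket.gaierror`.

```python
import socket
import sys


def check_resolution(host, port, family=socket.AF_UNSPEC):
    """Try to resolve (host, port) the way a TCP client would.

    Returns (True, [addresses]) on success, or
    (False, (error_code, message)) if getaddrinfo fails.
    """
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return False, (exc.errno, exc.strerror)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return True, sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Usage: python check_resolution.py __internal_head__ localhost
    for name in sys.argv[1:]:
        print(name, check_resolution(name, 16516))
```

Passing `socket.AF_INET6` as `family` restricts the lookup to IPv6, which is the address family the warning mentions. The port 16516 is simply the one from the first log in this thread.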

vllm-project/vllm #570