vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Llama-2-13b-chat-hf] IPv6 Network Address Retrieval Error on 4 V100s 16GB #570

Closed sksq96 closed 1 year ago

sksq96 commented 1 year ago

Hello,

I'm encountering an issue while running the following code:

from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", tensor_parallel_size=2)

The hardware I'm using is 4 V100s with 16GB each. The error I'm receiving is as follows:

(Worker pid=22514) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 16516) cannot be retrieved (gai error: -2 - Name or service not known).
(Worker pid=22513) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 16516) cannot be retrieved (gai error: -2 - Name or service not known). [repeated 10x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(Worker pid=22513) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 16516) cannot be retrieved (gai error: -2 - Name or service not known). [repeated 10x across cluster]

Any help or guidance on how to resolve this issue would be greatly appreciated.

Thank you
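
The warning comes from a failed name lookup (the "gai error" codes are getaddrinfo return codes), so one quick check is whether the host name from the log resolves on the machine at all. Below is a minimal sketch using only Python's standard library. The helper name can_resolve is made up for illustration, and a failed lookup only confirms the symptom, not its cause.

```python
import socket


def can_resolve(hostname: str, port: int = 0) -> bool:
    """Return True if getaddrinfo can resolve `hostname` on this machine."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror as err:
        # The exception carries the getaddrinfo error code and message.
        print(f"cannot resolve {hostname!r}: {err}")
        return False


if __name__ == "__main__":
    # "__internal_head__" is the host name taken from the warning above.
    for name in ("localhost", "__internal_head__"):
        print(name, "->", can_resolve(name))
```

If the second name fails to resolve while localhost succeeds, the machine simply has no mapping for that name, which matches the warning text.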

efraisse commented 1 year ago

@sksq96 Here is what somehow worked for me (edit: issue was fixed as per @wolegechu's comment)

ahernandezSecurityScorecard commented 1 year ago

But do you know what this is referring to? It is not working for me.

efraisse commented 1 year ago

Edit: the issue was fixed as per @wolegechu's comment.

I am not entirely certain. I checked out this thread: https://github.com/pytorch/pytorch/issues/74824, where someone said "For some reason, MASTER_ADDR is not being set correctly under elastic/agent/server/api.py, The IP address is being encoded but not decoded? Looping in Pytorch R2P team to investigate", so I assumed that was the underlying issue. I don't think my solution is a good one either. It was just something that worked for me, so I hoped the ugly workaround could work for others as well.

imoneoi commented 1 year ago

@sksq96 I had the same problem on 4 GPUs with exactly the same error message. Running on a single GPU and disabling Ray works around it temporarily.

wolegechu commented 1 year ago

There was a non-deterministic bug in the old implementation of the initialize_cluster method, but it has been fixed. Simply build and use the latest version from the main branch.
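
After rebuilding from the main branch, it can be worth confirming which vllm build the Python environment actually picks up. The sketch below relies only on the standard library. The helper name installed_version is hypothetical, and since the thread does not say which release contains the fix, it only reports the version string.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("vllm:", installed_version("vllm"))
```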

ahernandezSecurityScorecard commented 1 year ago

I ran into a situation where I got either the "Name or service not known" error or another error. If I set the following beforehand (which solves the second error), then whenever I don't get the "Name or service not known" error, it works:

export NCCL_IGNORE_DISABLED_P2P=1

xxm1668 commented 1 year ago

(Worker pid=2693738) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 12693) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).

I added export NCCL_IGNORE_DISABLED_P2P=1, but it does not work. I need help.

Laych7 commented 1 year ago

On the first run, GPU memory was allocated, but the launched process was quickly cleared automatically. Subsequent runs failed with the same IPv6 error message:

import os

from vllm import LLM, SamplingParams

# Set the variable before the engine and its workers are created.
os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"
llm = LLM(model="model/Llama-2-13b-chat-hf", tensor_parallel_size=2, tokenizer="hf-internal-testing/llama-tokenizer")

Error:

(Worker pid=2842646) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 15051) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
(Worker pid=2842646) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 15051) cannot be retrieved (gai error: -3 - Temporary failure in name resolution). [repeated 12x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(Worker pid=2842646) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 15051) cannot be retrieved (gai error: -3 - Temporary failure in name resolution). [repeated 10x across cluster]

Not only does the Llama model fail; everything else fails as well.

xxm1668 commented 1 year ago

export NCCL_IGNORE_DISABLED_P2P=1; after waiting a while, it worked. That is how it went for me.

Laych7 commented 1 year ago

OK, it works for me now too.

BaileyWei commented 1 year ago

thanks a lot!!!! it works finally!!!!!!

imoneoi commented 1 year ago

Had the same issue. Also solved by export NCCL_IGNORE_DISABLED_P2P=1

zhuohan123 commented 1 year ago

Please refer to #645 for the fix.

SuperBruceJia commented 8 months ago

export NCCL_IGNORE_DISABLED_P2P=1 didn't work for me.

Besides, I can't add the 127.0.0.1 __internal_head__ entry, because unfortunately I don't have sudo access to the /etc/hosts file.

Is there any way to solve the problem?

Thank you very much in advance!
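
Before looking for alternatives, it may help to confirm whether the hosts file already maps the name and whether the current user can write to it. Below is a small diagnostic sketch, assuming the usual hosts-file layout (an address followed by one or more names, with # starting a comment). The function name hosts_has_name is hypothetical, and the sketch only reports the current state without changing anything.

```python
import os


def hosts_has_name(hosts_text: str, hostname: str) -> bool:
    """Return True if any non-comment line of a hosts file lists `hostname`."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        # fields[0] is the address; the remaining fields are host names.
        if len(fields) >= 2 and hostname in fields[1:]:
            return True
    return False


if __name__ == "__main__":
    path = "/etc/hosts"
    with open(path, encoding="utf-8") as fh:
        print("entry present:", hosts_has_name(fh.read(), "__internal_head__"))
    print("writable by this user:", os.access(path, os.W_OK))
```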