Closed sksq96 closed 1 year ago
@sksq96 Here is what somehow worked for me (edit: issue was fixed as per @wolegechu's comment)
but do you know what this is referring to? because it is not working for me
edit: issue was fixed as per @wolegechu. I am not entirely certain. I checked out this thread: https://github.com/pytorch/pytorch/issues/74824, where someone said "For some reason, MASTER_ADDR is not being set correctly under elastic/agent/server/api.py, The IP address is being encoded but not decoded? Looping in Pytorch R2P team to investigate", so I assumed that was the underlying issue. I don't think my solution is a particularly good one either; it was just something that worked for me, so I hoped the ugly workaround could potentially work for others as well.
@sksq96 I had the same problem on 4 GPUs with exactly the same error message. Running on a single GPU and disabling Ray can temporarily work around it.
There was a non-deterministic bug in the old implementation of the initialize_cluster
method, but it has been fixed. Simply build and use the latest version from the main branch.
I did run into a situation where I got either the "Name or service not known" error or this one. If I add the following beforehand (which fixes the second error), then whenever I don't get the "Name or service not known" error, it works! export NCCL_IGNORE_DISABLED_P2P=1
(Worker pid=2693738) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 12693) cannot be retrieved (gai error: -3 - Temporary failure in name resolution). I added export NCCL_IGNORE_DISABLED_P2P=1, but it does not work! I need help
On the first run, GPU memory was allocated, but the launch process was quickly cleaned up automatically. Subsequent runs produced the same IPv6 error message:
import os
# Set before importing vLLM so worker processes inherit the variable.
os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="model/Llama-2-13b-chat-hf", tensor_parallel_size=2, tokenizer="hf-internal-testing/llama-tokenizer")
Error:
(Worker pid=2842646) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 15051) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
(Worker pid=2842646) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 15051) cannot be retrieved (gai error: -3 - Temporary failure in name resolution). [repeated 12x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(Worker pid=2842646) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 15051) cannot be retrieved (gai error: -3 - Temporary failure in name resolution). [repeated 10x across cluster]
And it is not only the llama model that fails; everything else fails as well.
export NCCL_IGNORE_DISABLED_P2P=1; then wait a while and it works. That's how it went for me.
ok, it works for me now too
thanks a lot!!!! it finally works!!!!!!
I did run into a situation where I got either the "Name or service not known" error or this one. If I add the following beforehand (which fixes the second error), then whenever I don't get the "Name or service not known" error, it works! export NCCL_IGNORE_DISABLED_P2P=1
Had the same issue. Also solved by export NCCL_IGNORE_DISABLED_P2P=1
Please refer to #645 for the fix.
export NCCL_IGNORE_DISABLED_P2P=1
didn't work for me. Besides, I don't have sudo access to add the 127.0.0.1 __internal_head__ entry to the /etc/hosts file, unfortunately.
Is there any way to solve the problem?
Thank you very much in advance!
Hello,
I'm encountering an issue while running the following code:
The hardware I'm using is 4 V100s with 16GB each. The error I'm receiving is as follows:
Any help or guidance on how to resolve this issue would be greatly appreciated.
Thank you