vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

The IPv6 network addresses of (__internal_head__, 18566) cannot be retrieved #645

Closed. yinochaos closed this issue 1 year ago.

yinochaos commented 1 year ago

I run this code:

from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=4)

and get these errors:

(Worker pid=816915) [W socket.cpp:601] [c10d] The IPv6 network addresses of (__internal_head__, 18566) cannot be retrieved (gai error: -2 - Name or service not known). [repeated 20x across cluster]

Following issue #570, I set `export NCCL_IGNORE_DISABLED_P2P=1`, waited about 8 minutes, and ran the code again, but the error above still happens. Any help or guidance on how to resolve this issue would be greatly appreciated.
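For anyone reproducing this, note that the flag has to be visible to the spawned worker processes. A minimal sketch of an equivalent way to set it from Python, before vLLM initializes:

import os

# Equivalent to `export NCCL_IGNORE_DISABLED_P2P=1` in the shell; set it
# before constructing the LLM so the spawned workers inherit the variable.
os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"

from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=4)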

Thank you

yinochaos commented 1 year ago

Similar to AWS, it's TikTok's cloud platform, called "Volcengine", that I run the code on.

The GPUs are A800 80G. `nvidia-smi` output:

| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800-SXM...  On   | 00000000:69:01.0 Off |                    0 |
| N/A   31C    P0   107W / 400W |      3MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM...  On   | 00000000:69:02.0 Off |                    0 |
| N/A   31C    P0    61W / 400W |      3MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A800-SXM...  On   | 00000000:6B:01.0 Off |                    0 |
| N/A   31C    P0    65W / 400W |      3MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A800-SXM...  On   | 00000000:6B:02.0 Off |                    0 |
| N/A   30C    P0    62W / 400W |      3MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

imoneoi commented 1 year ago

+1. I have the same problem even when using the flag.

ktrapeznikov commented 1 year ago

same issue here

zhejiangyyf commented 1 year ago

+1, me too

KYLN24 commented 1 year ago

+1, are there any solutions?

HermitSun commented 1 year ago

As https://github.com/pytorch/pytorch/issues/74824#issuecomment-1500144250 says, you can try adding your IP and `__internal_head__` to /etc/hosts; this works for me.

For example:

127.0.0.1 __internal_head__
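If you prefer to script the edit, a minimal sketch in Python (assumes root access, and that 127.0.0.1 is the right address, i.e. the head node is the local machine):

# Append the mapping so the placeholder hostname resolves; requires root.
with open("/etc/hosts", "a") as f:
    f.write("127.0.0.1 __internal_head__\n")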
yinochaos commented 1 year ago

> As pytorch/pytorch#74824 (comment) says, you can try adding your IP and `__internal_head__` to /etc/hosts; this works for me.
>
> For example:
>
> 127.0.0.1 __internal_head__

Thank you, this works for me too!!! : )

SuperBruceJia commented 10 months ago

export NCCL_IGNORE_DISABLED_P2P=1 didn't work for me.

Besides, to add the `127.0.0.1 __internal_head__` entry, I unfortunately don't have sudo access to the /etc/hosts file.

Is there any way to solve the problem?

Thank you very much in advance!

Maxppddcsz commented 10 months ago

> export NCCL_IGNORE_DISABLED_P2P=1 didn't work for me.
>
> Besides, to add the `127.0.0.1 __internal_head__` entry, I unfortunately don't have sudo access to the /etc/hosts file.
>
> Is there any way to solve the problem?
>
> Thank you very much in advance!

@SuperBruceJia I have the same question, did you figure it out?

SuperBruceJia commented 10 months ago

> export NCCL_IGNORE_DISABLED_P2P=1 didn't work for me. Besides, to add the `127.0.0.1 __internal_head__` entry, I unfortunately don't have sudo access to the /etc/hosts file. Is there any way to solve the problem? Thank you very much in advance!
>
> @SuperBruceJia I have the same question, did you figure it out?

I'm sorry to say that I haven't found a solution yet. For now, I'm utilizing only one GPU for inference.

You could try upgrading vllm to see if it resolves the issue, e.g., vllm-0.1.6 or vllm-0.2.0.

Good luck!

Best regards,

Shuyue

Dec. 30th, 2023

allendred commented 10 months ago

> export NCCL_IGNORE_DISABLED_P2P=1 didn't work for me. Besides, to add the `127.0.0.1 __internal_head__` entry, I unfortunately don't have sudo access to the /etc/hosts file. Is there any way to solve the problem? Thank you very much in advance!
>
> @SuperBruceJia I have the same question, did you figure it out?
>
> I'm sorry to say that I haven't found a solution yet. For now, I'm utilizing only one GPU for inference.
>
> You could try upgrading vllm to see if it resolves the issue, e.g., vllm-0.1.6 or vllm-0.2.0.
>
> Good luck!
>
> Best regards,
>
> Shuyue
>
> Dec. 30th, 2023

It's not working on 0.2.7 either.

SuperBruceJia commented 10 months ago

> export NCCL_IGNORE_DISABLED_P2P=1 didn't work for me. Besides, to add the `127.0.0.1 __internal_head__` entry, I unfortunately don't have sudo access to the /etc/hosts file. Is there any way to solve the problem? Thank you very much in advance!
>
> @SuperBruceJia I have the same question, did you figure it out?
>
> I'm sorry to say that I haven't found a solution yet. For now, I'm utilizing only one GPU for inference. You could try upgrading vllm to see if it resolves the issue, e.g., vllm-0.1.6 or vllm-0.2.0. Good luck! Best regards, Shuyue Dec. 30th, 2023
>
> It's not working on 0.2.7 either.

I'm sorry to say that I haven't found a solution yet. I will let you know if I can figure it out.

Best regards,

Shuyue

Jan. 15th, 2024

e-cal commented 7 months ago

@SuperBruceJia @allendred You should be able to just use ~/.hosts if you don't have sudo access to /etc/hosts. For example:

127.0.0.1 __internal_head__
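A sketch of how that can be wired up via glibc's HOSTALIASES mechanism (an assumption that this is the route meant here; note that HOSTALIASES entries map an alias to a hostname rather than to an IP address, and only affect glibc hostname lookups):

import os

# No sudo needed: glibc consults the file named by HOSTALIASES when
# resolving unqualified hostnames. Entries are `alias hostname` pairs.
hosts_file = os.path.expanduser("~/.hosts")
with open(hosts_file, "w") as f:
    f.write("__internal_head__ localhost\n")

# Set this before name resolution happens, i.e. before vLLM spawns its workers.
os.environ["HOSTALIASES"] = hosts_file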