
DDP training cannot accept a subnet (link-local) IPv6 address #108047

Open QSHLGZ opened 1 year ago

QSHLGZ commented 1 year ago

🐛 Describe the bug

When I pass a normal IPv6 address [fe80::4315:8136:2e6:13f8] to torch.distributed.run, the start command is shown below:

python -m torch.distributed.launch --rdzv_backend=static --nnodes=1 --nproc_per_node=4 --rdzv_endpoint=[fe80::121b:54ff:fe0f:41d3]:29500 testipv6.py

It cannot connect, even though I can ping this address successfully. At first, I thought IPv6 addresses were not supported by the elastic launcher, but when I pass [::1] to the script, the connection succeeds and the training starts and finishes. So I turned on the debug logs with:

export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL

Then it gave me errors like the ones below:

[I socket.cpp:454] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:504] [c10d - debug] The server socket is attempting to listen on [::]:29500.
[I socket.cpp:578] [c10d] The server socket has started to listen on [::]:29500.
[I TCPStore.cpp:252] [c10d - debug] The server has started on port = 29500.
[I socket.cpp:691] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (fe80::4315:8136:2e6:13f8, 29500).
[I socket.cpp:763] [c10d - trace] The client socket is attempting to connect to [xxx]:29500.
[W socket.cpp:665] [c10d] The client socket has failed to connect to [xxx]:29500 (errno: 22 - Invalid argument).
[I socket.cpp:700] [c10d - debug] The client socket will attempt to connect to an IPv4 address of (fe80::4315:8136:2e6:13f8, 29500).
[W socket.cpp:665] [c10d] The IPv4 network addresses of (fe80::4315:8136:2e6:13f8, 29500) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
[W socket.cpp:665] [c10d] The IPv4 network addresses of (fe80::4315:8136:2e6:13f8, 29500) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
[W socket.cpp:665] [c10d] The IPv4 network addresses of (fe80::4315:8136:2e6:13f8, 29500) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
[W socket.cpp:665] [c10d] The IPv4 network addresses of (fe80::4315:8136:2e6:13f8, 29500) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
[W socket.cpp:665] [c10d] The IPv4 network addresses of (fe80::4315:8136:2e6:13f8, 29500) cannot be retrieved (gai error: -9 - Address family for hostname not supported).

From the error log shown above, I could not understand why the IPv6 address is an invalid argument, so I looked at the latest code at the line that raised the error. The relevant code in socket.cpp is shown below:

SocketConnectOp::ConnectResult SocketConnectOp::tryConnectCore(
    const ::addrinfo& addr) {
  int r = ::connect(socket_->handle(), addr.ai_addr, addr.ai_addrlen);

This showed that the error comes from the connect() call. To debug more easily, I extracted this code into a standalone .cpp file to find out why the address I passed is an invalid argument. The test .cpp program is shown below:

#include <iostream>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <cstring>
#include <net/if.h>
int main() {
    int sockfd = socket(AF_INET6, SOCK_STREAM, 0);
    if (sockfd == -1) {
        std::cerr << "Socket creation failed." << std::endl;
        return 1;
    }
    sockaddr_in6 serverAddr;
    std::memset(&serverAddr, 0, sizeof(serverAddr)); // zero the struct so sin6_flowinfo/sin6_scope_id start at 0
    serverAddr.sin6_family = AF_INET6;
    serverAddr.sin6_port = htons(29499); // Port number
    //serverAddr.sin6_scope_id = if_nametoindex("eno1"); // the fix added later in this report
    if (inet_pton(AF_INET6, "fe80::4315:8136:2e6:13f8", &serverAddr.sin6_addr) <= 0) {
        std::cout << "Invalid IPv6 address." << std::endl;
        return 1;
    }

    if (connect(sockfd, (struct sockaddr*)&serverAddr, sizeof(serverAddr)) == -1) {
        perror("Connect failed");
        return 1;
    }
    std::cout << "Connected to the server!" << std::endl;
    // Close the socket
    close(sockfd);
    return 0;
}

In this program, I also tried a different IPv6 address, "2001:0db8:85a3:0000:0000:8a2e:0370:7334"; this time it gave me a different error:

Connect failed: Network is unreachable

Then I realized that the IP address I passed might have something special about it that caused this error: it is a link-local (subnet) address starting with fe80. When I tried other IP addresses not starting with fe80, the "invalid argument" error disappeared and the new error was "network is unreachable". So I think the current code cannot handle a link-local IPv6 address. Therefore, I added this line to the test code:

serverAddr.sin6_scope_id = if_nametoindex("eno1");

Running the .cpp program again, the result is shown below:

Connect failed: Connection refused

At this step, I realized that the address now passed validation and the socket actually tried to connect to the IP but was refused, presumably because nothing was listening on that port. So I changed the original code in torch's socket.cpp in the same way (the patch was attached as a screenshot). Then I recompiled the code and ran the elastic launch command again. This time, my training script started and finished. Thus, I concluded that the current source code does not support a link-local (subnet) IPv6 address.
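
Since the screenshot is not visible in this text version, here is a minimal sketch of the kind of change I mean (not the exact patch from the image; the interface name "eno1" is a hardcoded assumption valid only on my master node): give a zone-less link-local IPv6 sockaddr a scope id before calling connect().

#include <cstring>
#include <net/if.h>        // if_nametoindex
#include <netinet/in.h>    // sockaddr_in6, IN6_IS_ADDR_LINKLOCAL
#include <sys/socket.h>    // sockaddr_storage

// Sketch only (not the exact code from the screenshot): if the destination is a
// link-local IPv6 address that has no zone yet, fill in a scope id so the kernel
// knows which interface to use. "eno1" is an assumption for my node only.
static void add_scope_if_link_local(::sockaddr_storage& ss, const char* ifname) {
  if (ss.ss_family != AF_INET6) {
    return;
  }
  auto* sin6 = reinterpret_cast<::sockaddr_in6*>(&ss);
  if (IN6_IS_ADDR_LINKLOCAL(&sin6->sin6_addr) && sin6->sin6_scope_id == 0) {
    sin6->sin6_scope_id = ::if_nametoindex(ifname); // e.g. "eno1"
  }
}

In tryConnectCore this would mean copying addr.ai_addr into a sockaddr_storage, applying the fix-up, and passing the copy to connect(). A proper fix would of course take the interface name or zone index from the rendezvous endpoint instead of hardcoding it.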

Versions

[pip3] flake8==6.0.0
[pip3] flake8-bugbear==23.3.23
[pip3] flake8-comprehensions==3.11.1
[pip3] flake8-executable==2.1.3
[pip3] flake8-logging-format==0.9.0
[pip3] flake8-pyi==23.3.1
[pip3] flake8-simplify==0.19.3
[pip3] mypy==0.960
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.23.1
[pip3] torch==2.1.0.dev20230810+cpu
[pip3] torchvision==0.15.1
[conda] numpy 1.23.1 pypi_0 pypi
[conda] torchvision 0.15.1 pypi_0 pypi

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu

tringwald commented 1 year ago

I guess the main problem is that an IPv6 link-local address is not (globally) unique. In fact, a single machine can reach the same link-local address via different network interfaces, which may or may not represent the same target machine. If a machine has two network interfaces, there could be a fe80::1111 on the network segment connected to eth0 and a fe80::1111 on eth1. That's why you would need to specify a zone index (e.g. fe80::1111%eth0) on each machine trying to connect. But the zone index might be different for every client, so a link-local address is not really useful as a rendezvous endpoint.
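
To see why the zone index is inherently local to a machine, you can just list the interfaces and their indices on each host; a fe80:: destination only becomes routable once it is paired with one of these purely local indices. A small standalone illustration (not PyTorch code):

#include <cstdio>
#include <net/if.h>   // if_nameindex, if_freenameindex

// List every local network interface together with its index. The sin6_scope_id
// needed to reach a fe80:: (link-local) address is exactly one of these indices,
// so it is only meaningful on the host where it was obtained.
int main() {
  struct if_nameindex* ifs = if_nameindex();
  if (ifs == nullptr) {
    std::perror("if_nameindex");
    return 1;
  }
  for (struct if_nameindex* p = ifs; p->if_index != 0 && p->if_name != nullptr; ++p) {
    std::printf("%u\t%s\n", p->if_index, p->if_name);
  }
  if_freenameindex(ifs);
  return 0;
}

Two machines will generally print different tables, which is why fe80::1111%eth0 written down on the master does not mean anything useful to a client whose relevant interface is called eno1.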

QSHLGZ commented 1 year ago

fe80::1111%eth0 cannot be parsed by torch.distributed.launch.

tringwald commented 1 year ago


Right now it cannot be parsed; I'm just saying that it would be necessary. It would also be essential to provide a zone index for all of the connecting clients, which is probably why this isn't supported.
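
For reference, getaddrinfo itself can already translate the %zone suffix into a scope id, so the hard part is less the parsing than the per-client configuration. A minimal standalone sketch (the address fe80::1%lo and the port are placeholders, not PyTorch code):

#include <cstdio>
#include <netdb.h>        // getaddrinfo, freeaddrinfo, gai_strerror
#include <netinet/in.h>   // sockaddr_in6
#include <sys/socket.h>

// Ask getaddrinfo to parse a link-local address with a zone suffix and print the
// scope id it filled in. "fe80::1%lo" is just a placeholder; in practice it would
// be the rendezvous address plus the local interface name of *this* machine.
int main() {
  ::addrinfo hints{};
  hints.ai_family = AF_INET6;
  hints.ai_socktype = SOCK_STREAM;
  hints.ai_flags = AI_NUMERICHOST;

  ::addrinfo* res = nullptr;
  int rc = ::getaddrinfo("fe80::1%lo", "29500", &hints, &res);
  if (rc != 0) {
    std::fprintf(stderr, "getaddrinfo: %s\n", ::gai_strerror(rc));
    return 1;
  }
  const auto* sin6 = reinterpret_cast<const ::sockaddr_in6*>(res->ai_addr);
  std::printf("scope id = %u\n", sin6->sin6_scope_id); // the interface index of "lo"
  ::freeaddrinfo(res);
  return 0;
}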

QSHLGZ commented 1 year ago

I don't think a zone index needs to be provided for all of the connecting clients; only the zone index for the master address/port is needed. In the code I showed above, I hardcoded the zone index as "eno1", which is the interface name on the master node, while the interface names on the other nodes are not "eno1". However, the multi-node training still worked, which supports the argument that only the zone index of the master endpoint is needed. In theory this also makes sense: when the other nodes communicate with the master node, they only need to know the master node's network information in order to connect to it.

tringwald commented 1 year ago

The example in the OP is executed with --nnodes=1 --nproc_per_node=4, so only a single node is used. All processes on that node default to the hardcoded rendezvous endpoint fe80::4315:8136:2e6:13f8%eno1. If there was a second node, it'd also need to connect to fe80::4315:8136:2e6:13f8 but with a zone index that corresponds to the network interface of the second node that can reach the rendezvous endpoint of the master node. However, the second node might not even have an interface eno1, but maybe eth0 instead[^1]. So now, you'd need to provide a zone index for every client, which is highly impractical. Using a globally routable IPv6 (or IPv4) is just way easier and doesn't require further configuration.

[^1]: If the second node also has a network interface eno1 that can reach the link-local rendezvous address of the master node, this might actually work.

QSHLGZ commented 1 year ago


Thanks for your reply. I thought I had tried two-node training, but actually I had not. I will run a two-node training to verify this.