n0-computer / iroh

peer-2-peer that just works
https://iroh.computer
Apache License 2.0

Enhancing iroh's Hole Punching Success Rate #2317

Closed: zh522130 closed this issue 1 month ago

zh522130 commented 6 months ago

My test environment includes two computers, C1 and C2, each behind a home router that has UPnP enabled and a dynamic public IP address. Additionally, there are two mobile phones, P1 and P2, using mobile networks.

When testing with Tailscale, except for the requirement for mobile phones P1 and P2 to connect through a relay, all other device-to-device connections are made directly.

When I conduct tests using iroh, direct connections are only achievable when both computers C1 and C2 are using networks with dynamic public IP addresses. If I connect either C1 or C2 to the network via a shared connection from a mobile phone, the connection must be established through a relay.

I performed the same test with Tailscale, and even if one end is connected via a shared mobile network, as long as one end is on a network with a dynamic public IP, there is only a short initial period where the connection might go through a relay before quickly switching to a direct connection.

My test code uses DnsDiscovery and connects via endpoint.connect_by_node_id. I noticed that the client reports an error every time it starts: ERROR mainline::rpc: Could not bootstrap the routing table. I implemented a loop that sends messages from the client to the server and back every second, while also printing the current connection type (conn_type). The printed information is as follows: connect type: Mixed(10.0.0.222:56394, RelayUrl("https://xxxxx.:12346/")), where the IP part alternates between a local network IP and a public IP.

flub commented 6 months ago

Thanks for the report! I love an opportunity to improve our holepunching from real-world situations.

> My test environment includes two computers, C1 and C2, both equipped with dynamic public IP addresses, situated behind a home router that has UPnP enabled. Additionally, there are two mobile phones, P1 and P2, using mobile networks.
>
> When testing with Tailscale, except for the requirement for mobile phones P1 and P2 to connect through a relay, all other device-to-device connections are made directly.
>
> When I conduct tests using iroh, direct connections are only achievable when both computers C1 and C2 are using networks with dynamic public IP addresses. If I connect either C1 or C2 to the network via a shared connection from a mobile phone, the connection must be established through a relay.

I'm not sure I fully understand the network layout yet. Are C1 & C2 on the same local network or different ones?

What works:

- C1 -> NAT -> NAT -> C2 (and reverse)

What doesn't work:

- C1 -> P1 -> NAT -> C2 (and reverse)
- C1 -> P1 -> P2 -> C2 (and reverse)

Is that correct? I assume C1 -> P1 is done by P1 having a wifi hotspot and thus acting as a router and C1 being connected to the hotspot's subnet? So when both P1 & P2 are used there are two different subnets?

> I noticed that the client reports an error every time it starts: ERROR mainline::rpc: Could not bootstrap the routing table.

This part is an unfortunate mistake in the current release. It should never try to connect to the mainline DHT and this is fixed for the next release. So this error is entirely harmless and does not affect any functionality.

> I implemented a loop that sends messages from the client to the server and back every second, while also printing the current connection type (conn_type). The printed information is as follows: connect type: Mixed(10.0.0.222:56394, RelayUrl("https://xxxxx.:12346/")), where the IP part alternates between a local network IP and a public IP.

Could you make the test code available? I would also really appreciate it if you could provide us with the full DEBUG-level logs of both nodes. If you prefer to share those more privately, you could email them or use whatever channel you're comfortable with. It would be great if we could work together to improve the holepunching for your situation!

zh522130 commented 6 months ago

@flub Your grasp of the network topology is spot on; C1 and C2 are on separate networks, each with its own dynamic public IP. I have not conducted tests for the scenario C1 -> P1 -> P2 -> C2, as in this setup, Tailscale would also use a relay connection.

I have attached the code and logs to this message, where I have replaced some real IPs and domain names for confidentiality. Additionally, I've included an article on NAT hole punching that I recently encountered, and I'm uncertain whether it will aid in enhancing the efficiency of hole punching.

Attachments:

code and log.zip

A New Method for Symmetric NAT Traversal in UDP and TCP (1).pdf

zh522130 commented 6 months ago

For the C1 -> P1 -> NAT -> C2 scenario, I've looked into Tailscale's direct connections, and they do use UDP. My case is unusual because the C2 router has UPnP enabled and a public IP. I think the simplest method might be for C2 to open a TCP port through UPnP, allowing C1 to initiate a direct TCP connection.

Regarding the inability of the two mobile phones, P1 and P2, to connect directly: in addition to using a server relay, openp2p-cn can use a client that has a public IP and a router supporting automatic port mapping as a relay (with that client user's consent). This approach helps with the high cost of central server resources and the significant latency of being far from the central server. (I'm in China, far away from your servers, haha. For self-hosting, using a client on home broadband as a relay server can also save a significant amount of money.)

However, this project is in a semi-open state: the client is continuously being updated, but the source code has not been updated further. I'm not promoting this project; I just think this approach is quite good for self-hosting.

zh522130 commented 6 months ago

I spent 4 hours retesting because I found that iroh doctor could also establish a direct connection in the network topology of C1 -> P -> NAT -> C2. The test results showed that the success rate of hole punching was related to my mobile phones.

I have two phones, a Xiaomi and an iPhone. When I use the iPhone to share Wi-Fi, iroh doctor can establish a direct connection every time. With the Xiaomi phone it depends on luck; sometimes it works, sometimes it doesn't. During each test, I might switch networks, connect to Tailscale for remote access, and then restart the server (if Tailscale is turned on, iroh will use Tailscale's direct connection, so I turn it off during testing). I'm not sure what specific operation suddenly enables a direct connection, but once a direct connection is established, it can be repeated every time (I continue to test several times after a successful direct connection). I switched phones many times and the behaviour remained tied to the phone.

The test code I previously uploaded often crashed automatically. So today, I used iroh-net/examples/listen.rs and connect.rs for testing, and the results matched those of the iroh doctor tests. However, Tailscale does not have this issue; after switching networks, regardless of which phone shares the Wi-Fi, Tailscale only needs to ping a few times to switch from relay to direct connection.

zh522130 commented 6 months ago

There has been new progress in the testing, and the situation is somewhat complex. The test involves 2 routers, 3 computers, and 2 phones.

In the previous tests, UPnP was always turned on, but I found that there seem to be some issues with the UPnP port mapping on the router I have been consistently testing with. I'm not sure if it's a router issue or a problem with igd_next, as I keep getting timeout errors when running the igd_next example code. The iroh log also shows upnp probe failed: IO error: search timed out. However, I did see some UPnP ports mapped by iroh in the router interface, but not the current ports in use by iroh. The other router, in contrast, maps ports much faster.

The conclusion is: in my network environment, the success rate of hole punching is highly dependent on whether UPnP is enabled or if there is a public IPv6 address available.

With UPnP turned on (mapping works as expected): C1 -> Phone (sharing WiFi) -> Router -> C2 always results in successful hole punching, even without a public IPv6 address.

With UPnP turned off (both the Router and Phone have public IPv6, and both C1 and C2 have IPv6 enabled): C1 -> Phone (sharing WiFi) -> Router -> C2 always hole punches successfully via IPv6.

~~With UPnP turned off (either the Router or Phone lacks a public IPv6, or either C1 or C2 has IPv6 disabled): The sequence C1 -> Phone (sharing WiFi) -> Router -> C2 always fails in hole punching and can only use a relay.~~

The retesting has once again produced different results. In the morning, with UPnP turned off and no IPv6, several attempts to establish a connection through mobile hotspot sharing failed. In the afternoon, even with UPnP still disabled and no IPv6, the results mirrored those from a few days prior; the iPhone hotspot was capable of establishing a connection, while the Xiaomi phone was not. I'm not sure what the issue is; the testing has been paused for now without any clear understanding or identifiable pattern.

flub commented 6 months ago

Apologies I still haven't been able to find the time to dig into your reports here. With some luck I might next week.

In theory iroh should be fine with or without upnp working, but it is true that most places don't have upnp working so upnp probably still has more bugs.

zh522130 commented 6 months ago

> Apologies I still haven't been able to find the time to dig into your reports here. With some luck I might next week.

Although the test results have shown some inconsistencies, sharing them might still provide some reference (though it could also lead to confusion).

If subsequent optimization of this issue requires testing, I can help.

> In theory iroh should be fine with or without upnp working,

Indeed, it should be so.

flub commented 4 months ago

> in addition to using a server relay, openp2p-cn can use a client that has a public IP and a router supporting automatic port mapping as a relay (with that client user's consent). This approach helps with the high cost of central server resources and the significant latency of being far from the central server. (I'm in China, far away from your servers, haha. For self-hosting, using a client on home broadband as a relay server can also save a significant amount of money.)

Our approach has been consciously to not do this automatically on all clients, or integrate it otherwise into a normal client. Instead we decided to let users who have the ability and want to run a relay do this explicitly by running their own relay server.

Everyone can run a relay server, and use it together with other relay servers. You need to add it to your Relay Map in the client configuration. The relay servers do not need to be aware of each other. We also publish the iroh-relay binary in our releases for this purpose.

Maybe this creates slightly more friction to running a relay server, but it is an important component to the connectivity, and letting any client participate as a relay would not result in the desired reliability and uptime for our goals. So we feel this extra friction is worth it.

flub commented 4 months ago

> A New Method for Symmetric NAT Traversal in UDP and TCP (1).pdf

The holepunching system we use, based on DERP (and I believe ICE does the same), employs a coordination server to have both clients send traffic at the same time, with the help of some information gained via STUN. This removes the need to guess the ports and addresses used by the NATs. It deals with symmetric NATs reliably. We should write down how we do this ourselves sometime, but it is still getting tweaks so it might be a bit early. In the meantime https://tailscale.com/blog/how-nat-traversal-works is probably the best description and includes a good overview of NATs in today's world.
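The "both sides send at the same time" idea can be sketched with plain UDP sockets. This is only an illustration under loud assumptions: the function name `send_probe_and_wait` and the `probe` payload are hypothetical, no coordination server, STUN, or NAT is modelled, and it is not iroh's implementation. Each side sends an outbound packet first (which is what would open a mapping on a real NAT) and then waits for the peer's packet.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::time::Duration;

/// Hypothetical helper: send one probe to `peer`, then wait up to `timeout`
/// for a datagram. Returns Ok(true) if a datagram from `peer` arrived.
fn send_probe_and_wait(
    sock: &UdpSocket,
    peer: SocketAddr,
    timeout: Duration,
) -> io::Result<bool> {
    // A zero timeout is rejected by the standard library, so pass a real duration.
    sock.set_read_timeout(Some(timeout))?;
    // Sending first is the important part: on a real NAT the outbound
    // packet is what creates the mapping the peer's packet can come back through.
    sock.send_to(b"probe", peer)?;

    let mut buf = [0u8; 16];
    match sock.recv_from(&mut buf) {
        Ok((_len, from)) => Ok(from == peer),
        // The timeout surfaces as WouldBlock or TimedOut depending on the platform.
        Err(e)
            if matches!(
                e.kind(),
                io::ErrorKind::WouldBlock | io::ErrorKind::TimedOut
            ) =>
        {
            Ok(false)
        }
        Err(e) => Err(e),
    }
}
```

On loopback both calls succeed trivially. Across real NATs the two calls would have to be started at roughly the same moment on both machines, which is the job of the coordination server described above.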

flub commented 4 months ago

I've finally looked through your logs properly. Apologies for taking so long to get back to this. I found one bug because of something weird in your logs, but fixing it won't fix your issue.

Otherwise the logs look fine: there are coordinated holepunching attempts. But nothing makes it through. One explanation could be that something is filtering your network. However another option could be that your network is a bit lossy. You do mention mobile phones and that it changes rather randomly, maybe even time of the day. If it starts being less reliable at busy times it might be because some packets are dropped.

So this issue made me realise (again probably) our holepunching is rather vulnerable to packet loss. This is certainly something we should figure out how to improve on.

flub commented 4 months ago

I created #2481 to track the packet-loss problem during holepunching.

zh522130 commented 4 months ago

> Otherwise the logs look fine: there are coordinated holepunching attempts. But nothing makes it through. One explanation could be that something is filtering your network. However another option could be that your network is a bit lossy. You do mention mobile phones and that it changes rather randomly, maybe even time of the day. If it starts being less reliable at busy times it might be because some packets are dropped.

Perhaps your guess is correct; it might indeed be a network issue on my end. I have been consistently able to successfully establish a connection when one end of the network is a mobile hotspot for the past week.

zh522130 commented 4 months ago

> Everyone can run a relay server, and use it together with other relay servers. You need to add it to your Relay Map in the client configuration. The relay servers do not need to be aware of each other. We also publish the iroh-relay binary in our releases for this purpose.
>
> Maybe this creates slightly more friction to running a relay server, but it is an important component to the connectivity, and letting any client participate as a relay would not result in the desired reliability and uptime for our goals. So we feel this extra friction is worth it.

A client functioning as a relay differs from a standard relay service; it involves more effort and may not guarantee stability (e.g., when the client is shut down). This might be best implemented by users themselves according to their specific requirements.

flub commented 4 months ago

> > Otherwise the logs look fine: there are coordinated holepunching attempts. But nothing makes it through. One explanation could be that something is filtering your network. However another option could be that your network is a bit lossy. You do mention mobile phones and that it changes rather randomly, maybe even time of the day. If it starts being less reliable at busy times it might be because some packets are dropped.
>
> Perhaps your guess is correct; it might indeed be a network issue on my end. I have been consistently able to successfully establish a connection when one end of the network is a mobile hotspot for the past week.

To be clear, packet loss doesn't mean there's an issue with your network. It is entirely normal for networks to lose packets, especially when there's congestion.

Since iroh continues to try holepunching every 5 seconds it would be interesting to see if it eventually succeeds, maybe after a long time. But even so, I don't expect this to be guaranteed to work eventually.
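As a rough way to reason about the retry behaviour described above, here is a minimal sketch. It assumes (hypothetically) that every attempt succeeds independently with the same probability `p`; real packet loss is often bursty, and a filtering middlebox corresponds to `p = 0`, which is why eventual success is not guaranteed. The function names are illustrative and not part of iroh.

```rust
/// Probability that at least one of `attempts` independent attempts succeeds,
/// when each single attempt succeeds with probability `p` (0.0..=1.0).
fn eventual_success(p: f64, attempts: u32) -> f64 {
    1.0 - (1.0 - p).powi(attempts as i32)
}

/// Number of attempts made within `elapsed_secs`, assuming one attempt at
/// t = 0 and another every `interval_secs` seconds after that.
/// `interval_secs` must be non-zero.
fn attempts_within(elapsed_secs: u64, interval_secs: u64) -> u32 {
    (elapsed_secs / interval_secs + 1) as u32
}
```

For example, with one attempt every 5 seconds, `attempts_within(60, 5)` gives 13 attempts in the first minute. Under the independence assumption, a per-attempt success chance of 10% gives roughly a 75% chance of a direct connection within that minute, while `p = 0` stays at zero no matter how long one waits.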

zh522130 commented 4 months ago

In #2480, I mentioned that the server has multiple IP addresses, and I'm not entirely sure if the slow hole punching is related to address selection. My code prints the current ConnectionType after each data transmission. From the observations, for the slow hole punching scenarios, the connection type goes through Relay -> Mixed -> Direct stages. I noticed that the IP in the Mixed stage changes (sometimes it remains unchanged), and it seems to print LAN IPs during the Mixed stage. The correct hole punching IP only appears when it's already in Direct type, so I thought it was because the right address was selected that hole punching succeeded. After reading the documentation for the Mixed type, I understood it as the Mixed type sending data through relay while also using UDP addresses to send data (attempting hole punching). Is this understanding incorrect?

zh522130 commented 4 months ago

I conducted a small experiment where I modified the following code in NodeState -> fn addr_for_send:

let typ = match (best_addr, relay_url.clone()) {
    (Some(best_addr), Some(relay_url)) => ConnectionType::Mixed(best_addr, relay_url),
    (Some(best_addr), None) => ConnectionType::Direct(best_addr),
    (None, Some(relay_url)) => ConnectionType::Relay(relay_url),
    (None, None) => ConnectionType::None,
};
let typ = match typ {
    ConnectionType::Mixed(addr, relay_url) => ConnectionType::Mixed(best_addr.unwrap(), relay_url),
    _ => typ,
};
if self.conn_type.update(typ).is_ok() {
    let typ = self.conn_type.get();
    info!(%typ, "new connection type");
}
(best_addr, relay_url)

to:

let typ = match (best_addr, relay_url.clone()) {
    (Some(best_addr), Some(relay_url)) => ConnectionType::Mixed(best_addr, relay_url),
    (Some(best_addr), None) => ConnectionType::Direct(best_addr),
    (None, Some(relay_url)) => ConnectionType::Relay(relay_url),
    (None, None) => ConnectionType::None,
};

let best_addr: SocketAddr = "[2408:843f:1800:880f:8367:751d:96b6:fb3e]:35298".parse().unwrap();
let best_addr = Some(best_addr);
let typ = match typ {
    ConnectionType::Mixed(addr, relay_url) => ConnectionType::Mixed(best_addr.unwrap(), relay_url),
    _ => typ,
};
if self.conn_type.update(typ).is_ok() {
    let typ = self.conn_type.get();
    info!(%typ, "new connection type");
}
(best_addr, relay_url)

By forcing the specified hole punching IP, the hole punching succeeded quickly. I tested this 10 times, and without forcing the IP, only one attempt was fast. With the forced IP, all hole punching attempts were quick. I speculate that if the correct IP can be selected quickly here, the hole punching process will be faster.
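The speculation above, that trying the right address earlier would speed things up, can be illustrated with a small candidate-ordering sketch. This is not how iroh selects addresses; `AddrClass`, `classify`, and `order_candidates` are hypothetical names, and the "publicly routable first" policy is just one possible ordering chosen for illustration. The classification is deliberately coarse (for example, carrier-grade NAT ranges are treated as global).

```rust
use std::net::{IpAddr, SocketAddr};

/// Coarse address classes, declared in the order they should be tried.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum AddrClass {
    Global,
    Private,
    LinkLocal,
    Loopback,
}

/// Classify a candidate address using only the standard library.
fn classify(addr: &SocketAddr) -> AddrClass {
    match addr.ip() {
        IpAddr::V4(ip) => {
            if ip.is_loopback() {
                AddrClass::Loopback
            } else if ip.is_link_local() {
                AddrClass::LinkLocal
            } else if ip.is_private() {
                AddrClass::Private
            } else {
                AddrClass::Global
            }
        }
        IpAddr::V6(ip) => {
            let first = ip.segments()[0];
            if ip.is_loopback() {
                AddrClass::Loopback
            } else if (first & 0xffc0) == 0xfe80 {
                // fe80::/10, link-local unicast
                AddrClass::LinkLocal
            } else if (first & 0xfe00) == 0xfc00 {
                // fc00::/7, unique local addresses
                AddrClass::Private
            } else {
                AddrClass::Global
            }
        }
    }
}

/// Order candidates so that publicly routable addresses come first.
/// The sort is stable, so candidates of the same class keep their order.
fn order_candidates(mut addrs: Vec<SocketAddr>) -> Vec<SocketAddr> {
    addrs.sort_by_key(|a| classify(a));
    addrs
}
```

With the two addresses that appear in this thread, `10.0.0.222:56394` is classified as `Private` and `[2408:843f:1800:880f:8367:751d:96b6:fb3e]:35298` as `Global`, so the IPv6 address would be tried first.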

zh522130 commented 4 months ago

let best_addr: SocketAddr = "[2408:843f:1800:880f:8367:751d:96b6:fb3e]:35298".parse().unwrap();
let best_addr = Some(best_addr);

The only useful part is the two lines of code above, which provide the correct IP and port from the beginning. By the way, my two computers haven't been able to successfully punch through all day today; they are behind routers with different public IPs.

flub commented 4 months ago

> This has led to another issue: when my server has multiple IP addresses, only one of them can successfully punch through (in my tests, it's the IPv6 address), with most being LAN addresses. Because direct connection addresses are currently selected at random, the probability of choosing the address that can punch through is very low and relies entirely on luck. Sometimes, it takes a long time before my IPv6 address is tried (I even began to wonder if the protocol dislikes IPv6, haha). I have attached a complete log. In my opinion, for early punching, LAN IPs might need a higher priority, but since LAN success rates are already high, if a LAN connection is not successful within a certain time (as soon as possible), more opportunities for punching through with public IPs should be given. This is also related to https://github.com/n0-computer/iroh/issues/2317
>
> logs_1720497577.log

Checking this log file (from #2480), I again don't see anything wrong and mostly suspect this is due to packet loss. It's good to know you have so much trouble with this; it would be good to have an idea of how widespread this is.

zh522130 commented 4 months ago

> By the way, my two computers haven't been able to successfully punch through all day today; they are behind routers with different public IPs.

Apologies, it seems that the issue with the computers behind the two routers failing to punch through consistently yesterday was due to my relay configuration. https://github.com/n0-computer/iroh/issues/2490#issuecomment-2224252792

Today, I switched to the default relay and was able to successfully punch through. However, this does not conflict with the logs I uploaded. When I uploaded the logs, I was using the correct relay.

flub commented 1 month ago

Hi, I had completely forgotten about tracking this issue. Do we still need to fix things here, or can it be closed?

zh522130 commented 1 month ago

I haven't done much testing recently, but I remember the current punching success rate is quite high. I'll close it for now, and reopen it if I notice any issues later.