n0-computer / iroh

A toolkit for building distributed applications
https://iroh.computer
Apache License 2.0
2.11k stars 137 forks

Enhancing iroh's Hole Punching Success Rate #2317

Open zh522130 opened 1 month ago

zh522130 commented 1 month ago

My test environment includes two computers, C1 and C2, both equipped with dynamic public IP addresses, situated behind a home router that has UPnP enabled. Additionally, there are two mobile phones, P1 and P2, using mobile networks.

When testing with Tailscale, except for the requirement for mobile phones P1 and P2 to connect through a relay, all other device-to-device connections are made directly.

When I conduct tests using iroh, direct connections are only achievable when both computers C1 and C2 are using networks with dynamic public IP addresses. If I connect either C1 or C2 to the network via a shared connection from a mobile phone, the connection must be established through a relay.

I performed the same test with Tailscale, and even if one end is connected via a shared mobile network, as long as one end is on a network with a dynamic public IP, there is only a short initial period where the connection might go through a relay before quickly switching to a direct connection.

My test code uses DnsDiscovery and connects via endpoint.connect_by_node_id. I noticed that the client reports an error every time it starts: ERROR mainline::rpc: Could not bootstrap the routing table. I implemented a loop that sends messages from the client to the server and back every second, while also printing the current connection type (conn_type). The printed information is as follows: connect type: Mixed(10.0.0.222:56394, RelayUrl("https://xxxxx.:12346/")), where the IP part alternates between a local network IP and a public IP.
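One way to read the alternating `Mixed(...)` output is to check whether the address inside it is a private or carrier-grade NAT address, or a globally routable one. The sketch below is a minimal illustration using only the Rust standard library. `ConnKind`, `AddrScope`, `addr_scope`, `is_direct`, and `describe` are hypothetical names that mirror the shape of the printed value; they are not iroh's types or API. It also assumes that `Mixed` should be counted as "not yet direct", since the relay is still part of the path.

```rust
use std::net::{IpAddr, SocketAddr};

/// Hypothetical stand-in mirroring the printed connection type.
/// This is NOT iroh's own type; the relay URL is kept as a plain string.
#[derive(Debug, Clone, PartialEq)]
enum ConnKind {
    Direct(SocketAddr),
    Relay(String),
    Mixed(SocketAddr, String),
    None,
}

/// Whether an address can be reached from the public internet at all.
#[derive(Debug, Clone, Copy, PartialEq)]
enum AddrScope {
    NonGlobal,
    Global,
}

fn addr_scope(addr: &SocketAddr) -> AddrScope {
    let non_global = match addr.ip() {
        IpAddr::V4(v4) => {
            let o = v4.octets();
            // 100.64.0.0/10 is the carrier-grade NAT range, common on mobile networks.
            let cgnat = o[0] == 100 && (o[1] & 0xc0) == 64;
            v4.is_private() || v4.is_loopback() || v4.is_link_local() || cgnat
        }
        IpAddr::V6(v6) => {
            // Prefixes are checked by hand to stay compatible with older toolchains:
            // fc00::/7 is unique local, fe80::/10 is link local.
            let first = v6.segments()[0];
            v6.is_loopback() || (first & 0xfe00) == 0xfc00 || (first & 0xffc0) == 0xfe80
        }
    };
    if non_global { AddrScope::NonGlobal } else { AddrScope::Global }
}

/// Assumption: only a pure `Direct` path counts as a finished hole punch.
fn is_direct(kind: &ConnKind) -> bool {
    matches!(kind, ConnKind::Direct(_))
}

fn describe(kind: &ConnKind) -> String {
    match kind {
        ConnKind::Direct(a) => format!("direct via {:?} address {}", addr_scope(a), a),
        ConnKind::Relay(url) => format!("relay only via {}", url),
        ConnKind::Mixed(a, url) => {
            format!("mixed: relay {} plus {:?} address {}", url, addr_scope(a), a)
        }
        ConnKind::None => "no known path".to_string(),
    }
}
```

Logging `describe(...)` once per second instead of the raw value makes it easier to see whether the direct candidate that keeps appearing is a LAN address, which cannot work across two networks, or a public one.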

flub commented 1 month ago

Thanks for the report! I love an opportunity to improve our holepunching from real-world situations.

> My test environment includes two computers, C1 and C2, both equipped with dynamic public IP addresses, situated behind a home router that has UPnP enabled. Additionally, there are two mobile phones, P1 and P2, using mobile networks.
>
> When testing with Tailscale, except for the requirement for mobile phones P1 and P2 to connect through a relay, all other device-to-device connections are made directly.
>
> When I conduct tests using iroh, direct connections are only achievable when both computers C1 and C2 are using networks with dynamic public IP addresses. If I connect either C1 or C2 to the network via a shared connection from a mobile phone, the connection must be established through a relay.

I'm not sure I fully understand the network layout yet. Are C1 & C2 on the same local network or different ones?

What works: C1 -> NAT -> NAT -> C2 (and reverse)

What doesn't work:
- C1 -> P1 -> NAT -> C2 (and reverse)
- C1 -> P1 -> P2 -> C2 (and reverse)

Is that correct? I assume C1 -> P1 is done by P1 having a wifi hotspot and thus acting as a router and C1 being connected to the hotspot's subnet? So when both P1 & P2 are used there are two different subnets?

> I noticed that the client reports an error every time it starts: ERROR mainline::rpc: Could not bootstrap the routing table.

This part is an unfortunate mistake in the current release. It should never try to connect to the mainline DHT and this is fixed for the next release. So this error is entirely harmless and does not affect any functionality.

> I implemented a loop that sends messages from the client to the server and back every second, while also printing the current connection type (conn_type). The printed information is as follows: connect type: Mixed(10.0.0.222:56394, RelayUrl("https://xxxxx.:12346/")), where the IP part alternates between a local network IP and a public IP.

Could you make the test code available? I would also really appreciate it if you could provide us with the full DEBUG-level logs of both nodes. If you prefer to share those more privately, you could email them or use whatever channel you're comfortable with. It would be great if we could work together to improve the holepunching for your situation!

zh522130 commented 1 month ago

@flub Your grasp of the network topology is spot on; C1 and C2 are on separate networks, each with its own dynamic public IP. I have not conducted tests for the scenario C1 -> P1 -> P2 -> C2, as in this setup, Tailscale would also use a relay connection.

I have attached the code and logs to this message, where I have replaced some real IPs and domain names for confidentiality. I've also included an article on NAT hole punching that I recently came across, though I'm not sure whether it will help improve hole punching.

code and log.zip
A New Method for Symmetric NAT Traversal in UDP and TCP (1).pdf

zh522130 commented 1 month ago

For the C1 -> P1 -> NAT -> C2 scenario, I looked into Tailscale's direct connections, and they do use UDP. My case is unusual because C2's router has UPnP enabled and a public IP. I think the simplest method might be for C2 to open a TCP port through UPnP so that C1 can initiate a direct TCP connection.

Regarding the two mobile phones, P1 and P2, which cannot connect directly: in addition to a server relay, openp2p-cn can use a client that has a public IP and sits behind a router supporting automatic port mapping as a relay (with that client user's consent). This avoids the high cost of central server resources and the added latency of being far from the central server. (I'm in China, far away from your servers, haha. For self-hosting, using a client on home broadband as the relay server can also save a significant amount of money.)

However, this project is in a semi-open state: the client is continuously being updated, but the source code has not been updated further. I'm not promoting this project; I just think this approach is quite good for self-hosting.

zh522130 commented 1 month ago

I spent 4 hours retesting because I found that iroh doctor could also establish a direct connection in the network topology of C1 -> P -> NAT -> C2. The test results showed that the success rate of hole punching was related to my mobile phones.

I have two phones, a Xiaomi and an iPhone. When I use the iPhone to share Wi-Fi, iroh doctor can establish a direct connection every time. With the Xiaomi phone it depends on luck; sometimes it works, sometimes it doesn't. During each test, I might switch networks, connect to Tailscale for remote access, and then restart the server (if Tailscale is turned on, iroh will use Tailscale's direct connection, so I turn it off during testing). I'm not sure what specific operation suddenly enables a direct connection, but once a direct connection is established, it can be repeated every time (I continued testing several times after a successful direct connection). I switched phones many times, and the behavior remained tied to the phone.

The test code I previously uploaded often crashed on its own, so today I used iroh-net/examples/listen.rs and connect.rs for testing, and the results matched those of the iroh doctor tests. Tailscale, however, does not have this issue: after switching networks, regardless of which phone shares the Wi-Fi, Tailscale only needs a few pings to switch from relay to direct connection.

zh522130 commented 1 month ago

There has been new progress in the testing, and the situation is somewhat complex. The test involves 2 routers, 3 computers, and 2 phones.

In the previous tests, UPnP was always turned on, but I found that there seem to be some issues with the UPnP port mapping on the router I have been consistently testing with. I'm not sure if it's a router issue or a problem with igd_next, as I keep getting timeout errors when running the igd_next example code. The iroh log also shows upnp probe failed: IO error: search timed out. However, I did see some UPnP ports mapped by iroh in the router interface, but not the current ports in use by iroh. The other router, in contrast, maps ports much faster.
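The `search timed out` error suggests that no reply to the discovery request arrived in time. A rough way to check this outside of iroh and igd_next is to send a single SSDP M-SEARCH (the UPnP discovery message) by hand and see whether anything answers. The sketch below uses only the Rust standard library. `m_search_request`, `location_header`, and `probe_gateway` are hypothetical helper names, and the code does not reproduce how igd_next performs its own search.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::time::Duration;

/// Standard SSDP multicast group and port.
const SSDP_ADDR: &str = "239.255.255.250:1900";
/// Search target for an Internet Gateway Device.
const IGD_SEARCH_TARGET: &str = "urn:schemas-upnp-org:device:InternetGatewayDevice:1";

/// Builds the M-SEARCH request; `mx_secs` is the maximum reply delay devices may use.
fn m_search_request(mx_secs: u8) -> String {
    format!(
        "M-SEARCH * HTTP/1.1\r\nHOST: {}\r\nMAN: \"ssdp:discover\"\r\nMX: {}\r\nST: {}\r\n\r\n",
        SSDP_ADDR, mx_secs, IGD_SEARCH_TARGET
    )
}

/// Extracts the LOCATION header (the device description URL) from a reply.
fn location_header(response: &str) -> Option<&str> {
    response.lines().find_map(|line| {
        let (name, value) = line.split_once(':')?;
        name.trim().eq_ignore_ascii_case("location").then(|| value.trim())
    })
}

/// Sends one M-SEARCH and waits up to `timeout` (must be non-zero) for the first reply.
/// Returns the responder and raw reply text, or `Ok(None)` if nothing answered in time.
fn probe_gateway(timeout: Duration) -> io::Result<Option<(SocketAddr, String)>> {
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    socket.set_read_timeout(Some(timeout))?;
    socket.send_to(m_search_request(2).as_bytes(), SSDP_ADDR)?;

    let mut buf = [0u8; 2048];
    match socket.recv_from(&mut buf) {
        Ok((len, from)) => {
            let text = String::from_utf8_lossy(&buf[..len]).into_owned();
            Ok(Some((from, text)))
        }
        // A read timeout surfaces as WouldBlock on Unix and TimedOut on Windows.
        Err(e) if matches!(e.kind(), io::ErrorKind::WouldBlock | io::ErrorKind::TimedOut) => {
            Ok(None)
        }
        Err(e) => Err(e),
    }
}
```

Calling `probe_gateway(Duration::from_secs(3))` a few times against each router and comparing how often it returns `None` would give a rough indication of whether the slow router is dropping or delaying discovery replies, independent of any port-mapping library.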

The conclusion is: in my network environment, the success rate of hole punching is highly dependent on whether UPnP is enabled or if there is a public IPv6 address available.

With UPnP turned on (mapping works as expected): C1 -> Phone (sharing WiFi) -> Router -> C2 always results in successful hole punching, even without a public IPv6 address.

With UPnP turned off (both the Router and Phone have public IPv6, and both C1 and C2 have IPv6 enabled): C1 -> Phone (sharing WiFi) -> Router -> C2 always succeeds in hole punching via IPv6.

~~With UPnP turned off (either the Router or Phone lacks a public IPv6, or either C1 or C2 has IPv6 disabled): The sequence C1 -> Phone (sharing WiFi) -> Router -> C2 always fails in hole punching and can only use a relay.~~

The retesting has once again produced different results. In the morning, with UPnP turned off and no IPv6, several attempts to establish a connection through mobile hotspot sharing failed. In the afternoon, even with UPnP still disabled and no IPv6, the results mirrored those from a few days prior; the iPhone hotspot was capable of establishing a connection, while the Xiaomi phone was not. I'm not sure what the issue is; the testing has been paused for now without any clear understanding or identifiable pattern.
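Because the outcome changes between runs, it may help to record every attempt against the exact setup it was made under and compare success rates rather than single results. The sketch below is one hypothetical way to keep such a tally in Rust; `Setup`, `Hotspot`, `Tally`, and `TestLog` are illustrative names, the chosen fields are only the variables mentioned in this thread, and no measured data is included.

```rust
use std::collections::BTreeMap;

/// Which phone provides the shared Wi-Fi (the two phones from this thread).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Hotspot {
    IPhone,
    Xiaomi,
}

/// One test configuration; fields are the variables varied in the tests above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Setup {
    upnp_mapping: bool,
    public_ipv6: bool,
    hotspot: Hotspot,
}

/// Counts of attempts that ended direct versus relay-only.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Tally {
    direct: u32,
    relay_only: u32,
}

impl Tally {
    /// Fraction of attempts that went direct, or `None` if nothing was recorded.
    fn success_rate(&self) -> Option<f64> {
        let total = self.direct + self.relay_only;
        (total > 0).then(|| f64::from(self.direct) / f64::from(total))
    }
}

#[derive(Default)]
struct TestLog {
    runs: BTreeMap<Setup, Tally>,
}

impl TestLog {
    /// Records the outcome of one connection attempt under `setup`.
    fn record(&mut self, setup: Setup, went_direct: bool) {
        let tally = self.runs.entry(setup).or_default();
        if went_direct {
            tally.direct += 1;
        } else {
            tally.relay_only += 1;
        }
    }

    /// One line per setup, in a stable order, for pasting into a report.
    fn report(&self) -> Vec<String> {
        self.runs
            .iter()
            .map(|(setup, tally)| {
                format!("{:?}: {} direct, {} relay-only", setup, tally.direct, tally.relay_only)
            })
            .collect()
    }
}
```

Adding further fields to `Setup`, such as time of day or whether Tailscale was running beforehand, is a cheap way to test other suspected causes once the basic tally exists.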

flub commented 1 month ago

Apologies I still haven't been able to find the time to dig into your reports here. With some luck I might next week.

In theory iroh should be fine with or without upnp working, but it is true that most places don't have upnp working so upnp probably still has more bugs.

zh522130 commented 1 month ago

> Apologies I still haven't been able to find the time to dig into your reports here. With some luck I might next week.

Although the test results have shown some inconsistencies, sharing them might still provide some reference (though it could also lead to confusion).

If subsequent optimization of this issue requires testing, I can help.

> In theory iroh should be fine with or without upnp working,

Indeed, it should be so.