tailscale / tailscale

The easiest, most secure way to use WireGuard and 2FA.
https://tailscale.com
BSD 3-Clause "New" or "Revised" License

magicsock: use multiple UDP sockets #2187

Open bradfitz opened 3 years ago

bradfitz commented 3 years ago

Tailscale usually defaults to port 41641, which people generally like for documentation and ease-of-recognition reasons.

But sometimes dinky NAT devices get overwhelmed when even a small handful of clients are all trying to use that port.

Then magicsock probes can report no UDP connectivity.

Magicsock could instead, when a fixed port is requested, listen also on ":0" and fall back to using it only when the fixed port seems broken.
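A minimal sketch of the two-socket idea, using only the standard library. The names `dualConn`, `listenDual`, `active`, and `localPort` are hypothetical and are not part of magicsock; how `fixedBroken` gets decided is left to the caller, since that detection problem is discussed further down the thread.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// dualConn is a hypothetical pair of UDP sockets; it is not a magicsock type.
type dualConn struct {
	fixed    net.PacketConn // bound to the requested, well-known port
	fallback net.PacketConn // bound to ":0", so the OS picks the port
}

// listenDual binds the requested port and an OS-chosen port on host.
// An empty host listens on all interfaces.
func listenDual(host string, port int) (*dualConn, error) {
	fixed, err := net.ListenPacket("udp4", net.JoinHostPort(host, strconv.Itoa(port)))
	if err != nil {
		return nil, fmt.Errorf("fixed port %d: %w", port, err)
	}
	fallback, err := net.ListenPacket("udp4", net.JoinHostPort(host, "0"))
	if err != nil {
		fixed.Close()
		return nil, fmt.Errorf("fallback port: %w", err)
	}
	return &dualConn{fixed: fixed, fallback: fallback}, nil
}

// active returns the socket to send from: the fixed port normally, and the
// fallback only when the caller has decided the fixed port seems broken.
func (d *dualConn) active(fixedBroken bool) net.PacketConn {
	if fixedBroken {
		return d.fallback
	}
	return d.fixed
}

// Close closes both sockets and returns the first error, if any.
func (d *dualConn) Close() error {
	err1 := d.fixed.Close()
	err2 := d.fallback.Close()
	if err1 != nil {
		return err1
	}
	return err2
}

// localPort reports the UDP port a socket is bound to, or 0 if unknown.
func localPort(c net.PacketConn) int {
	if a, ok := c.LocalAddr().(*net.UDPAddr); ok {
		return a.Port
	}
	return 0
}
```

Calling `listenDual("", 41641)` would bind the documented port plus one random port; both sockets stay open, so switching is just a matter of which one `active` returns.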

crawshaw commented 3 years ago

Potential fast fix for this customer: port hint from control protocol. We send them an unstable build and see if a random port hint fixes it. Thoughts?

bradfitz commented 3 years ago

That's still a bit of work in magicsock even for that. We don't currently have a way to change the magicsock.Options.Port at runtime.

crawshaw commented 3 years ago

Ah. So I guess the fast experiment is to ship them an unstable that by default uses a random port.

For the real fix, do we want to make this a control setting? The only reason we use a fixed port is because some admins want to watch the traffic around their network, so that opens up the possibility that we could default to a random port, and let admins choose a fixed port if they want.

bradfitz commented 3 years ago

> Ah. So I guess the fast experiment is to ship them an unstable that by default uses a random port.

Doing it by default for everybody? That risks breaking people who are currently depending on the fixed port.

e.g. https://tailscale.com/kb/1082/firewall-ports/ already says to open up 41641.

I think the right fix is to try to use 41641 and fall back to random if it doesn't work.

crawshaw commented 3 years ago

I think the fast experiment is to ship this one customer a binary that does it properly and see what happens.

For the right fix, how do you detect 41641 doesn't work? Right now we see two separate nodes, both connected, reporting they are on the endpoint :41641. One of them even reports another endpoint on the same NATted IP with a different port. The breakage is an upstream router, so magicsock is going to have to receive some information from some network prober or the control plane.

crawshaw commented 3 years ago

Presumably in the linked case, if the upstream NAT simply never let packets through 41641 for a client, then it would fall back to DERP and everything would be fine, if slow. Instead we see unreachable machines, suggesting the NAT let traffic through for a short time, then changed its mind when another Tailscale client behind the NAT caused it to reset.

My guess is they are bouncing around in the NAT's tiny routing memory very quickly.

crawshaw commented 3 years ago

On the receive side, it seems relatively straightforward to always open two magicsock ports and report them both to control, one 41641, the other random.

The problem is, if the 41641 port breaks behind bad NATs after a few packets, how does the sending magicsock decide to switch quickly from 41641, which was working, to the other port?

Maybe we should prefer the random port over 41641? Then for hard NATs, where opening a firewall port for 41641 is useful, the first random port will always fail hard, and it will fall back to using 41641?
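The "prefer the random port, fall back to 41641" ordering can be sketched on the sending side as follows. This is an illustration only: `preferRandomPort`, `firstReachable`, `prober`, `reachableOnPort`, and `mustAddrPorts` are hypothetical names, the addresses are documentation-range examples, and the probe is a stand-in for whatever real reachability check the sender would run.

```go
package main

import (
	"net/netip"
	"sort"
)

// prober reports whether a candidate endpoint answered a probe.
// Hypothetical: a real sender would plug in its own reachability check.
type prober func(netip.AddrPort) bool

// preferRandomPort returns a copy of eps with endpoints on fixedPort moved
// after all others, keeping the original relative order otherwise.
func preferRandomPort(eps []netip.AddrPort, fixedPort uint16) []netip.AddrPort {
	out := append([]netip.AddrPort(nil), eps...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Port() != fixedPort && out[j].Port() == fixedPort
	})
	return out
}

// firstReachable walks the ordered candidates and returns the first one the
// probe accepts. Behind a hard NAT with only the fixed port forwarded, the
// random-port candidates fail and the fixed port is chosen.
func firstReachable(ordered []netip.AddrPort, probe prober) (netip.AddrPort, bool) {
	for _, ep := range ordered {
		if probe(ep) {
			return ep, true
		}
	}
	return netip.AddrPort{}, false
}

// reachableOnPort is a stand-in probe that only succeeds for one port.
func reachableOnPort(port uint16) prober {
	return func(ep netip.AddrPort) bool { return ep.Port() == port }
}

// mustAddrPorts parses "ip:port" strings and panics on bad input.
func mustAddrPorts(ss ...string) []netip.AddrPort {
	out := make([]netip.AddrPort, 0, len(ss))
	for _, s := range ss {
		out = append(out, netip.MustParseAddrPort(s))
	}
	return out
}
```

With candidates `203.0.113.7:41641` and `203.0.113.7:53211`, the ordering tries 53211 first; if only 41641 is reachable, `firstReachable` ends up on 41641. This does not address how quickly a sender notices that a previously working port has gone bad.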

DentonGentry commented 1 year ago

Related: https://github.com/tailscale/tailscale/issues/2331 (particularly mention of SO_MARK)