Open bradfitz opened 3 years ago
Potential fast fix for this customer: port hint from control protocol. We send them an unstable build and see if a random port hint fixes it. Thoughts?
That's still a bit of work in magicsock even for that. We don't currently have a way to change the magicsock.Options.Port at runtime.
Ah. So I guess the fast experiment is to ship them an unstable that by default uses a random port.
For the real fix, do we want to make this a control setting? The only reason we use a fixed port is that some admins want to watch the traffic around their network. That means we could default to a random port and let admins choose a fixed port if they want.
> Ah. So I guess the fast experiment is to ship them an unstable that by default uses a random port.
Doing it by default for everybody? That risks breaking people who are currently depending on the fixed port.
e.g. https://tailscale.com/kb/1082/firewall-ports/ already says to open up 41641.
I think the right fix is to try to use 41641 and fall back to random if it doesn't work.
I think the fast experiment is to ship this one customer a binary that does it properly and see what happens.
For the right fix, how do you detect 41641 doesn't work? Right now we see two separate nodes, both connected, reporting they are on the endpoint
Presumably in the linked case, if the upstream NAT simply never let packets through 41641 for a client, it would fall back to DERP and everything would be fine, if slow. Instead we see unreachable machines, which suggests the NAT let traffic through for a short time, then changed its mind when another Tailscale client behind the NAT caused it to reset.
My guess is they are bouncing around in the NAT's tiny routing memory very quickly.
On the receive side, it seems relatively straightforward to always open two magicsock ports and report them both to control, one 41641, the other random.
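A minimal sketch of that receive side, assuming plain `net.ListenPacket` sockets rather than magicsock's real connection handling. `dualListener`, `listenDual`, and `Ports` are hypothetical names for illustration only, and reporting the ports to control is left out.

```go
package main

import (
	"fmt"
	"net"
)

// dualListener is a hypothetical holder for the two UDP sockets:
// one on the well-known port, one on an OS-assigned port.
type dualListener struct {
	Fixed  net.PacketConn // nil if the fixed port could not be bound
	Random net.PacketConn
}

// listenDual always opens a socket on an OS-assigned port and also
// tries the fixed port. Failing to bind the fixed port is not fatal.
func listenDual(fixedPort int) (*dualListener, error) {
	random, err := net.ListenPacket("udp4", ":0")
	if err != nil {
		return nil, err
	}
	d := &dualListener{Random: random}
	if fixed, err := net.ListenPacket("udp4", fmt.Sprintf(":%d", fixedPort)); err == nil {
		d.Fixed = fixed
	}
	return d, nil
}

// Ports returns the local ports to advertise, fixed port first if bound.
func (d *dualListener) Ports() []int {
	var ports []int
	for _, c := range []net.PacketConn{d.Fixed, d.Random} {
		if c != nil {
			ports = append(ports, c.LocalAddr().(*net.UDPAddr).Port)
		}
	}
	return ports
}

// Close releases both sockets.
func (d *dualListener) Close() {
	if d.Fixed != nil {
		d.Fixed.Close()
	}
	d.Random.Close()
}

func main() {
	d, err := listenDual(41641)
	if err != nil {
		panic(err)
	}
	defer d.Close()
	fmt.Println("ports to report:", d.Ports())
}
```

This only covers the local bind. It says nothing about whether a NAT in front of the machine keeps forwarding either port, which is the harder question raised in this thread.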
The problem is, if the 41641 port breaks behind bad NATs after a few packets, how does the sending magicsock decide to switch quickly from 41641, which was working, to the other port?
Maybe we should prefer the random port over 41641? Then for hard NATs, where opening a firewall port for 41641 is useful, the first random port will always fail hard, and it will fall back to using 41641?
Related: https://github.com/tailscale/tailscale/issues/2331 (particularly mention of SO_MARK)
Tailscale usually defaults to port 41641, which people generally like for documentation and ease-of-recognition reasons.
But sometimes dinky NAT devices get overwhelmed when even a small handful of clients are all trying to use that port.
Then magicsock probes can report no UDP connectivity.
Magicsock could instead, when a fixed port is requested, listen also on ":0" and fall back to using it only when the fixed port seems broken.
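One way to picture the "seems broken" decision is a small counter over probe results. The sketch below is an assumption, not magicsock code: `portSelector`, `ReportProbe`, and the failure threshold are all hypothetical, and what counts as a failed probe is left to the caller.

```go
package main

import "fmt"

// portSelector is a hypothetical helper that decides which local port
// to prefer based on recent probe results for the fixed port.
type portSelector struct {
	fixedPort     int
	fallbackPort  int
	maxFailures   int // consecutive failures before giving up on the fixed port
	failures      int
	usingFallback bool
}

func newPortSelector(fixedPort, fallbackPort, maxFailures int) *portSelector {
	return &portSelector{fixedPort: fixedPort, fallbackPort: fallbackPort, maxFailures: maxFailures}
}

// ReportProbe records the outcome of one connectivity probe sent via
// the fixed port. After maxFailures consecutive failures the selector
// switches to the fallback port and stays there.
func (s *portSelector) ReportProbe(ok bool) {
	if s.usingFallback {
		return // sticky: a flapping NAT should not pull us back
	}
	if ok {
		s.failures = 0
		return
	}
	s.failures++
	if s.failures >= s.maxFailures {
		s.usingFallback = true
	}
}

// Active returns the port that should currently be preferred.
func (s *portSelector) Active() int {
	if s.usingFallback {
		return s.fallbackPort
	}
	return s.fixedPort
}

func main() {
	s := newPortSelector(41641, 50000, 3)
	for _, ok := range []bool{true, false, false, false} {
		s.ReportProbe(ok)
		fmt.Println("probe ok:", ok, "active port:", s.Active())
	}
}
```

The fallback is sticky here because the failure mode discussed in this thread is a NAT that works briefly and then stops. Whether to ever return to the fixed port is an open design choice.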