abbradar commented 1 year ago

What is the issue?

I'm experimenting with the Tailscale networks, end goal is to determine if it's suitable for our organization. Right now we are trying Headscale because of the restrictions I describe below. AFAIU this shouldn't matter for the purposes of the issue.

Some of my hosts have a whitelist-based firewall due to regulations. I have custom DERP nodes which are whitelisted. A restricted host uses them to connect to other hosts, and everything works correctly this far.

However, when I add another non-whitelisted DERP node to the map, I can't connect to the hosts that choose it as their home DERP node from a restricted host. Looking at the traffic, it seems that Tailscale tries (unsuccessfully) to connect to the non-whitelisted DERP node STUN and TLS ports, without trying other connection routes. When I disable the non-whitelisted DERP node, everything works correctly.

I'm not sure if it's a misconfiguration on my side, a bug or if Tailscale doesn't support this kind of setup right now. I tried to find a technical description of how the connection is established, but failed — maybe someone can point me at it?

I thought of solving this with HTTPS proxies, this should also allow us to move to the official Tailscale servers. Even though, it would be nice to get a confirmation that's the way the protocol works.

Thanks!

Steps to reproduce

Setup two custom DERP nodes DerpA and DerpB. Disable the default regions;
Setup two hosts, HostA and HostB. Setup a firewall on HostA whitelisting only DerpA. HostB must be located close to the non-whitelisted DerpB;
Check that HostA has DerpA as the home DERP node, and HostB has DerpB;
On HostA: ping hostb.

Observe that ping doesn't work. Observe traffic on HostA; notice that it tries (unsuccessfully) to connect to DerpB.

Are there any recent changes that introduced the issue?

No response

OS

Linux

OS version

NixOS 23.11

Tailscale version

1.46.1

Other software

Headscale 0.22.3

Bug report

-

alexl4321 commented 1 year ago

can confirm. if a node is able to ping a derp server but has no connectivity to it (ipv4, ipv6 issues e.g.) it will still think its active and try to use the derp and traffic will be dead.

irrespectable if another "actually working" derp is available.

bradfitz commented 1 year ago

Every Tailscale node in the tailnet gets the same derp map and needs to be able to connect to all the derper nodes in the derp map. (It's okay for them to pick different homes but they all need to be reachable from everything) Tailscale doesn't currently try to find a common node that both sides can both reach if there's blocking taking place.

alexl4321 commented 1 year ago

the issue is not one node not being able to connect or anything. it literally breaks the tailnnet if there is any issue on the closest derp.

how i found this out:

lots of tailnet nodes close to problematic derp
derp latency checks still worked
ipv6 still worked so all ipv6 nodes managed to still use the node
ipv4 udp was broken (#9525 )

ipv4 only nodes (that need derp routing) would still see the "broken" derp as active / working and would try to use it - completely braking tailscale connectivity for those nodes - not choosing the next "working" derp same goes for similar nodes that dont need derp routing but derp "direct connection finding". they would not be able to do that anymore thus loosing connectivity

i managed that in various ways e.g. completely blocking certain ports, udp etc traffic on that derp server in general.

as long as the latency checks or whatever tailscale does still worked and placed it as nearest derp, anything breaking the actual data comms (when a derp is needed) will leave those nodes without connectivity / not choose the next healthy derp

bradfitz commented 1 year ago

Yes, a derper is expected to be working if it's responding to any traffic. How did you get into a state where it was only half working?

alexl4321 commented 1 year ago

9525 is how i noticed it. somehow ipv4 udp traffic has an issue when being run inside lxd containers. so i did more tests on a variety of hosts and different pv4/ipv6/tcp/udp configurations. thats how i noticed that a derp server can be marked active through e.g. one protocol or the latency check but not be working otherwise and thus killing all nodes depending on it and having it as no1 because of latency

bradfitz commented 1 year ago

@alexl4321, I think you have a wrong mental model of how DERP works that's leading you to jump to some weird debugging/conclusions.

My reply above (https://github.com/tailscale/tailscale/issues/9524#issuecomment-1732556023) is in response to @abbradar 's original problem text. If a tailscale node picks a DERP home, that DERP node needs to be accessible to any node it wishes it to talk to. That's their rendezvous point side channel to bootstrap the NAT traversal. If one side is blocked from accessing it, they can't start the NAT traversal.

How nodes choose their DERP home is based on latency (using STUN UDP, ICMP, or HTTPS measurements). Even if UDP is used for the STUN latency measurements, UDP isn't used by DERP otherwise. It's all over TCP/443 currently.

alexl4321 commented 1 year ago

@bradfitz yeah i understand that. thing is imo it is still a bug that if for some reason there is a bug on that derp or connectivity on one port gets cut (to everyone not just one side of the equation) the clients should jump to the next derp and not loose connectivity.

1) 10 clients choose DERP-A as nearest derp while it is fully functional 2) DERP-A experiences some connectivity issue that blocks IPV4 traffic to the 3478 udp port 3) clients will not notice that its not working anymore since they can still icmp it / http / https it and latency checks work 4) clients will try to send traffic through it / find direct connects through it 5) clients will be dead

in this case it was ipv4 only clients ipv6 clients managed to use ipv6 but it suppose one could do same other way round or even cut of both ports of the derp (to everyone - every client not just one sided)

at least that is the scenario that i have seen. unless it i mixed something up. i will see if i can replicate it again next days.

tailscale / tailscale

Tailscale can't work around a blocked DERP node #9524

What is the issue?

Steps to reproduce

Are there any recent changes that introduced the issue?

OS

OS version

Tailscale version

Other software

Bug report