juanfont / headscale

An open source, self-hosted implementation of the Tailscale control server
BSD 3-Clause "New" or "Revised" License

[Bug] v0.23.0-beta1 breaks built-in DERP #2025

Closed christian-heusel closed 2 weeks ago

christian-heusel commented 1 month ago

Is this a support request?

Is there an existing issue for this?

Current Behavior

$ tailscale status
100.64.0.1      pioneer              chris        linux   -
100.64.0.2      dj-magic-laserbrain  chris        linux   -
100.64.0.5      joeryzen             chris        linux   offline
100.64.0.4      meterpeter           chris        linux   -
100.64.0.6      scotty-the-fifth     chris        linux   idle, tx 4884 rx 0

# Health check:
#     - Tailscale could not connect to the 'Headscale Embedded DERP' relay server. Your Internet connection might be down, or the server might be temporarily unavailable.
#     - Tailscale could not connect to any relay server. Check your Internet connection.

Expected Behavior

The built-in DERP keeps working after the update; I have been running this setup unchanged for a long time now.

Steps To Reproduce

  1. update headscale to version v0.23.0-beta1
  2. observe that the builtin DERP stops working
  3. revert back to v0.23.0-alpha12
  4. observe that the DERP works again

I hope that I did not miss anything in the changelogs, but it looks to me like no config changes were required to keep this working between the two relevant versions.

Environment

- OS: Debian GNU/Linux trixie/sid 
- Headscale version: v0.23.0-beta1
- Tailscale version: 1.70.0

Runtime environment

Although both of the above are the case, the DERP server is publicly accessible:

ports:
  - 0.0.0.0:3478:3478

Anything else?

The startup log claims that I do not have any DERPs configured:

headscale  | 2024-07-23T00:58:26Z INF Opening database database=sqlite3 path=/var/lib/headscale/db.sqlite
headscale  | 2024-07-23T00:58:26Z WRN DERP map is empty, not a single DERP map datasource was loaded correctly or contained a region
headscale  | 2024-07-23T00:58:26Z INF home/runner/work/headscale/headscale/hscontrol/derp/server/derp_server.go:103 > DERP region: {RegionID:999 RegionCode:christian-derp RegionName:Headscale Embedded DERP Latitude:0 Longitude:0 Avoid:false Nodes:[0xc00034b7a0]}
headscale  | 2024-07-23T00:58:26Z INF home/runner/work/headscale/headscale/hscontrol/derp/server/derp_server.go:104 > DERP Nodes[0]: &{Name:999 RegionID:999 HostName:vpn.heusel.eu CertName: IPv4: IPv6: STUNPort:3478 STUNOnly:false DERPPort:443 InsecureForTests:false STUNTestIP: CanPort80:false}
headscale  | 2024-07-23T00:58:26Z INF STUN server started at [::]:3478
headscale  | 2024-07-23T00:58:26Z INF Setting up a DERPMap update worker frequency=86400000

and yet this is my derp config (snippet), which used to work with the previous versions:

# DERP is a relay system that Tailscale uses when a direct
# connection cannot be established.
# https://tailscale.com/blog/how-tailscale-works/#encrypted-tcp-relays-derp
#
# headscale needs a list of DERP servers that can be presented
# to the clients.
derp:
  server:
    # If enabled, runs the embedded DERP server and merges it into the rest of the DERP config
    # The Headscale server_url defined above MUST be using https, DERP requires TLS to be in place
    enabled: true

    # Region ID to use for the embedded DERP server.
    # The local DERP prevails if the region ID collides with other region ID coming from
    # the regular DERP config.
    region_id: 999

    # Region code and name are displayed in the Tailscale UI to identify a DERP region
    region_code: "christian-derp"
    region_name: "Headscale Embedded DERP"

    # Listens over UDP at the configured address for STUN connections - to help with NAT traversal.
    # When the embedded DERP server is enabled stun_listen_addr MUST be defined.
    #
    # For more details on how this works, check this great article: https://tailscale.com/blog/how-tailscale-works/
    stun_listen_addr: "0.0.0.0:3478"

    private_key_path: /etc/headscale/derp_server_private.key

  # List of externally available DERP maps encoded in JSON
  # urls:
  #   - https://controlplane.tailscale.com/derpmap/default

  # Locally available DERP map files encoded in YAML
  #
  # This option is mostly interesting for people hosting
  # their own DERP servers:
  # https://tailscale.com/kb/1118/custom-derp-servers/
  #
  # paths:
  #   - /etc/headscale/derp-example.yaml
  paths: []

  # If enabled, a worker will be set up to periodically
  # refresh the given sources and update the derpmap.
  auto_update_enabled: true

  # How often should we check for DERP updates?
  update_frequency: 24h
JohanVlugt commented 1 month ago

I think you forgot to add /udp in the docker compose.

This new beta update works for me without changing the setup.

    ports:
      - 3478:3478/udp
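For context: Docker Compose publishes ports over TCP unless a `/udp` suffix is given, so a bare `3478:3478` mapping never forwards STUN's UDP packets. A hypothetical check for this (the file path and snippet below are illustrative, not from this setup) could look like:

```shell
# Illustrative sketch, not from this thread: flag any mapping of the STUN
# port 3478 that lacks the "/udp" suffix, since docker-compose defaults
# to TCP when no protocol is specified.
cat > /tmp/ports-snippet.yml <<'EOF'
ports:
  - 0.0.0.0:3478:3478
EOF

if grep ':3478' /tmp/ports-snippet.yml | grep -qv '/udp'; then
  echo "WARNING: port 3478 is published without /udp (TCP only)"
fi
```

With the corrected mapping (`3478:3478/udp`), the pipeline matches nothing and no warning is printed.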
christian-heusel commented 1 month ago

Adding in the /udp did indeed solve the issue, but why did this work with the pre-beta versions? 🤔

Also, should this maybe be mentioned in the upgrade documentation for the final release?

christian-heusel commented 1 month ago

Ah, never mind; it just took tailscale status a moment to realize that the DERP is gone. Changing the network config does not help for me 😅

kradalby commented 1 month ago

I'm having trouble reproducing this, and all of the tests keep passing; it has me quite puzzled.

The error about an empty DERP map only covers the DERP maps loaded via URL/file, so in this case it is displayed before the DERPs from the embedded server; if there are no DERPs at all, the whole server will halt: https://github.com/juanfont/headscale/blob/main/hscontrol/app.go#L516-L518.

Ah, never mind; it just took tailscale status a moment to realize that the DERP is gone. Changing the network config does not help for me 😅

Does this mean it was there initially, but then disappeared after?

kradalby commented 1 month ago

I've expanded the DERP tests a bit in #2030 to ensure that the embedded server isn't removed by the updater.

# Health check:
#     - Tailscale could not connect to the 'Headscale Embedded DERP' relay server. Your Internet connection might be down, or the server might be temporarily unavailable.
#     - Tailscale could not connect to any relay server. Check your Internet connection.

So this makes me think that this is a networking issue, because headscale sends the DERP server as part of the map update. I can't really think of anything in the commits between the last alpha and the beta that would have changed this. Could there be an external event/change to your Docker setup? 🤔 (Odd, since reverting works.)

I did notice this, though:

headscale  | 2024-07-23T00:58:26Z INF STUN server started at [::]:3478

This could indicate that it only listens on IPv6? However, my test logs show the same, so I would find it odd to be the cause, and I do not think anything related to that has changed.

christian-heusel commented 1 month ago

Does this mean it was there initially, but then disappeared after?

No, the way I'm testing this is that I redeploy the other version on my VPS and then run tailscale status on my client to see whether it's still working / printing the error.

So this makes me think that this is a networking issue, because headscale sends the DERP server as part of the map update. I can't really think of anything in the commits between the last alpha and the beta that would have changed this. Could there be an external event/change to your Docker setup? 🤔 (Odd, since reverting works.)

This was my first thought as well, but the issue now reproduces across multiple Docker versions, and really consistently with every switch of images that I do.

This could indicate that it only listens on IPv6? However, my test logs show the same, so I would find it odd to be the cause, and I do not think anything related to that has changed.

After switching to the -debug variant of the image I was able to check this inside the container, and the outputs were the same for both versions:

/ # netstat -lntu | grep 3478
udp        0      0 :::3478                 :::*                                
$ ss -tulpn | grep 3478
udp   UNCONN 0      0            0.0.0.0:3478       0.0.0.0:*    users:(("docker-proxy",pid=5895,fd=4))                                      
udp   UNCONN 0      0               [::]:3478          [::]:*    users:(("docker-proxy",pid=5901,fd=4))                                      

Since none of this helped, I also had a look at the output of tailscaled on my client, and this looks interesting:

Jul 25 12:30:25 meterpeter tailscaled[131191]: derphttp.Client.Recv: connecting to derp-999 (christian-derp)
Jul 25 12:30:25 meterpeter tailscaled[131191]: magicsock: [0xc0035fd540] derp.Recv(derp-999): derphttp.Client.Recv connect to region 999 (christian-derp): dial tcp4: lookup vpn.heusel.eu: no such host
Jul 25 12:30:25 meterpeter tailscaled[131191]: netcheck: netcheck.runProbe: named node "999" has no v6 address
Jul 25 12:30:25 meterpeter tailscaled[131191]: netcheck: netcheck: DNS lookup error for "vpn.heusel.eu" (node "999" region 999): context canceled
Jul 25 12:30:25 meterpeter tailscaled[131191]: netcheck: netcheck.runProbe: named node "999" has no v4 address
Jul 25 12:30:27 meterpeter tailscaled[131191]: control: NetInfo: NetInfo{varies= hairpin= ipv6=false ipv6os=true udp=true icmpv4=false derp=#999 portmap=UC link="" firewallmode="ipt-default"}

So what actually seems to break is the internal DNS server (or something in that realm), and the DERP failure is just fallout from that earlier breakage:

# alpha12

$ resolvectl status tailscale0         
Link 9 (tailscale0)
    Current Scopes: DNS
         Protocols: +DefaultRoute -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 100.100.100.100
       DNS Servers: 100.100.100.100
        DNS Domain: chris.vpn.heusel.eu ~.

$ resolvectl query --cache=NO vpn.heusel.eu
vpn.heusel.eu: 49.12.6.160                     -- link: tailscale0
               (christian.heusel.eu)

# extra records
$ resolvectl query --cache=NO grafana.vpn.heusel.eu
grafana.vpn.heusel.eu: 100.64.0.6              -- link: tailscale0

# node
$ resolvectl query --cache=NO scotty-the-fifth.chris.vpn.heusel.eu
scotty-the-fifth.chris.vpn.heusel.eu: 100.64.0.6 -- link: tailscale0
# beta1
$ resolvectl status tailscale0
Link 8 (tailscale0)
    Current Scopes: DNS
         Protocols: +DefaultRoute -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 100.100.100.100
       DNS Servers: 100.100.100.100
        DNS Domain: vpn.heusel.eu ~.

$ resolvectl query --cache=NO vpn.heusel.eu
vpn.heusel.eu: Name 'vpn.heusel.eu' not found

# extra records
$ resolvectl query --cache=NO grafana.vpn.heusel.eu               
grafana.vpn.heusel.eu: 100.64.0.6              -- link: tailscale0

# node 
$ resolvectl query --cache=NO scotty-the-fifth.chris.vpn.heusel.eu
scotty-the-fifth.chris.vpn.heusel.eu: Name 'scotty-the-fifth.chris.vpn.heusel.eu' not found

So apparently it now sets the "DNS Domain" to a different value, but I'm not sure whether that causes the issue 🤔

Since it might be of interest, here is my DNS config:

dns_config:
  override_local_dns: true

  nameservers:
    - 8.8.8.8

  restricted_nameservers:
    fritz.box:
      - 192.168.71.5

  domains: []

  extra_records:
    - name: "grafana.vpn.heusel.eu"
      type: "A"
      value: "100.64.0.6"

    - name: "prometheus.vpn.heusel.eu"
      type: "A"
      value: "100.64.0.6"

    - name: "alertmanager.vpn.heusel.eu"
      type: "A"
      value: "100.64.0.6"

    - name: "repo.vpn.heusel.eu"
      type: "A"
      value: "100.64.0.6"

  magic_dns: true

  base_domain: vpn.heusel.eu

Also, @kradalby, thanks for looking into this; it is very much appreciated! ❤️

christian-heusel commented 1 month ago

Possible duplicates/related issues given my latest findings: #2029 #2026

kradalby commented 1 month ago

Ah yes, a DNS issue might well be the culprit. While waiting for a reply I started writing up some clearly missing DNS tests, so I will continue with that. I'll post when I have an update, maybe on either of those two issues.

kradalby commented 1 month ago

I think #2034 addresses this; would it be possible for you to help me test it? It would be great to avoid another bad release like beta1.

Binary is available here: https://github.com/juanfont/headscale/actions/runs/10195837541?pr=2034

christian-heusel commented 1 month ago

@kradalby thanks for working on a fix! 🤗

Aside from the fact that I had to rename dns_config to dns, the mentioned PR did not fix the issues 😅 Also, there was no error about the rename from restricted_nameservers to split, but setting it did not help either; the same goes for the addition of global in the nameservers directive 🤔

kradalby commented 1 month ago

Aside from the fact that I had to rename dns_config to dns, the mentioned PR did not fix the issues 😅

Yes, sorry, that's part of the PR. Looking at your config, I have one theory: can you try setting a dns.base_name different from the DNS name you use for headscale? So magicdns.vpn.heusel.eu as base_name, and keep vpn.heusel.eu for headscale?

Also, there was no error about the rename from restricted_nameservers to split, but setting it did not help either; the same goes for the addition of global in the nameservers directive 🤔

Did you not get any warnings at the beginning of your logs? I've made it so that, if the old keys are not replaced, it should fatal now.

kradalby commented 1 month ago

To test, you can set dns.use_username_in_magic_dns to true. The option will be removed, but it temporarily gives you back the username in the DNS names, which should have the same effect.

It might be a good thing that we discovered this: having the same base_name and headscale DNS name will no longer be possible, due to how Tailscale takes over the DNS.

For the record, in Tailscale upstream, this is the same behaviour:

So because headscale injected the username, it did not break before, but that injection prevents us from achieving some other things, so it sadly has to go.
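A minimal sketch of the resulting constraint, assuming the renamed dns section from #2034 and using the hostnames from this thread as example values (not an authoritative config):

```yaml
# Sketch only: the MagicDNS base_domain must differ from the hostname
# clients use to reach headscale, because Tailscale takes over resolution
# for the base_domain and would shadow the server's own public DNS record.
server_url: https://vpn.heusel.eu

dns:
  magic_dns: true
  # NOT "vpn.heusel.eu" -- that would make the headscale hostname
  # unresolvable from inside the tailnet:
  base_domain: magicdns.vpn.heusel.eu
```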

kradalby commented 3 weeks ago

@christian-heusel did you have an opportunity to test this?

christian-heusel commented 3 weeks ago

Sorry, I forgot about this; I will test and report soon!

christian-heusel commented 3 weeks ago

To test, you can set dns.use_username_in_magic_dns to true. The option will be removed, but it temporarily gives you back the username in the DNS names, which should have the same effect.

This makes the three types of queries from above work again 😊 👍🏻

Regarding https://github.com/juanfont/headscale/issues/2025#issuecomment-2264760872:

When I unset the previously set dns.use_username_in_magic_dns and set the base_name as requested, it also works as expected 👍🏻

Did you not get any warnings at the beginning of your logs? I've made it so if not replaced it should fatal now.

Maybe I'm testing this wrong, but I don't get any warnings/fatals with the latest version of your branch and the following DNS config snippet (which I have verified to be the active one inside the container by running docker compose exec headscale cat /etc/headscale/config.yaml):

dns:
  override_local_dns: true
  nameservers:
    # global:
      - 8.8.8.8
  restricted_nameservers:
  # split:
    fritz.box:
      - 192.168.71.5
  domains: []
  magic_dns: true
  base_domain: magicdns.vpn.heusel.eu

Instead I'm being warned about a key I don't even have set:

WARN: The "dns.use_username_in_magic_dns" configuration key is deprecated and has been removed. Please see the changelog for more details.
christian-heusel commented 3 weeks ago

Edit: reverted a bogus comment here; I had tried to connect to a node of mine that went offline for unknown reasons. 😆

kradalby commented 2 weeks ago

Maybe I'm testing this wrong, but I don't get any warnings/fatals with the latest version of your branch and the following DNS config snippet (which I have verified to be the active one inside the container by running docker compose exec headscale cat /etc/headscale/config.yaml):

Hmm, you won't really get any errors/warnings for setting the wrong keys; for example, dns.nameservers isn't checked, while dns_config.nameservers is. I suppose we could do it, but there is no good way in cobra to cover all cases, only the ones we can think of.

At the moment it will only warn if you have the old keys set and not the new ones. If you mix them, it won't detect it.