siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Split dns on talos machine config #7287

Open btrepp opened 1 year ago

btrepp commented 1 year ago

Feature Request

Allow configuring certain domains to be forwarded to other DNS resolvers.
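To make the request concrete, here is a minimal sketch of what "split DNS" means: pick the upstream resolver by the longest matching domain suffix. It is a toy model only. The function name, the route table, and the addresses are assumptions for illustration and do not correspond to any existing Talos option.

```python
# Toy model of split DNS: choose upstream resolvers by longest matching suffix.
# The route table and addresses are illustrative assumptions, not Talos config.

def pick_resolvers(name, routes, default):
    """Return the resolver list for `name`, preferring the most specific suffix."""
    labels = name.rstrip(".").lower().split(".")
    # Try "a.b.c", then "b.c", then "c" so the longest matching suffix wins.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in routes:
            return routes[suffix]
    return default


ROUTES = {"ts.net": ["100.100.100.100"]}  # hypothetical: tailnet names go to Tailscale DNS
DEFAULT = ["192.168.0.1"]                 # hypothetical: everything else goes to the router

print(pick_resolvers("nas.my-tailnet.ts.net", ROUTES, DEFAULT))
print(pick_resolvers("www.talos.dev", ROUTES, DEFAULT))
```

With per-domain routing like this, the order of the default resolvers would no longer decide who answers for tailnet names.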

Description

I've been developing a Tailscale extension to allow talos nodes to have Tailscale IPs (and the long term goal is to talk to backend services such as storage, over a Tailscale network).

https://github.com/siderolabs/extensions/pull/154

One of the issues is that it would be great to use Tailscale's MagicDNS, so you can write things like 'nas' in your config files and DNS will point you to the correct Tailscale machine.

Tailscale includes this; however, it tries to overwrite /etc/resolv.conf. This works great if I bind mount it, but when things go wrong, they go really wrong.

Current workaround

At the moment you can run a DNS server externally and configure it however you wish, but that becomes more external infrastructure you need to maintain. Alternatively, you can use your Tailscale IPs directly, but then you have to make sure the IPs stay aligned (and if Talos wipes a disk, you get a new IP from Tailscale).

smira commented 1 year ago

Long-term I feel we should have system extensions which are critical and run always, and probably have a way to override/inject values into resolv.conf, but many pieces are missing at the moment.

For the registry endpoint, you can use registry mirror config to resolve it to a Tailscale IP, as these are assigned in a static way.

michaelbeaumont commented 1 year ago

@btrepp Maybe you can clear up my confusion. I appear to be able to use Split DNS with the extension. However, I'm running Talos in a VM on a host machine that is itself part of the tailnet. Could this be the reason Split DNS works, because DNS queries are forwarded outside of the VM to the host's DNS, which is configured with Split DNS?

Search Domains is the feature that fails, presumably because it requires edits to /etc/resolv.conf, even if it's running in said VM.

I create CP nodes named cp-0 with the tailscale extension and set the Kubernetes endpoint to be cp.ts. I've got CoreDNS running outside of Talos configured to answer with a CNAME pointing to cp-0.my-tailnet.ts.net when queried for cp.ts. This CoreDNS is configured for .ts using Split DNS. Everything seems to work... Is it going to go horribly wrong at some point, assuming I keep the VM on a host in the tailnet?

It's when I configure Search Domains for ts and use cp as the Kubernetes endpoint that something seems wrong, namely that although everything seems Healthy and the node is Ready, the node can't reach the API server at cp. Perhaps I could even configure libvirt's dnsmasq to include the search domain...

btrepp commented 1 year ago

Yep. I think DNS basically goes up the stack.

For me, it's metal Talos -> router. For you, it would be Talos -> VM host.

As the extension runs in a container, it doesn't change the Talos configs. I did experiment with modifying resolv.conf but ended up having a bad time with it.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

michaelbeaumont commented 4 months ago

This would definitely still be a great feature!

rgl commented 3 months ago

now that host-dns exists, maybe this is now possible to implement?

smira commented 3 months ago

It should work in main now with the Tailscale DNS endpoint being the first entry in nameservers and your recursive DNS resolver being the second.
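As a sketch of the ordering described above, a machine config fragment could look like the following. The addresses are examples only: 100.100.100.100 is the Tailscale resolver address that appears later in this thread, and 192.168.0.1 stands in for whatever recursive resolver you normally use.

```yaml
# Sketch of a machine config fragment; both addresses are example values.
machine:
    network:
        nameservers:
            - 100.100.100.100   # Tailscale DNS endpoint first
            - 192.168.0.1       # regular recursive resolver second (assumed: LAN router)
```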

rgl commented 3 months ago

Does that mean that "Allow configuring certain domains to be forwarded to other DNS resolvers" is in main already (and not tied to Tailscale)?

smira commented 3 months ago

I don't know what you're talking about, sorry. I have no idea about Tailscale, all I said is that split DNS should work in main now.

rgl commented 3 months ago

I do not know about Tailscale either. Since you were the one mentioning it, I wanted to clarify whether this feature was tied to Tailscale. From your answer, I will assume it's not tied to Tailscale. :-)

How do I configure this? The 1.8 docs at https://www.talos.dev/v1.8/talos-guides/network/host-dns/ do not seem to mention how to configure this feature.

smira commented 3 months ago

There is no feature at all; it will just correctly iterate over the configured nameservers if one returns NXDOMAIN/SERVFAIL.

michaelbeaumont commented 2 months ago

@smira AFAICT this doesn't happen with NXDOMAIN https://github.com/siderolabs/talos/blob/7edcbbb833fc56b054ce9ecebc3416f676a51851/internal/pkg/dns/dns.go#L147 assuming we're talking about https://github.com/siderolabs/talos/pull/9179

Is there anything standing in the way of just switching to coredns for node DNS as a separate service?

It's not possible to work around this either, because the order of resolvers doesn't appear to be totally under the user's control:

https://github.com/siderolabs/talos/blob/7edcbbb833fc56b054ce9ecebc3416f676a51851/internal/app/machined/pkg/controllers/network/dns_resolve_cache.go#L158-L172

My router DNS seems to always show up first in the list, probably because it comes from DHCP before the machine config is applied.

smira commented 2 months ago

I believe a DNS server shouldn't return NXDOMAIN if it doesn't know about the domain, so the DNS server is wrong (if I'm wrong, it's easy to fix).

The DNS servers on initial boot before machine config is applied can be controlled via kernel cmdline, but the machine config overwrites any DNS servers configured by other means.

michaelbeaumont commented 2 months ago

> I believe a DNS server shouldn't return NXDOMAIN if it doesn't know about the domain, so the DNS server is wrong (if I'm wrong, it's easy to fix).

I do agree, just wanted to make it clear it doesn't work with NXDOMAIN, only SERVFAIL.

I think the issue is that Tailscale uses <machine-name>.<network-name>.ts.net as FQDNs but only returns records on its network-internal resolver. Since .ts.net is a real domain, Cloudflare, for example, will return NXDOMAIN. But the network-internal resolver returns the machine IP on the TS overlay network.

;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 1
;; AUTHORITY SECTION:
ts.net.         300 IN  SOA ns1.dnsimple.com. admin.dnsimple.com.

;; Query time: 20 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2
;; ANSWER SECTION:
my-machine.my-network.ts.net. 600   IN  A   100.90.80.70

;; Query time: 0 msec
;; SERVER: 100.100.100.100#53(100.100.100.100) (UDP)
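A toy model of why the order matters, assuming the behaviour reported in this thread: upstreams are tried in order, SERVFAIL moves on to the next one, and any other answer (including NXDOMAIN) is final. This is not the Talos implementation. The resolvers are faked with dictionaries, and treating unknown names as SERVFAIL is a modelling simplification.

```python
# Toy model of sequential upstream fallback as reported in this thread:
# SERVFAIL falls through to the next upstream, any other status is final.
# NOT the Talos implementation; resolver behaviour is faked with dictionaries.

FAKE_UPSTREAMS = {
    # The tailnet-internal resolver knows the tailnet name.
    "100.100.100.100": {"my-machine.my-network.ts.net": ("NOERROR", "100.90.80.70")},
    # A public resolver sees ts.net as a real zone that lacks this record.
    "1.1.1.1": {"my-machine.my-network.ts.net": ("NXDOMAIN", None)},
}


def query(upstream, name):
    """Fake lookup; names missing from the table yield SERVFAIL in this model."""
    return FAKE_UPSTREAMS[upstream].get(name, ("SERVFAIL", None))


def resolve(name, upstreams):
    """Try upstreams in order; only SERVFAIL triggers a fallback."""
    status, answer = "SERVFAIL", None
    for upstream in upstreams:
        status, answer = query(upstream, name)
        if status != "SERVFAIL":
            break
    return status, answer


NAME = "my-machine.my-network.ts.net"
print(resolve(NAME, ["100.100.100.100", "1.1.1.1"]))  # tailnet resolver first
print(resolve(NAME, ["1.1.1.1", "100.100.100.100"]))  # public resolver first
```

In this model the lookup only succeeds when the tailnet resolver comes first, because the public resolver's NXDOMAIN ends the search before the tailnet resolver is asked.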

> The DNS servers on initial boot before machine config is applied can be controlled via kernel cmdline, but the machine config overwrites any DNS servers configured by other means.

It doesn't, from my testing.

EDIT: removed irrelevant code refs

What I see:

❯ talosctl get resolverspec -o yaml
metadata:
    namespace: network
    type: ResolverSpecs.net.talos.dev
    id: resolvers
spec:
    dnsServers:
        - fd7a:115c:a1e0::53
        - 192.168.0.1
    layer: configuration
$ dig @fd7a:115c:a1e0::53 my-machine.my-network.ts.net
my-machine.my-network.ts.net. 600   IN  A   100.90.80.70
$ dig @169.254.116.108 my-machine.my-network.ts.net
ts.net.         10  IN  SOA ns1.dnsimple.com. admin.dnsimple.com.
$ dig @192.168.0.1 my-machine.my-network.ts.net
ts.net.         10  IN  SOA ns1.dnsimple.com. admin.dnsimple.com.

smira commented 2 months ago

It probably makes sense to create issues with a full description for both, as I don't quite understand your case.

Your tailnet resolver should come before the Cloudflare one.

DNS servers should be completely changeable with machine config.

DmitriyMV commented 1 month ago

Just a heads up: since #9310, the statement

> order of resolvers doesn't appear to be totally under the user's control

is no longer true. So this should be fixed now? By that I mean that, with the recent PRs, the second workaround from the original issue should probably work.