siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

DNS resolvers for IPv6 gets removed during boot [OpenStack] #8690

Closed MindTooth closed 1 week ago

MindTooth commented 2 months ago

Bug Report

Thanks for the help so far. 🙏🏻

Description

In OpenStack I have two subnets, one for IPv4 and one for IPv6. On each I have set two DNS servers, four in total. However, the IPv6 resolvers get removed. The network setup also flips back and forth, so it's difficult for me to understand what is happening.

E.g. I see about five messages of the form [talos] setting resolvers {"component": "controller-runtime", "controller": "network.ResolverSpecController", "resolvers": and four of the form updated dns server nameservers {"component": "dns-resolve-cache", "addrs" during a single boot.

There are also messages about the IPv4 address being removed and added twice. It seems strange to me that the interface is reconfigured so many times during a single boot. :smile:

Currently on Cilium. I can try Flannel too?


Edit: it seems that, because of this, IPv6 lookups take a long time, so it takes a while before the boot proceeds. Does Talos use some fallback when the IPv6 resolvers are not added? talosctl logs dns-resolve-cache does not show lookups.

Logs

dns_issue.tgz - must be decrypted.


Edit: I ran with the metadata service, as the initial cluster was set up with cloud-drive. This gave a different result:

debug_cp2_metadata.tgz - decrypt

Now it can't find the resolvers for IPv6 at all. 🤔

network_data.json - decrypt

Environment

Client:
    Tag:         v1.7.1
    SHA:         e9cb904e
    Built:       
    Go version:  go1.22.2
    OS/Arch:     darwin/arm64
Server:
    NODE:        10.10.10.51
    Tag:         v1.7.1
    SHA:         e9cb904e
    Built:       
    Go version:  go1.22.2
    OS/Arch:     linux/amd64
    Enabled:     RBAC
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0

smira commented 2 months ago

I'm not sure what the issue is, but it might be easier if you do the initial triage to help me with that, as it's hard to dig into the configuration with zero knowledge of the context.

Following the guide, you can see for yourself what all the configuration sources are and how they got merged.

E.g. for the resolvers:

  1. Get all configured resolvers (from all sources):
    talosctl get resolverspecs --namespace=network-config -o yaml
  2. Get final merged configuration:
    talosctl get resolverspecs -o yaml

If you see an issue at this point (something got merged wrong, wrong priority), you can either fix it on your side or report an issue.

The remaining piece is the translation of OpenStack metadata into Talos network configuration. An easy way to check it is to compare the OpenStack metadata document (which you already have) with what Talos translated it to, which you can read with talosctl read /system/state/platform-network.yaml (see here). If there's a bug at this point, a specific issue would be helpful, so we can add the buggy path to the unit tests.

smira commented 2 months ago

I might guess there's a conflict between DNS resolvers coming from DHCP, OpenStack metadata and your machine config, but some data gathered as I outlined above would help a lot.

MindTooth commented 2 months ago

Yes, you are right. Operator before Platform. I've attached the outputs.

resolve_issue.tgz


So, from the data you can see that DHCP4 takes precedence over platform (IPv6 SLAAC).
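To illustrate the effect, here is a minimal Python sketch. It assumes the final resolver list is taken whole from the highest-priority layer instead of being merged across layers. The numeric ranks, the `pick_resolvers` helper, and the addresses are hypothetical. Only the relative order (operator above platform) comes from the outputs attached here, and this is not Talos's actual code.

```python
# Hypothetical ranks: only "operator above platform" is taken from this thread.
PRIORITY = {"platform": 0, "operator": 1}


def pick_resolvers(specs):
    """Return the DNS list of the highest-priority layer that has one."""
    candidates = [spec for spec in specs if spec["dns"]]
    if not candidates:
        return []
    winner = max(candidates, key=lambda spec: PRIORITY[spec["layer"]])
    return winner["dns"]


# Placeholder addresses from the documentation ranges.
specs = [
    {"layer": "platform",
     "dns": ["192.0.2.1", "192.0.2.2", "2001:db8::1", "2001:db8::2"]},
    {"layer": "operator",  # e.g. a DHCPv4 lease, which carries only IPv4 servers
     "dns": ["192.0.2.1", "192.0.2.2"]},
]

print(pick_resolvers(specs))
```

Under that assumption, the IPv6 resolvers from the platform layer never reach the final list, which matches the symptom in this report.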

169.254.169.254/openstack/latest/network_data.json contains a section for "services": [] and inside you have "type": "dns". I would assume that this will always contain the cumulative collection of all DNS addresses added to the subnets.

https://github.com/siderolabs/talos/blob/78b48eb3ae78ec9953104247ec73cafa26a61264/internal/app/machined/pkg/runtime/v1alpha1/platform/openstack/testdata/network.json#L143-L153
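As a rough way to check that assumption against a saved copy of network_data.json, the top-level DNS entries can be listed with a few lines of Python. This is a minimal sketch: the `dns_from_services` helper is hypothetical, the `address` key is assumed from the usual shape of OpenStack service entries, and the sample addresses are placeholders, not values from my deployment.

```python
import json


def dns_from_services(network_data):
    """Return the addresses of all top-level services of type "dns", in order."""
    return [
        service["address"]
        for service in network_data.get("services", [])
        if service.get("type") == "dns" and "address" in service
    ]


# Hypothetical sample document using addresses from the documentation ranges.
sample = json.loads("""
{
  "services": [
    {"type": "dns", "address": "192.0.2.53"},
    {"type": "dns", "address": "2001:db8::53"}
  ]
}
""")

print(dns_from_services(sample))
```

Comparing this list with the resolvers Talos ends up with shows which addresses were dropped along the way.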

Would it be natural for OpenStack to include all DNS servers from DHCP and also append the unique addresses from "services"?

Without explicitly setting the DNS inside machine:, Talos should gather all resolvers provided by OpenStack. This use case may be rare with IPv6, especially with SLAAC. But either we need to update the docs to tell users to set resolvers explicitly, or change the logic of the OpenStack integration.

Thoughts?

Thank you for taking the time to reply.

smira commented 2 months ago

This one seems similar to the hostname issue, but not quite the same.

As I understand from your dump, OpenStack returns a full list of DNS servers (2 IPv4 + 2 IPv6), but configures the interface to run DHCPv4, which obviously only returns 2 IPv4 DNS servers (it can't return IPv6). So in this particular case, I wonder if we should apply some special rules when merging the lists of resolvers, as clearly we could be smart enough to preserve the IPv6 resolvers. I will think about this case a bit more.
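One possible shape for such a rule, sketched in Python purely as an illustration: keep the higher-priority list as is, and add entries from the lower-priority list only for address families the higher-priority list does not cover. The `merge_resolvers` helper and the addresses are hypothetical, and this is not a description of what Talos does or will do.

```python
import ipaddress


def merge_resolvers(preferred, fallback):
    """Keep `preferred` unchanged and append `fallback` entries only for
    address families (IPv4 or IPv6) that `preferred` does not cover."""
    covered = {ipaddress.ip_address(addr).version for addr in preferred}
    merged = list(preferred)
    for addr in fallback:
        if ipaddress.ip_address(addr).version not in covered and addr not in merged:
            merged.append(addr)
    return merged


# Placeholder addresses from the documentation ranges.
dhcp4 = ["192.0.2.1", "192.0.2.2"]
platform = ["192.0.2.1", "192.0.2.2", "2001:db8::1", "2001:db8::2"]

print(merge_resolvers(dhcp4, platform))
```

With these inputs the DHCPv4 servers stay first and the two IPv6 servers from the platform list are preserved. If the higher-priority list already contains an IPv6 server, nothing is added from the fallback.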