siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Don't attempt to create sockets for unavailable IP address family #8659

Open sanmai-NL opened 4 months ago

sanmai-NL commented 4 months ago

Feature Request

Description

On a single-stack IPv6 Talos Linux deployment, I noticed that the A DNS Resource Record (IPv4) for discovery.talos.dev is resolved, and an attempt is made to connect to the resulting IPv4 address. However, on this node no network interface has an IPv4 address assigned (with the possible exception of the loopback interface). This happens even though discovery.talos.dev also has an AAAA Resource Record (IPv6).

[talos] 2024/04/25 20:21:00 hello failed {"component": "controller-runtime", "controller": "cluster.DiscoveryServiceController", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 172.174.35.21:443: connect: network is unreachable\"", "endpoint": "discovery.talos.dev:443"}

This issue is an epic to track all instances where Talos Linux components create sockets for unavailable IP address families.

Value

This feature would reduce logging noise and improve performance and energy efficiency.

smira commented 4 months ago

The standard behavior of all dialers is to try both IPv4 and IPv6 concurrently, with some delay, and to use whichever responds first. If you see an error, both failed, but only one error is reported. This is the way the Go standard library works and has nothing to do with Talos specifically, i.e. it will be the same in any other Go-based component such as containerd or Kubernetes.

Userspace programs can't guess which network is available, and whether or not addresses are assigned might not be enough to make a valid guess.

Talos Linux retries on all failures, so there's no functional issue here, nor is there any real delay in getting to the working state. If IPv6 had worked at that moment, there would have been no error at all.
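
For readers unfamiliar with the dialer behavior described above, here is a minimal sketch of how Go's `net.Dialer` chooses an address family. The helper names `dialWithFamily` and `loopbackProbe` are hypothetical and are not part of Talos. The network name `"tcp"` lets the dialer try both families with a fallback delay, while `"tcp4"` and `"tcp6"` restrict it to one family. The probe only uses a loopback listener, so it illustrates the API and does not reproduce the discovery service case.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// dialWithFamily dials addr using the given network name.
// "tcp" allows both families (fast fallback), "tcp4"/"tcp6" restrict to one.
func dialWithFamily(ctx context.Context, network, addr string) (net.Conn, error) {
	d := net.Dialer{
		Timeout: 2 * time.Second,
		// Delay before the fallback family is tried when dialing "tcp".
		// Zero means the standard library default (300ms); negative disables fallback.
		FallbackDelay: 300 * time.Millisecond,
	}
	return d.DialContext(ctx, network, addr)
}

// loopbackProbe listens on an IPv4 loopback address and dials it twice:
// once restricted to IPv4 and once restricted to IPv6.
func loopbackProbe() (v4OK bool, v6Err error) {
	ln, err := net.Listen("tcp4", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	defer ln.Close()

	ctx := context.Background()
	addr := ln.Addr().String()

	if conn, err := dialWithFamily(ctx, "tcp4", addr); err == nil {
		conn.Close()
		v4OK = true
	}
	// An IPv4 literal has no IPv6 candidate, so this dial is rejected
	// before any socket is created.
	if conn, err := dialWithFamily(ctx, "tcp6", addr); err != nil {
		v6Err = err
	} else {
		conn.Close()
	}
	return v4OK, v6Err
}

func main() {
	ok, err := loopbackProbe()
	fmt.Println("tcp4 dial succeeded:", ok)
	fmt.Println("tcp6 dial error:", err)
}
```

Restricting the network name is static, which is the trade-off discussed in this thread: it avoids the unwanted attempt, but it has to be chosen before the dial rather than discovered by it.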

sanmai-NL commented 3 months ago

@smira Can you please reopen this?

See https://github.com/golang/go/issues/25321#issuecomment-948787769. It's been reported as an issue by users of multiple other Go-based products, and an easy fix is possible.

smira commented 3 months ago

I can re-open this issue, but I don't quite see what can be done unless there's a specific bug here (which we'd be happy to look into).

Any static pre-check of address family availability doesn't make sense in Talos, as it reconfigures networking on the fly, and whatever seemed to be an IPv4 environment might become IPv6 and vice versa. Doing checks on every dial operation is more expensive than just trying to dial.

Talos does a small amount of network operations in general (compared to other components running on the machine).
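
To make the kind of pre-check under discussion concrete, here is a rough sketch that inspects interface addresses and reports which families have a non-loopback, non-link-local address. The names `families`, `routableFamilies`, and `familiesFromCIDRs` are hypothetical and this is not how Talos works. It also shows the limitation raised earlier in the thread: an assigned address says nothing about routes or actual reachability, and the answer can change as networking is reconfigured.

```go
package main

import (
	"fmt"
	"net"
)

// families records which address families have a candidate source address.
type families struct {
	IPv4, IPv6 bool
}

// routableFamilies scans interface addresses and ignores loopback,
// link-local, and unspecified addresses.
func routableFamilies(addrs []net.Addr) families {
	var f families
	for _, a := range addrs {
		ipnet, ok := a.(*net.IPNet)
		if !ok {
			continue
		}
		ip := ipnet.IP
		if ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsUnspecified() {
			continue
		}
		if ip.To4() != nil {
			f.IPv4 = true
		} else {
			f.IPv6 = true
		}
	}
	return f
}

// familiesFromCIDRs is a convenience for trying the check on literal
// addresses such as "2001:db8::10/64". Unparseable entries are skipped.
func familiesFromCIDRs(cidrs ...string) families {
	var addrs []net.Addr
	for _, c := range cidrs {
		ip, ipnet, err := net.ParseCIDR(c)
		if err != nil {
			continue
		}
		ipnet.IP = ip // keep the host address, not the network address
		addrs = append(addrs, ipnet)
	}
	return routableFamilies(addrs)
}

func main() {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		fmt.Println("cannot list interface addresses:", err)
		return
	}
	fmt.Printf("this host: %+v\n", routableFamilies(addrs))
	fmt.Printf("IPv6-only example: %+v\n",
		familiesFromCIDRs("127.0.0.1/8", "::1/128", "fe80::1/64", "2001:db8::10/64"))
}
```

Running such a scan before every dial adds a netlink query per connection attempt, which is the cost argument made above.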

sanmai-NL commented 3 months ago

@smira the kernel parameters aren't reconfigured on-the-fly, are they? These can be set to enforce single-stack IPv4, for example. I don't expect full support of all dynamic conditions, nor do I restrict the design to a single check per lifecycle (init stage). Another improvement towards this would be to handle exceptions from dialers so that only true faults (errors) are logged as such.
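
The second suggestion, logging only true faults as errors, could look roughly like the sketch below. The function names `isFamilyUnavailable`, `logLevelFor`, and `syntheticDialError` are hypothetical, and the set of errno values is an assumption about which failures indicate a missing address family on Linux. It relies on `errors.Is` unwrapping `*net.OpError` and `*os.SyscallError`, so it may not apply once the dial error has been flattened into a gRPC status message, as in the log line above.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// isFamilyUnavailable reports whether err looks like "this host has no
// usable route or support for the address family", not a real fault.
// The chosen errno values are an assumption, not an exhaustive list.
func isFamilyUnavailable(err error) bool {
	return errors.Is(err, syscall.ENETUNREACH) ||
		errors.Is(err, syscall.EAFNOSUPPORT)
}

// logLevelFor downgrades family-unavailable errors to debug level.
func logLevelFor(err error) string {
	if err == nil {
		return "none"
	}
	if isFamilyUnavailable(err) {
		return "debug"
	}
	return "error"
}

// syntheticDialError builds an error shaped like the one net.Dial returns
// when connect(2) fails, so the classification can be tried offline.
func syntheticDialError(errno syscall.Errno) error {
	return &net.OpError{
		Op:   "dial",
		Net:  "tcp",
		Addr: &net.TCPAddr{IP: net.IPv4(172, 174, 35, 21), Port: 443},
		Err:  os.NewSyscallError("connect", errno),
	}
}

// Example errors used for demonstration.
var (
	errUnreachable = syntheticDialError(syscall.ENETUNREACH)
	errRefused     = syntheticDialError(syscall.ECONNREFUSED)
)

func main() {
	for _, err := range []error{errUnreachable, errRefused} {
		fmt.Printf("[%s] %v\n", logLevelFor(err), err)
	}
}
```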

smira commented 3 months ago

> @smira the kernel parameters aren't reconfigured on-the-fly, are they? These can be set to enforce single-stack IPv4, for example. I don't expect full support of all dynamic conditions, nor do I restrict the design to a single check per lifecycle (init stage). Another improvement towards this would be to handle exceptions from dialers so that only true faults (errors) are logged as such.

One can disable IPv6, but this is too much of an outlier these days. Both SideroLink and KubeSpan rely on IPv6 addressing (not connectivity), so I don't expect many people to disable IPv6.

I would prefer not to introduce hacks in the OS unless there's a major issue.