siderolabs / terraform-provider-talos

`talos_cluster_health` apparently requires `control_plane_nodes` to be IP addresses #143

Open · michaelbeaumont opened this issue 8 months ago

I see an error like the following when using `talos_cluster_health`:

```
rpc error: code = Unknown desc = ParseAddr("services-cp-0"): unable to parse IP
```

The error message isn't clear, but a little testing shows it comes from the `control_plane_nodes` value. This requirement contrasts with other data sources that take nodes, like `talos_client_configuration` or `talos_cluster_kubeconfig`.
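For reference, a minimal sketch that reproduces this (the node name and the `talos_machine_secrets` resource are hypothetical, not from this thread):

```hcl
data "talos_cluster_health" "this" {
  client_configuration = talos_machine_secrets.this.client_configuration
  endpoints            = ["services-cp-0"]

  # Fails: entries here must be IP addresses, so a hostname
  # triggers ParseAddr("services-cp-0"): unable to parse IP
  control_plane_nodes = ["services-cp-0"]

  # Works:
  # control_plane_nodes = ["10.5.0.2"]
}
```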

Additionally, I think examples of exactly how to use `talos_cluster_health` to do what `talos_cluster_kubeconfig.wait` did would be helpful. Replacing an argument with a data source deserves an explanation.
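One plausible replacement for the old `wait` behavior is to gate the kubeconfig resource on the health data source via `depends_on`; a sketch under the same assumptions (the IP is a placeholder):

```hcl
data "talos_cluster_health" "this" {
  client_configuration = talos_machine_secrets.this.client_configuration
  endpoints            = ["10.5.0.2"]
  control_plane_nodes  = ["10.5.0.2"]
}

resource "talos_cluster_kubeconfig" "this" {
  # Blocks until the health checks above pass, roughly what
  # the old `wait` argument used to do
  depends_on           = [data.talos_cluster_health.this]
  client_configuration = talos_machine_secrets.this.client_configuration
  node                 = "10.5.0.2"
}
```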

zargony commented 7 months ago

I ran into the same issue and was wondering about best practice for the endpoints and nodes arguments.

In my talosconfig, I use FQDNs (at home) or IPs (in the cloud) as endpoints and hostnames as nodes. It's easier to distinguish nodes by name than to remember IPs, e.g. with `talosctl -n` or when reading outputs. So I did the same in my Terraform configuration, but now that this issue has come up, I wonder what the best practice is for specifying endpoints and nodes.
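For comparison, a sketch of the kind of setup described above, where `talos_client_configuration` does accept names (all identifiers here are hypothetical):

```hcl
data "talos_client_configuration" "this" {
  cluster_name         = "home"
  client_configuration = talos_machine_secrets.this.client_configuration
  endpoints            = ["talos.example.internal"] # FQDN is accepted
  nodes                = ["cp-0", "cp-1", "cp-2"]   # hostnames are accepted
}
```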

zargony commented 7 months ago

I just found out that the IP addresses given in `control_plane_nodes` are not only used for connecting but also to verify that they are etcd members.

I can imagine this might be too strict in some cases. E.g. when control plane nodes fail but etcd still has quorum, this health check will fail and potentially block applying the changes needed to fix the situation. In my case, the health check took forever and then failed because I used the control plane's IPv4 addresses while the etcd members advertise IPv6 addresses. Wouldn't it be more appropriate to relax the check and only verify that etcd has quorum? (That also wouldn't require knowing the exact etcd member IPs.) Or maybe make a quorum-only check possible as an option (e.g. by not giving control plane IPs or by setting a flag)?

frezbo commented 7 months ago

> I just found out that the IP addresses given in `control_plane_nodes` are not only used for connecting but also to verify that they are etcd members.
>
> I can imagine this might be too strict in some cases. E.g. when control plane nodes fail but etcd still has quorum, this health check will fail and potentially block applying the changes needed to fix the situation. In my case, the health check took forever and then failed because I used the control plane's IPv4 addresses while the etcd members advertise IPv6 addresses. Wouldn't it be more appropriate to relax the check and only verify that etcd has quorum? (That also wouldn't require knowing the exact etcd member IPs.) Or maybe make a quorum-only check possible as an option (e.g. by not giving control plane IPs or by setting a flag)?

The checks are currently designed as a full cluster-wide health check (cluster here does not mean Kubernetes, but the whole Talos cluster).

Etcd advertised subnets can be user-defined to specify which addresses etcd advertises, so it's entirely customizable; otherwise Talos just tries to pick a default.
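For illustration, a sketch of pinning the advertised subnets through a machine configuration patch (the `cluster.etcd.advertisedSubnets` field exists in recent Talos versions; all other values are placeholders):

```hcl
data "talos_machine_configuration" "controlplane" {
  cluster_name     = "example"
  cluster_endpoint = "https://10.5.0.2:6443"
  machine_type     = "controlplane"
  machine_secrets  = talos_machine_secrets.this.machine_secrets

  config_patches = [
    yamlencode({
      cluster = {
        etcd = {
          # etcd advertises only addresses within these subnets,
          # so they match what control_plane_nodes expects
          advertisedSubnets = ["10.5.0.0/24"]
        }
      }
    })
  ]
}
```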

smira commented 7 months ago

@frezbo I think the problem is that the underlying Talos health check is not flexible enough for multi-homed clusters: it assumes a single IP per node. This could be fixed, of course.

zargony commented 7 months ago

FYI, I was able to work around my issue (the control plane using IPv6 addresses not known to Terraform) by using `talos_cluster_health` with `control_plane_nodes = []`, which still seems to run some checks (at least it waits for the cluster to become ready when creating the control plane); a sketch follows below.

(Yes, etcd advertised subnets can be configured. I intentionally set them to 2000::/3 on my cloud servers, since IPv4 might not always be available on all of them.)
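A minimal sketch of that workaround, assuming a `talos_machine_secrets` resource named `this` and a placeholder endpoint IP:

```hcl
data "talos_cluster_health" "this" {
  client_configuration = talos_machine_secrets.this.client_configuration
  endpoints            = ["10.5.0.2"]

  # Empty list: skips the per-node etcd membership check while the
  # data source still waits for the cluster to report ready
  control_plane_nodes = []
}
```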