siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Improve `talosctl get members` and `talosctl health` #7759

Closed steverfrancis closed 3 months ago

steverfrancis commented 1 year ago

Feature Request

Make `talosctl get members` show whether KubeSpan is enabled and, if so, show basic status. Include this in `talosctl health`.

Description

KubeSpan is fabulous, but it makes troubleshooting a bit harder because it is mostly transparent. If you forget to allow UDP traffic, for example, this is not evident from the usual `talosctl` commands.

So in the case of `talosctl get members`, check whether KubeSpan is enabled on the `--node`. If so, add a column showing the State column from `talosctl get kubespanpeerstatus`, so the output would be something like:

talosctl --talosconfig talosconfig get members
NODE             NAMESPACE   TYPE     ID                 VERSION   HOSTNAME                                      MACHINE TYPE   OS               KUBESPAN   ADDRESSES
54.189.199.144   cluster     Member   ip-172-31-0-51     1         ip-172-31-0-51.us-west-2.compute.internal     controlplane   Talos (v1.5.0)   up         ["172.31.0.51","fd52:546c:f41c:3302:888:b7ff:fe2f:db37"]
54.189.199.144   cluster     Member   ip-172-31-12-36    2         ip-172-31-12-36.us-west-2.compute.internal    controlplane   Talos (v1.5.0)   up         ["172.31.12.36","fd52:546c:f41c:3302:8cf:6ff:fea0:1f0b"]
54.189.199.144   cluster     Member   ip-172-31-13-147   1         ip-172-31-13-147.us-west-2.compute.internal   controlplane   Talos (v1.5.0)   unknown    ["172.31.13.147","fd52:546c:f41c:3302:848:9fff:fe33:fae9"]
54.189.199.144   cluster     Member   talos-iio-wgg      2         talos-iio-wgg                                 worker         Talos (v1.5.2)   up         ["192.168.64.62","fd2a:59c8:2c5f:e2bc:f0d1:fbff:fe3f:7e99","fd52:546c:f41c:3302:f0d1:fbff:fe3f:7e99"]

If it's architecturally too cumbersome to display data from different sources like that, then it would be fine (but less nice) to simply display the output of `talosctl get kubespanpeerstatus` after the current output of `get members` when KubeSpan is enabled.

Then, include the new output of `get members` at the start of `talosctl health`, so that it is clear why a node may not be responding. Currently, if a node is not connected over KubeSpan, the health output is just `* 192.168.64.62: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 192.168.64.62:50000: i/o timeout"`
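To make the proposed merged view concrete, here is a minimal sketch of the join in Python. It is only an illustration, not how `talosctl` works: the `Member` and `PeerStatus` classes and the `members_with_kubespan` function are hypothetical, and it assumes each peer status carries a label that matches the member ID plus a state string.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Member:
    # Hypothetical subset of the columns shown by `talosctl get members`.
    id: str
    machine_type: str


@dataclass
class PeerStatus:
    # Hypothetical shape: assumes a label equal to the member ID and a state string.
    label: str
    state: str


def members_with_kubespan(
    members: List[Member], peers: List[PeerStatus], kubespan_enabled: bool
) -> List[Tuple[str, str, str]]:
    """Return (id, machine type, KubeSpan state) rows for display."""
    if not kubespan_enabled:
        # No KubeSpan on the queried node: leave the extra column empty.
        return [(m.id, m.machine_type, "-") for m in members]
    state_by_label = {p.label: p.state for p in peers}
    # Members without a matching peer entry are reported as "unknown".
    return [
        (m.id, m.machine_type, state_by_label.get(m.id, "unknown"))
        for m in members
    ]


if __name__ == "__main__":
    rows = members_with_kubespan(
        [Member("ip-172-31-0-51", "controlplane"),
         Member("ip-172-31-13-147", "controlplane")],
        [PeerStatus("ip-172-31-0-51", "up")],
        kubespan_enabled=True,
    )
    for row in rows:
        print("{:<20}{:<16}{}".format(*row))
```

Note that the peer list is specific to the queried node, so the extra column would differ depending on which `--node` is asked.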

frezbo commented 1 year ago

I don't think resource outputs should change. This info could probably go in the dashboard instead of `get`, maybe in a new pane.

andrewrynhard commented 1 year ago

It's starting to feel to me like we could benefit from completely internal resources and then have public ones that can be freely updated to present things in a more user friendly way without having to change internal logic.

steverfrancis commented 1 year ago

That makes sense - we want resources to be limited and specific. But we also want user friendly outputs, which will often involve multiple resources.

andrewrynhard commented 1 year ago

> That makes sense - we want resources to be limited and specific.

> But we also want user friendly outputs, which will often involve multiple resources.

Exactly. That allows engineers to optimize for their needs and product to optimize for user needs.

smira commented 1 year ago

I think `members` has a different meaning: it should always show the same information across all nodes in the cluster (and it's a problem if it doesn't).

KubeSpan status affects a pair of nodes, so in an unhealthy KubeSpan setup this output will always differ across nodes.

Adding a check to `talosctl health` makes more sense to me, but even that is not trivial: if KubeSpan is down, `talosctl health` might not be able to reach all the nodes.

I almost think we should have a way in health to diagnose and offer suggestions on most of the problems. Instead of reporting something as 'not okay', we should be more specific about the error and suggest a solution or point to the exact problem.
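As a rough illustration of what "be more specific and suggest a solution" might look like, here is a minimal Python sketch that turns the dial timeout quoted earlier in this issue into a hint. The `suggest` function is hypothetical and not part of `talosctl`, and the UDP port 51820 is an assumption about the default KubeSpan port.

```python
import re
from typing import Optional

# Assumption: default KubeSpan (WireGuard) UDP port.
KUBESPAN_UDP_PORT = 51820

# Matches the dial timeout seen in the health output quoted in this issue.
_DIAL_TIMEOUT = re.compile(r"dial tcp (\S+):(\d+): i/o timeout")


def suggest(error: str, kubespan_enabled: bool) -> Optional[str]:
    """Return a human-readable hint for a known error, or None if unrecognized."""
    match = _DIAL_TIMEOUT.search(error)
    if match is None:
        return None
    host, port = match.group(1), match.group(2)
    if kubespan_enabled:
        return (
            f"{host}: connection to port {port} timed out. KubeSpan is enabled, "
            f"so check that UDP port {KUBESPAN_UDP_PORT} is allowed between nodes "
            "and inspect `talosctl get kubespanpeerstatus`."
        )
    return f"{host}: connection to port {port} timed out. Check routing and firewall rules."


if __name__ == "__main__":
    err = (
        'rpc error: code = Unavailable desc = connection error: desc = '
        '"transport: Error while dialing: dial tcp 192.168.64.62:50000: i/o timeout"'
    )
    print(suggest(err, kubespan_enabled=True))
```

A table of recognized patterns like this keeps the raw error available while attaching a likely cause. It does not address the harder problem that the health check itself may be unable to reach nodes when KubeSpan is down.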

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 3 months ago

This issue was closed because it has been stalled for 7 days with no activity.