siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Nodes show up several times in talosctl and etcd, not in `kubectl get nodes` #9145

Closed croneter closed 2 months ago

croneter commented 3 months ago

Bug Report

I somehow botched something with Talos, don't ask me how :-(. Nodes now show up 3x in both talosctl get members and talosctl etcd members list. The Kubernetes cluster seems to work fine in principle. BUT I just upgraded Talos from 1.7.5 to 1.7.6, for example, and that was a pain because the upgrade failed repeatedly, probably because the wrong node entry was selected at random among the 3 entries.

Description

Let me know what else you might need.

Node appearing three times in Talos and etcd:

$ talosctl get members
NODE            NAMESPACE   TYPE     ID                      VERSION   HOSTNAME                MACHINE TYPE   OS               ADDRESSES
192.168.50.50   cluster     Member   talos-control-plane-1   1         talos-control-plane-1   controlplane   Talos (v1.7.6)   ["192.168.50.50"]
192.168.50.50   cluster     Member   talos-control-plane-2   2         talos-control-plane-2   controlplane   Talos (v1.7.6)   ["192.168.50.51"]
192.168.50.50   cluster     Member   talos-control-plane-3   2         talos-control-plane-3   controlplane   Talos (v1.7.6)   ["192.168.50.52"]
192.168.50.50   cluster     Member   talos-worker-node-1     2         talos-worker-node-1     worker         Talos (v1.7.6)   ["192.168.50.60"]
192.168.50.51   cluster     Member   talos-control-plane-1   1         talos-control-plane-1   controlplane   Talos (v1.7.6)   ["192.168.50.50"]
192.168.50.51   cluster     Member   talos-control-plane-2   1         talos-control-plane-2   controlplane   Talos (v1.7.6)   ["192.168.50.51"]
192.168.50.51   cluster     Member   talos-control-plane-3   2         talos-control-plane-3   controlplane   Talos (v1.7.6)   ["192.168.50.52"]
192.168.50.51   cluster     Member   talos-worker-node-1     1         talos-worker-node-1     worker         Talos (v1.7.6)   ["192.168.50.60"]
192.168.50.52   cluster     Member   talos-control-plane-1   1         talos-control-plane-1   controlplane   Talos (v1.7.6)   ["192.168.50.50"]
192.168.50.52   cluster     Member   talos-control-plane-2   1         talos-control-plane-2   controlplane   Talos (v1.7.6)   ["192.168.50.51"]
192.168.50.52   cluster     Member   talos-control-plane-3   1         talos-control-plane-3   controlplane   Talos (v1.7.6)   ["192.168.50.52"]
192.168.50.52   cluster     Member   talos-worker-node-1     1         talos-worker-node-1     worker         Talos (v1.7.6)   ["192.168.50.60"]
192.168.50.60   cluster     Member   talos-control-plane-1   1         talos-control-plane-1   controlplane   Talos (v1.7.6)   ["192.168.50.50"]
192.168.50.60   cluster     Member   talos-control-plane-2   2         talos-control-plane-2   controlplane   Talos (v1.7.6)   ["192.168.50.51"]
192.168.50.60   cluster     Member   talos-control-plane-3   2         talos-control-plane-3   controlplane   Talos (v1.7.6)   ["192.168.50.52"]
192.168.50.60   cluster     Member   talos-worker-node-1     1         talos-worker-node-1     worker         Talos (v1.7.6)   ["192.168.50.60"]

$ talosctl etcd members list
WARNING: 1 error occurred:
    * 192.168.50.60: rpc error: code = Unimplemented desc = member list is only available on control plane nodes

NODE            ID                 HOSTNAME                PEER URLS                    CLIENT URLS                  LEARNER
192.168.50.51   1ec42e0bea7e4436   talos-control-plane-2   https://192.168.50.51:2380   https://192.168.50.51:2379   false
192.168.50.51   2fda7882f7499c96   talos-control-plane-1   https://192.168.50.50:2380   https://192.168.50.50:2379   false
192.168.50.51   d5f203964465a0fa   talos-control-plane-3   https://192.168.50.52:2380   https://192.168.50.52:2379   false
192.168.50.52   1ec42e0bea7e4436   talos-control-plane-2   https://192.168.50.51:2380   https://192.168.50.51:2379   false
192.168.50.52   2fda7882f7499c96   talos-control-plane-1   https://192.168.50.50:2380   https://192.168.50.50:2379   false
192.168.50.52   d5f203964465a0fa   talos-control-plane-3   https://192.168.50.52:2380   https://192.168.50.52:2379   false
192.168.50.50   1ec42e0bea7e4436   talos-control-plane-2   https://192.168.50.51:2380   https://192.168.50.51:2379   false
192.168.50.50   2fda7882f7499c96   talos-control-plane-1   https://192.168.50.50:2380   https://192.168.50.50:2379   false
192.168.50.50   d5f203964465a0fa   talos-control-plane-3   https://192.168.50.52:2380   https://192.168.50.52:2379   false

$ talosctl etcd status
WARNING: 1 error occurred:
    * 192.168.50.60: rpc error: code = Unimplemented desc = etcd status is only available on control plane nodes

NODE            MEMBER             DB SIZE   IN USE           LEADER             RAFT INDEX   RAFT TERM   RAFT APPLIED INDEX   LEARNER   ERRORS
192.168.50.52   d5f203964465a0fa   80 MB     27 MB (33.84%)   1ec42e0bea7e4436   2368629      10          2368629              false     
192.168.50.50   2fda7882f7499c96   80 MB     27 MB (33.88%)   1ec42e0bea7e4436   2368629      10          2368629              false     
192.168.50.51   1ec42e0bea7e4436   80 MB     27 MB (33.88%)   1ec42e0bea7e4436   2368629      10          2368629              false     

$ kubectl get nodes -o wide
NAME                    STATUS   ROLES           AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
talos-control-plane-1   Ready    control-plane   23d     v1.30.3   192.168.50.50   <none>        Talos (v1.7.6)   6.6.43-talos     containerd://1.7.18
talos-control-plane-2   Ready    control-plane   2d23h   v1.30.3   192.168.50.51   <none>        Talos (v1.7.6)   6.6.43-talos     containerd://1.7.18
talos-control-plane-3   Ready    control-plane   2d23h   v1.30.3   192.168.50.52   <none>        Talos (v1.7.6)   6.6.43-talos     containerd://1.7.18
talos-worker-node-1     Ready    <none>          23d     v1.30.3   192.168.50.60   <none>        Talos (v1.7.6)   6.6.43-talos     containerd://1.7.18

Removing a node with talosctl -n 192.168.50.52 reset results in errors about 192.168.50.52 not being found, and I have no way to e.g. delete it in etcd.

$ talosctl get members
NODE            NAMESPACE   TYPE     ID                      VERSION   HOSTNAME                MACHINE TYPE   OS               ADDRESSES
192.168.50.50   cluster     Member   talos-control-plane-1   1         talos-control-plane-1   controlplane   Talos (v1.7.6)   ["192.168.50.50"]
192.168.50.50   cluster     Member   talos-control-plane-2   1         talos-control-plane-2   controlplane   Talos (v1.7.6)   ["192.168.50.51"]
192.168.50.50   cluster     Member   talos-worker-node-1     1         talos-worker-node-1     worker         Talos (v1.7.6)   ["192.168.50.60"]
192.168.50.51   cluster     Member   talos-control-plane-1   1         talos-control-plane-1   controlplane   Talos (v1.7.6)   ["192.168.50.50"]
192.168.50.51   cluster     Member   talos-control-plane-2   1         talos-control-plane-2   controlplane   Talos (v1.7.6)   ["192.168.50.51"]
192.168.50.51   cluster     Member   talos-worker-node-1     1         talos-worker-node-1     worker         Talos (v1.7.6)   ["192.168.50.60"]
192.168.50.60   cluster     Member   talos-control-plane-1   1         talos-control-plane-1   controlplane   Talos (v1.7.6)   ["192.168.50.50"]
192.168.50.60   cluster     Member   talos-control-plane-2   2         talos-control-plane-2   controlplane   Talos (v1.7.6)   ["192.168.50.51"]
192.168.50.60   cluster     Member   talos-worker-node-1     1         talos-worker-node-1     worker         Talos (v1.7.6)   ["192.168.50.60"]
1 error occurred:
    * rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 192.168.50.52:50000: connect: no route to host"

$ talosctl etcd members list
WARNING: 2 errors occurred:
    * 192.168.50.52: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 192.168.50.52:50000: connect: no route to host"
    * 192.168.50.60: rpc error: code = Unimplemented desc = member list is only available on control plane nodes

NODE            ID                 HOSTNAME                PEER URLS                    CLIENT URLS                  LEARNER
192.168.50.50   1ec42e0bea7e4436   talos-control-plane-2   https://192.168.50.51:2380   https://192.168.50.51:2379   false
192.168.50.50   2dfcd3cad6452bae   talos-control-plane-1   https://192.168.50.50:2380   https://192.168.50.50:2379   false
192.168.50.51   1ec42e0bea7e4436   talos-control-plane-2   https://192.168.50.51:2380   https://192.168.50.51:2379   false
192.168.50.51   2dfcd3cad6452bae   talos-control-plane-1   https://192.168.50.50:2380   https://192.168.50.50:2379   false

Logs

etcd log attached: etcd.log.txt

Environment

smira commented 3 months ago

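You specified all nodes in talosctl config nodes, which is not recommended. As a result, you get a mix of results from multiple nodes at once: each queried node reports the full member list, so every member appears once per node.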
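
One way to confirm that the repeated rows are the same members reported by several nodes is to collapse the output by its ID column. The following is a minimal sketch, not part of talosctl: summarize_members is a hypothetical helper name, and it assumes the column layout shown in the report (NODE first, TYPE third, ID fourth).

```shell
# Hypothetical helper (not a talosctl feature): reads `talosctl get members`
# output on stdin and prints each distinct member ID once, followed by the
# number of queried nodes that reported it.
# Assumes the column layout from the report: NODE, NAMESPACE, TYPE, ID, ...
summarize_members() {
  awk '
    # Skip the header row and any warning/error lines; keep only Member rows.
    $1 != "NODE" && $3 == "Member" { seen[$4]++ }
    END { for (id in seen) print id, seen[id] }
  ' | sort
}
```

Piping the first listing from the report through it (talosctl get members | summarize_members) should print four lines, one per member, each with a count of 4, because four nodes were queried and each returned the same four members.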

croneter commented 3 months ago

Ok thanks @smira. How do I fix that?

smira commented 3 months ago

https://www.talos.dev/v1.7/introduction/getting-started/#understand-talosctl-endpoints-and-nodes

The easiest thing you can do is talosctl config node A.B.C.D
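
As a rough sketch of that suggestion, the wrappers below either store a single node in the current talosctl context or target one node for a single command. The function names use_single_node and members_on are hypothetical and only for illustration; the underlying calls are talosctl config node and the global --nodes flag, and the IP address you pass is whatever node you want to query.

```shell
# Hypothetical wrapper: store exactly one node in the current talosctl
# context, replacing a previously configured multi-node list.
use_single_node() {
  node_ip="$1"
  if [ -z "$node_ip" ]; then
    echo "usage: use_single_node <node-ip>" >&2
    return 1
  fi
  talosctl config node "$node_ip"
}

# Hypothetical wrapper: query one specific node for a single command
# without changing the stored configuration.
members_on() {
  talosctl --nodes "$1" get members
}
```

For example, use_single_node 192.168.50.50 followed by talosctl get members should then return one set of rows from that node only, and members_on 192.168.50.51 asks a different node without touching the saved configuration.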

croneter commented 3 months ago

Sorry @smira, this is not really helping yet. I'm trying to re-join 192.168.50.52 (see above, I removed the node from the cluster). But talosctl apply-config -n 192.168.50.52 --file controlplane.yaml --insecure is not working: after a reboot, the node returns to Stage: Maintenance.

Could you be so kind and elaborate a bit more?

smira commented 2 months ago

Let's not mix several things as one issue.

Please open a discussion with a question. If apply-config doesn't work and the machine returns to maintenance, there should be something in the logs that explains why, but I can't guess.

I'm going to close this issue, please open a new one if the problem is still there.