siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Etcd bootstrap not working #8955

Closed thoro closed 4 months ago

thoro commented 4 months ago

Bug Report

Description

I'm unable to bootstrap my etcd cluster.

talosctl --talosconfig=./talosconfig -e at-cl02-h03 -n at-cl02-h03 bootstrap

The command exits successfully, but the only output on the corresponding node is the following:

user: warning: [2024-07-02T08:31:08.056453978Z]: [talos] bootstrap request received
user: warning: [2024-07-02T08:31:08.109928978Z]: [talos] service[etcd](Failed): Condition failed: context canceled
user: warning: [2024-07-02T08:31:08.195521978Z]: [talos] service[etcd](Finished): Bootstrap requested
user: warning: [2024-07-02T08:31:08.267696978Z]: [talos] service[etcd](Starting): Starting service
user: warning: [2024-07-02T08:31:08.267740978Z]: [talos] service[etcd](Waiting): Waiting for service "cri" to be "up", time sync, network, etcd spec
user: warning: [2024-07-02T08:31:09.268180978Z]: [talos] service[etcd](Waiting): Waiting for etcd spec
user: warning: [2024-07-02T08:31:13.120902978Z]: [talos] task startAllServices (1/1): service "kubelet" to be "up"
user: warning: [2024-07-02T08:31:13.206462978Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occu

The service itself is also stuck in "Waiting for spec".

talosctl get etcdconfigs

NODE          NAMESPACE   TYPE         ID     VERSION   IMAGE
at-cl02-h03   etcd        EtcdConfig   etcd   3         gcr.io/etcd-development/etcd:v3.5.13

talosctl get etcdspecs

NODE   NAMESPACE   TYPE   ID   VERSION   NAME   ADVERTISEDADDRESSES   LISTENPEERADDRESSES   LISTENCLIENTADDRESSES

It seems that the controller never created the etcdspecs, possibly because of the IPs?

Based on the code and the absence of errors, I assume it hits the continue here: https://github.com/siderolabs/talos/blob/cc345c8c9413692148360684390c910de9e94748/internal/app/machined/pkg/controllers/etcd/spec.go#L203

Relevant parts of the controlplane.yaml:

Tried once like this:

    etcd:
        ca:
            crt: ....
            key: ...

And once like this:

    etcd:
        ca:
            crt: ...
            key: ...
        advertisedSubnets:
            - 10.12.2.0/24
        listenSubnets:
            - 10.12.2.0/24

And the node addresses:

talosctl get nodeaddresses

NODE          NAMESPACE   TYPE          ID                      VERSION   ADDRESSES
at-cl02-h03   network     NodeAddress   accumulative            3         ["10.12.1.3/24","10.12.2.3/24"]
at-cl02-h03   network     NodeAddress   accumulative-no-k8s     1         []
at-cl02-h03   network     NodeAddress   accumulative-only-k8s   1         ["10.12.1.3/24","10.12.2.3/24"]
at-cl02-h03   network     NodeAddress   current                 3         ["10.12.1.3/24","10.12.2.3/24"]
at-cl02-h03   network     NodeAddress   current-no-k8s          1         []
at-cl02-h03   network     NodeAddress   current-only-k8s        1         ["10.12.1.3/24","10.12.2.3/24"]
at-cl02-h03   network     NodeAddress   default                 1         ["10.12.2.3/24"]
at-cl02-h03   network     NodeAddress   routed                  3         ["10.12.1.3/24","10.12.2.3/24"]
at-cl02-h03   network     NodeAddress   routed-no-k8s           1         []
at-cl02-h03   network     NodeAddress   routed-only-k8s         1         ["10.12.1.3/24","10.12.2.3/24"]

Logs

support.zip

Environment

thoro commented 4 months ago

Turns out I had mistakenly set podSubnets to a range that includes the node network!

cluster:
  network:
    podSubnets:
    - 10.12.16.0/19

The correct value would have been 10.12.32.0/19.
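The overlap is easy to miss because 10.12.16.0/19 is not aligned on a /19 boundary, so it normalizes to 10.12.0.0/19. The following is a minimal sketch using Python's standard ipaddress module, with the addresses copied from the output above. It assumes that Talos excludes any node address inside podSubnets when picking etcd addresses, which is inferred here from the empty "-no-k8s" entries in the nodeaddresses output and is not taken from the Talos source. The helper name addresses_outside is hypothetical.

```python
import ipaddress

# Node addresses as reported by `talosctl get nodeaddresses` in this issue.
node_addresses = ["10.12.1.3/24", "10.12.2.3/24"]

# strict=False accepts a prefix with host bits set (10.12.16.0/19)
# and normalizes it to the network it actually describes.
pod_subnets = {
    "misconfigured": ipaddress.ip_network("10.12.16.0/19", strict=False),
    "corrected": ipaddress.ip_network("10.12.32.0/19", strict=False),
}


def addresses_outside(subnet, addresses):
    """Return the addresses that do NOT fall inside the given subnet.

    Assumption: this mimics the exclusion of pod-subnet addresses;
    it is an illustration, not the Talos implementation.
    """
    return [a for a in addresses if ipaddress.ip_interface(a).ip not in subnet]


for label, subnet in pod_subnets.items():
    remaining = addresses_outside(subnet, node_addresses)
    print(f"{label}: {subnet} ({subnet[0]} - {subnet[-1]}) leaves {remaining}")
```

With the misconfigured value the subnet spans 10.12.0.0 to 10.12.31.255, so both node addresses are filtered out and nothing is left for etcd to use. With 10.12.32.0/19 both addresses remain.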

I would suggest adding a log line here: https://github.com/siderolabs/talos/blob/cc345c8c9413692148360684390c910de9e94748/internal/app/machined/pkg/controllers/etcd/spec.go#L137

so that this is easier to find.

smira commented 4 months ago

Talos 1.8 already has diagnostics that help in this particular case. Thanks for reporting; 1.8 will make issues like this more obvious.