Open walnuss0815 opened 1 month ago
I had a similar problem and solved it this way: I do not use `cluster_health`. I just have a Terraform `time_sleep` of 90s and then try to apply Cilium and Argo CD. I do not wait until the cluster is ready; if it is not ready after that time, something has gone wrong anyway, and it doesn't matter whether it fails at the CNI/Argo apply or at the health check.
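The fixed-delay approach described above can be sketched roughly like this (a minimal sketch using the hashicorp/time provider; the resource names and the `argocd` module path are illustrative, not from the comment):

```terraform
# Sketch: fixed 90s delay instead of a health check.
# Assumes the hashicorp/time provider; names are placeholders.
resource "time_sleep" "wait_for_cluster" {
  depends_on      = [talos_machine_bootstrap.this]
  create_duration = "90s"
}

module "argocd" {
  source     = "./argocd" # hypothetical module path
  depends_on = [time_sleep.wait_for_cluster]
}
```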
The CSR not being approved is not a Talos problem; Kubernetes decided to ditch the automatic serving-cert approver. I also used the alex1989hu approver, but then discovered that Talos has a cloud-controller-manager that also ships a CSR approver module. Since I also like to have all/most of my apps managed by Argo, I decided to alter the alex1989hu approver into a `batch/v1` Job that runs once for the first 5 min. This is applied via `cluster.extraManifests`. After that I just use the Argo-managed Talos CCM, but you could also run the alex1989hu approver as an Argo-managed deployment after the initial Job has finished.
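Wiring an extra manifest in via `cluster.extraManifests` could look roughly like this (a hedged sketch; the manifest URL is a placeholder, and the patch is shown as a `config_patches` entry on the Talos provider's apply resource):

```terraform
# Sketch: point Talos at a one-shot approver Job manifest via a
# machine-config patch. The URL below is hypothetical.
resource "talos_machine_configuration_apply" "controlplane" {
  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.this.machine_configuration
  node                        = "192.168.x.y"

  config_patches = [
    yamlencode({
      cluster = {
        extraManifests = [
          "https://example.com/kubelet-csr-approver-job.yaml", # placeholder
        ]
      }
    }),
  ]
}
```

Talos fetches and applies everything listed under `cluster.extraManifests` itself, so the Job lands on the cluster without any external tooling having to wait for the API to come up.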
Hope this helps
We are using the provider to deploy a two-node bare-metal k8s cluster.

We need certificate rotation enabled for the metrics server. The Kubelet Serving Certificate Approver is deployed using Argo CD, and Argo CD is deployed using Terraform right after the Talos cluster has been bootstrapped. Deploying the Kubelet Serving Certificate Approver via `cluster.extraManifests` is not an option for us.

Without `talos_cluster_health`, the deployment of Argo CD fails because the k8s API is not ready. So in our case the health check is only required to ensure that the k8s API is ready for requests.

With `talos_cluster_health`, the health check fails. On the first run it fails with `missing static pods on node`. On the second run it fails with `kubelet server certificate rotation is enabled, but CSR is not approved`.

First run
```
│ Warning: failed checks
│
│   with module.talos.data.talos_cluster_health.this,
│   on .terraform/modules/talos/main.tf line 118, in data "talos_cluster_health" "this":
│  118: data "talos_cluster_health" "this" {
│
│ waiting for etcd to be healthy: ...
│ waiting for etcd to be healthy: 1 error occurred:
│ 	* 192.168.x.y: service is not healthy: etcd
│
│ waiting for etcd to be healthy: OK
│ waiting for etcd members to be consistent across nodes: ...
│ waiting for etcd members to be consistent across nodes: OK
│ waiting for etcd members to be control plane nodes: ...
│ waiting for etcd members to be control plane nodes: OK
│ waiting for apid to be ready: ...
│ waiting for apid to be ready: OK
│ waiting for all nodes memory sizes: ...
│ waiting for all nodes memory sizes: OK
│ waiting for all nodes disk sizes: ...
│ waiting for all nodes disk sizes: OK
│ waiting for no diagnostics: ...
│ waiting for no diagnostics: OK
│ waiting for kubelet to be healthy: ...
│ waiting for kubelet to be healthy: 1 error occurred:
│ 	* 192.168.x.y service "kubelet" not in expected state "Running": current state [Preparing] Running pre state
│
│ waiting for kubelet to be healthy: 1 error occurred:
│ 	* 192.168.x.y: service is not healthy: kubelet
│
│ waiting for kubelet to be healthy: OK
│ waiting for all nodes to finish boot sequence: ...
│ waiting for all nodes to finish boot sequence: OK
│ waiting for all k8s nodes to report: ...
│ waiting for all k8s nodes to report: Get "https://192.168.x.y:6443/api/v1/nodes": dial tcp 192.168.x.y:6443: connect: connection refused
│ waiting for all k8s nodes to report: can't find expected node with IPs ["192.168.x.y"]
│ waiting for all k8s nodes to report: OK
│ waiting for all control plane static pods to be running: ...
│ waiting for all control plane static pods to be running: missing static pods on node 192.168.x.y: [kube-system/kube-apiserver kube-system/kube-controller-manager kube-system/kube-scheduler]
```

Second run
```
│ Warning: failed checks
│
│   with module.talos.data.talos_cluster_health.this,
│   on .terraform/modules/talos/main.tf line 118, in data "talos_cluster_health" "this":
│  118: data "talos_cluster_health" "this" {
│
│ waiting for etcd to be healthy: ...
│ waiting for etcd to be healthy: OK
│ waiting for etcd members to be consistent across nodes: ...
│ waiting for etcd members to be consistent across nodes: OK
│ waiting for etcd members to be control plane nodes: ...
│ waiting for etcd members to be control plane nodes: OK
│ waiting for apid to be ready: ...
│ waiting for apid to be ready: OK
│ waiting for all nodes memory sizes: ...
│ waiting for all nodes memory sizes: OK
│ waiting for all nodes disk sizes: ...
│ waiting for all nodes disk sizes: OK
│ waiting for no diagnostics: ...
│ waiting for no diagnostics: active diagnostics: 192.168.x.y: kubelet server certificate rotation is enabled, but CSR is not approved
```

With the Kubelet Serving Certificate Approver deployed manually after the k8s API is ready, the health check succeeds and Terraform starts deploying Argo CD.
main.tf

```terraform
. . .

resource "talos_machine_bootstrap" "this" {
  depends_on = [talos_machine_configuration_apply.controlplane]

  client_configuration = talos_machine_secrets.this.client_configuration
  node                 = [for k, v in var.node_data.controlplanes : v.ip_address][0]
}

data "talos_cluster_health" "this" {
  depends_on = [talos_machine_bootstrap.this]

  client_configuration   = talos_machine_secrets.this.client_configuration
  control_plane_nodes    = [for k, v in var.node_data.controlplanes : v.ip_address]
  endpoints              = [for k, v in var.node_data.controlplanes : v.ip_address]
  skip_kubernetes_checks = true
}

resource "talos_cluster_kubeconfig" "this" {
  depends_on = [data.talos_cluster_health.this]

  client_configuration = talos_machine_secrets.this.client_configuration
  node                 = [for k, v in var.node_data.controlplanes : v.ip_address][0]
}

. . .
```

Our expectation is that the health check succeeds with kubelet server certificate rotation enabled and the Kubelet Serving Certificate Approver not deployed. Something like a minimal k8s readiness check would also be sufficient in our case.
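A minimal "the k8s API answers requests" check of the kind mentioned above could be sketched as follows, assuming the hashicorp/http provider v3.x (its `insecure` flag and `retry` block) is acceptable; the endpoint address is illustrative:

```terraform
# Sketch: poll kube-apiserver's /readyz instead of the full
# talos_cluster_health check. Assumes hashicorp/http >= 3.x; the
# address is a placeholder. /readyz is reachable unauthenticated
# via the default system:public-info-viewer RBAC binding.
data "http" "kube_apiserver_ready" {
  url      = "https://192.168.x.y:6443/readyz"
  insecure = true # the apiserver cert is not trusted by the runner

  retry {
    attempts     = 30
    min_delay_ms = 5000
  }

  depends_on = [talos_machine_bootstrap.this]

  lifecycle {
    postcondition {
      condition     = self.status_code == 200
      error_message = "kube-apiserver /readyz did not return 200"
    }
  }
}
```

Argo CD could then `depends_on` this data source, which gives roughly the "k8s API is ready for requests" gate described above without any of the node-level checks.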