syself / cluster-api-provider-hetzner

Cluster API Provider Hetzner :rocket: The best way to manage Kubernetes clusters on Hetzner, fully declarative, Kubernetes-native and with self-healing capabilities
https://caph.syself.com
Apache License 2.0

Talos cluster fails to start #874

Closed tmojzes closed 3 months ago

tmojzes commented 1 year ago

/kind bug

What steps did you take and what happened:

I am trying to set up a Kubernetes cluster with Talos on Hetzner, but the cluster fails to start.

Commands used for cluster setup:

Local bootstrap cluster setup:
k3d cluster create --config k3d_config.yaml

kubectl create secret generic hetzner --from-literal=hcloud=$HCLOUD_TOKEN --from-literal=robot-user=$HETZNER_ROBOT_USER --from-literal=robot-password=$HETZNER_ROBOT_PASSWORD

kubectl create secret generic robot-ssh --from-literal=sshkey-name=cluster --from-file=ssh-privatekey=$HETZNER_SSH_PRIV_PATH --from-file=ssh-publickey=$HETZNER_SSH_PUB_PATH
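The two commands above read five environment variables, and an unset variable would silently expand to an empty string. As a minimal sketch, a guard like the one below could run first. `require_env` is a hypothetical helper written for this example, not part of kubectl or clusterctl.

```shell
# Hypothetical helper (not part of kubectl or clusterctl): fail early when any
# of the named environment variables is unset or empty.
require_env() {
  missing=0
  for var in "$@"; do
    # Indirect lookup of the variable whose name is stored in $var.
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

It could be called as `require_env HCLOUD_TOKEN HETZNER_ROBOT_USER HETZNER_ROBOT_PASSWORD HETZNER_SSH_PRIV_PATH HETZNER_SSH_PUB_PATH` before creating the secrets, continuing only if it returns 0.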

Patch the created secrets so they are automatically moved to the target cluster later.

kubectl patch secret hetzner -p '{"metadata":{"labels":{"clusterctl.cluster.x-k8s.io/move":""}}}'

kubectl patch secret robot-ssh -p '{"metadata":{"labels":{"clusterctl.cluster.x-k8s.io/move":""}}}'

kubectl apply -f talos-cluster.yaml
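The labeling step can also be wrapped in a small function so the same patch is applied to every secret. This is only a sketch: the function names and the `KUBECTL` override are assumptions made for this example, while the label key and patch body are taken from the commands above.

```shell
# Label key used in the patch commands above.
MOVE_LABEL='clusterctl.cluster.x-k8s.io/move'

# Print the merge patch that adds the move label with an empty value.
move_label_patch() {
  printf '{"metadata":{"labels":{"%s":""}}}' "$MOVE_LABEL"
}

# Patch every secret named on the command line.
# KUBECTL is a hypothetical override (e.g. KUBECTL=echo) for a dry run.
label_secrets_for_move() {
  for name in "$@"; do
    "${KUBECTL:-kubectl}" patch secret "$name" -p "$(move_label_patch)" || return 1
  done
}
```

Calling `label_secrets_for_move hetzner robot-ssh` would patch both secrets. Afterwards, `kubectl get secrets -l clusterctl.cluster.x-k8s.io/move` should list them if the label was applied.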

**Capi controller version: 1.5.0**
**Logs:**
```bash
2023/08/25 12:47:59 http: TLS handshake error from 10.42.3.0:48038: EOF
2023/08/25 12:47:59 http: TLS handshake error from 10.42.3.0:48052: EOF
E0825 12:48:06.960220       1 controller.go:324] "Reconciler error" err="failed to create cluster accessor: error creating client for remote cluster \"default/talos-cluster\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: client rate limiter Wait returned an error: context deadline exceeded - error from a previous attempt: EOF" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/talos-cluster-md-0-677674857cxxnncw-hs9lq" namespace="default" name="talos-cluster-md-0-677674857cxxnncw-hs9lq" reconcileID=bfad2f24-36d3-4425-b7ba-cce46b8374c6
I0825 12:48:06.960701       1 machine_controller_phases.go:280] "Infrastructure provider has completed machine infrastructure provisioning and reports status.ready" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/talos-cluster-md-0-677674857cxxnncw-hs9lq" namespace="default" name="talos-cluster-md-0-677674857cxxnncw-hs9lq" reconcileID=023436fa-7eda-44bb-99aa-24fb95dd75c9 MachineSet="default/talos-cluster-md-0-677674857cxxnncw" MachineDeployment="default/talos-cluster-md-0" Cluster="default/talos-cluster" HCloudMachine="default/talos-cluster-md-0-mjn5z"
2023/08/25 12:48:06 http: TLS handshake error from 10.42.3.0:46958: EOF
```

**Caph controller version: v1.0.0-beta.19**
**Logs:**

{"level":"INFO","time":"2023-08-25T12:46:44.877Z","file":"controllers/hetznercluster_controller.go:110","message":"Cluster Controller has not yet set OwnerRef","controller":"hetznercluster","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerCluster","HetznerCluster":{"name":"talos-cluster","namespace":"default"},"namespace":"default","name":"talos-cluster","reconcileID":"2040d212-6038-49e8-b4fe-6d309c556c2c","HetznerCluster":{"name":"talos-cluster","namespace":"default"},"Cluster":{"name":""}}
{"level":"INFO","time":"2023-08-25T12:46:48.535Z","file":"controllers/hcloudmachinetemplate_controller.go:66","message":"HCloudMachineTemplate is missing cluster label or cluster does not exist default/talos-cluster-control-plane","controller":"hcloudmachinetemplate","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HCloudMachineTemplate","HCloudMachineTemplate":{"name":"talos-cluster-control-plane","namespace":"default"},"namespace":"default","name":"talos-cluster-control-plane","reconcileID":"46a5bb12-90b2-4098-a20c-6f97278297a2","HCloudMachineTemplate":{"name":"talos-cluster-control-plane","namespace":"default"}

**Cabpt controller version: v0.6.1**
**Logs:**

2023-08-25T13:01:49Z    INFO    controllers.TalosConfig.cabpt-controller.namespace=default.talosconfig=talos-cluster-control-plane-6x28h.owner-name=talos-cluster-control-plane-jwrj5   updating talosconfig    {"endpoints": ["168.119.55.198", "2a01:4f8:c012:1c95::1"], "secret": "talos-cluster-talosconfig"}
2023-08-25T13:01:59Z    INFO    controllers.TalosConfig.cabpt-controller.namespace=default.talosconfig=talos-cluster-control-plane-6x28h.owner-name=talos-cluster-control-plane-jwrj5   ignoring an already ready config
2023-08-25T13:01:59Z    INFO    controllers.TalosConfig.cabpt-controller.namespace=default.talosconfig=talos-cluster-control-plane-6x28h.owner-name=talos-cluster-control-plane-jwrj5   updating talosconfig    {"endpoints": ["168.119.55.198", "2a01:4f8:c012:1c95::1"], "secret": "talos-cluster-talosconfig"}
2023-08-25T13:04:50Z    INFO    controllers.TalosConfig.cabpt-controller.namespace=default.talosconfig=talos-cluster-control-plane-6x28h.owner-name=talos-cluster-control-plane-jwrj5   ignoring an already ready config
2023-08-25T13:04:50Z    INFO    controllers.TalosConfig.cabpt-controller.namespace=default.talosconfig=talos-cluster-control-plane-6x28h.owner-name=talos-cluster-control-plane-jwrj5   updating talosconfig    {"endpoints": ["168.119.55.198", "2a01:4f8:c012:1c95::1"], "secret": "talos-cluster-talosconfig"}

**Cacppt version: v0.5.2**
**Logs:**

2023-08-25T13:08:28Z    INFO    reconcile TalosControlPlane {"controller": "taloscontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "TalosControlPlane", "TalosControlPlane": {"name":"talos-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "talos-cluster-control-plane", "reconcileID": "96fb04ed-8e3a-446e-8b85-74096f9bce67", "cluster": "talos-cluster"}
2023-08-25T13:08:28Z    INFO    controllers.TalosControlPlane   verifying etcd health on all nodes  {"node": "talos-cluster-control-plane-jwrj5"}
2023-08-25T13:08:28Z    INFO    controllers.TalosControlPlane   attempting to set control plane status
2023-08-25T13:08:39Z    INFO    controllers.TalosControlPlane   failed to get kubeconfig for the cluster    {"error": "failed to create cluster accessor: error creating client for remote cluster \"default/talos-cluster\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: client rate limiter Wait returned an error: context deadline exceeded - error from a previous attempt: EOF", "errorVerbose": "failed to get API group resources: unable to retrieve the complete list of server APIs: v1: client rate limiter Wait returned an error: context deadline exceeded - error from a previous attempt: EOF\nerror creating client for remote cluster \"default/talos-cluster\": error getting rest mapping\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).createClient\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:396\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).newClusterAccessor\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:299\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).getClusterAccessor\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:273\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).GetClient\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:180\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).updateStatus\n\t/src/controllers/taloscontrolplane_controller.go:562\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile.func1\n\t/src/controllers/taloscontrolplane_controller.go:155\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile\n\t/src/controllers/taloscontrolplane_controller.go:184\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/toolchain/go/src/runtime/asm_amd64.s:1598\nfailed to create cluster accessor\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).getClusterAccessor\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:275\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).GetClient\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:180\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).updateStatus\n\t/src/controllers/taloscontrolplane_controller.go:562\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile.func1\n\t/src/controllers/taloscontrolplane_controller.go:155\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile\n\t/src/controllers/taloscontrolplane_controller.go:184\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/toolchain/go/src/runtime/asm_amd64.s:1598"}
2023-08-25T13:08:39Z    INFO    controllers.TalosControlPlane   successfully updated control plane status   {"namespace": "default", "talosControlPlane": "talos-cluster-control-plane", "cluster": "talos-cluster"}
2023-08-25T13:08:39Z    INFO    reconcile TalosControlPlane {"controller": "taloscontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "TalosControlPlane", "TalosControlPlane": {"name":"talos-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "talos-cluster-control-plane", "reconcileID": "37eba9e8-0f3b-4958-9a42-443fb9cb57d0", "cluster": "talos-cluster"}
2023-08-25T13:08:39Z    INFO    controllers.TalosControlPlane   verifying etcd health on all nodes  {"node": "talos-cluster-control-plane-jwrj5"}
2023-08-25T13:08:39Z    INFO    controllers.TalosControlPlane   attempting to set control plane status

Talos logs: Screenshot from 2023-08-25 14-21-44

What did you expect to happen: A working cluster that can be reached with kubectl and talosctl on the load balancer's public IP.

lieberlois commented 1 year ago

@tmojzes Did you get this running?

tmojzes commented 1 year ago

@lieberlois Unfortunately not. I tried again today with the latest versions of the providers, but it failed as before. Have you tried it yourself?

lieberlois commented 1 year ago

@tmojzes Tried it as well, didn't work. I also didn't get other bootstrap providers (k3s in my case) running with this Hetzner infrastructure provider.

guettli commented 8 months ago

Unofficial feedback from me (Syself employee): We currently see no benefit in supporting Talos. I personally like it, but overall we are happy with kubeadm and Debian/Ubuntu-based images.

guettli commented 3 months ago

Dear Talos friends. Feel free to create a new project on GitHub which explains how to use caph together with Talos. We (Syself) won't invest time in this in the coming months.

Remember: "Yes" is forever, and "no" is temporary.

If you provide good docs on how to do that, then we might switch.

AFAIK the Go code of caph does not need to be changed to support Talos.

If you have particular issues with using the Talos bootstrap provider together with caph, then please open a new issue. Thank you.