siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Can't apply new config with changes on etcd to worker nodes #8465

Open vhurtevent opened 5 months ago

vhurtevent commented 5 months ago

Bug Report

Description

On an already deployed cluster where the same cluster config is applied to all nodes (control plane and workers), each with its own machine config, I can't apply new etcd configuration to all nodes, even though I know it isn't useful for the workers.

My Terraform code applies the cluster config to all nodes through terraform-provider-talos. When I make changes to the scheduler or controller manager configs, they're applied successfully to all nodes, but with etcd config keys I get the error message:

key/cert combination should not be empty

When I try to manually apply the change using talosctl edit machineconfig, I get the same error message.

Is this a bug? Or do I need to refactor my Terraform code to apply the cluster config to control plane nodes only?

Environment

smira commented 5 months ago

Or do I need to refactor my Terraform code to apply the cluster config to control plane nodes only?

yes, exactly!
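
For anyone landing on this later: a minimal sketch of the split, assuming the etcd settings are moved into a patch that is applied only to the control plane machine configs (via terraform-provider-talos config patches or talosctl). The cluster.etcd keys shown are illustrative, not the reporter's actual change:

# controlplane-etcd-patch.yaml — apply only to control plane machine configs
cluster:
  etcd:
    extraArgs:                   # illustrative etcd flag; substitute your own keys
      election-timeout: "5000"
# Worker machine configs keep the rest of the shared cluster config,
# but must not contain any cluster.etcd keys.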

vhurtevent commented 5 months ago

Hello @smira !

Thank you for your quick answer. I'll change my TF code.

For my information, what is the root cause? Is it the key/cert of the etcd CA, which does not exist on worker nodes?

frezbo commented 5 months ago

For my information, what is the root cause? Is it the key/cert of the etcd CA, which does not exist on worker nodes?

Yes, cluster.etcd is not a valid field for worker nodes; Talos v1.7 has a better error message.

vhurtevent commented 5 months ago

Just to share: I removed the etcd keys so they are applied only to the control plane, as I figured out that other cluster config keys are useful even for worker nodes, such as:

cluster:
  proxy:
    disabled: true

In our initial config, kube-proxy is disabled and Cilium is the full CNI; removing this config key from worker nodes breaks networking between pods/services.
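
For reference, a minimal sketch of the cluster-level keys kept on all nodes for this kube-proxy-less Cilium setup (this mirrors what the Cilium-on-Talos guides generally suggest; the exact keys may differ for your cluster):

cluster:
  network:
    cni:
      name: none       # Talos deploys no CNI; Cilium is installed separately
  proxy:
    disabled: true     # kube-proxy replaced by Cilium's kube-proxy replacement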

Is there any documentation about which config keys apply to each node kind?

CompPhy commented 5 months ago

I just ran into the same problem and it took me way too long to get here... It seems this error also prevents Talos from parsing guestinfo.talos.config correctly during the boot process.

In my case, I'm trying to bootstrap a new VMware worker node and it was doing a bunch of weird things on bootstrap. For example, I'm setting static IP assignments and this error was causing it to default to DHCP. It also wasn't picking up the hostname field.
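
For context, a minimal sketch of the machine-level network settings passed via guestinfo.talos.config in a setup like this — static address plus hostname instead of DHCP (interface name and addresses are placeholders):

machine:
  network:
    hostname: worker-01            # placeholder hostname
    interfaces:
      - interface: eth0            # placeholder; match the VM's NIC name
        dhcp: false
        addresses:
          - 10.0.0.21/24           # placeholder static address
        routes:
          - network: 0.0.0.0/0
            gateway: 10.0.0.1      # placeholder default gateway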

I thought the problem was somehow network related and had been poking at govc for hours. After following the instructions here and updating my machine config, everything works properly at boot.

I will note that this changed somewhere between releases, but I didn't see any documentation about it. We originally started this cluster on the 1.4 release and are now up to the 1.6 release. We had templates saved from the original 1.4 install and did not see this behaviour previously. It only showed up when trying to bootstrap directly into 1.6.